This article will show how to increase the performance of a rather simple algorithm up to 80%, by showing how we can accept a loss in absolutely accuracy, and also taking advantage of how processors work.

Introduction

When we talk about multiple threads using a shared resource, often the first thing we think of is how to synchronize access to that resource so that it is always in a valid, deterministic state. There are, however, many times when the exact opposite approach is desired. Sometimes it’s appropriate to sacrifice accuracy in favor of performance.

To demonstrate the examples in this article, I developed a small test driver that will put our various test classes through their paces on multiple threads and measure a couple of key factors: elapsed time and error rate. To see the code for this driver, download the accompanying sample project.

Histogram

Recently, I needed to implement something that, reduced to essentials, is a histogram. It needed an array of counts and an integer tracking how many samples had been added. Here’s a simplified version of just the histogram functionality:

class Histogram
{
    protected long sampleCount;
    protected ushort[] sampleValueCounts;

    public virtual long SampleCount { get { return sampleCount; } }
    public ushort[] SampleValueCounts { get { return sampleValueCounts; } }

    public Histogram(int range)
    {
        // + 1 to allow 0 - range, inclusive
        this.sampleValueCounts = new ushort[range + 1];
    }

    public virtual void AddSample(int value)
    {
        ++sampleValueCounts[value];
        ++sampleCount;
    }
}

Pretty simple, right?

Now let’s add the requirement that this needs to be access from multiple threads. How is that going to affect the implementation? How do we ensure that sampleCount and the sum of sampleValueCounts[] stay in sync?

The obvious solution is to add a lock in AddSample:

lock(sampleLock)
{
    ++sampleValueCounts[value];
    ++sampleCount;
}

Let’s add another requirement: performance. Does the locked version of AddSample perform well enough? Maybe—we should absolutely measure and find out. In my case, I knew some additional details in how the system worked and had past experience to think that I may want to reconsider the lock.

I knew that there would be many thousands of instances of the Histogram and that every second, hundreds to thousands of them would need to be updated. This could be very hot code.

AddSample is a very cheap function—two increments! By adding a lock, even one that most likely runs completely in user mode, I may have significantly increased the time it takes to execute this method. We’ll see an actual measurement below, and it’s not quite that bad in this case, but we can do better.

Lock first to Ensure Correctness

You should always try it first with a lock—make it correct first, then optimize. If your performance is fine with the lock, then you don’t need to go any further—just stop and work on something more important.

Before we add a lock, though, let’s see how the basic histogram performs with no locking and multiple threads:

Type: Histogram

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:10.0959830

SampleCount: 13,641,046 (Error: -86.359%)

Sum: 99,969,354 (Error: -0.031%)

On my 4 core machine, this took about 10 seconds. The sum of the histogram values was very close to the number of values we tried to add—definitely within our tolerance. Unfortunately, SampleCount is way off.

Intuitively, this makes sense. SampleCount is updated on every single sample, so between all the threads, there is ample opportunity to race and obliterate this value. But the array—not so much. Since I’m testing with a random distribution, there is very low likelihood that there will be a conflict in a particular slot.

Let’s see what happens when we add a lock:

class HistogramLocked : Histogram
{
    private object addLock = new object();

    public HistogramLocked(int range) : base(range) { }

    public override void AddSample(int value)
    {
        lock (addLock)
        {
            ++sampleValueCounts[value];
            ++sampleCount;
        }
    }
}

Type: HistogramLocked

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:08.0839173

SampleCount: 100,000,000 (Error: 0.000%)

Sum: 100,000,000 (Error: 0.000%)

Woah, our time decreased! Actually, on my home machine it decreased. On my work machine (which has faster and more processors) the time did actually increase by about 10% (proving the point above: you know nothing until you measure).

In most cases, we should just stop here. We’ve ensured a correct system for very little cost. You would need to have a good reason to try to optimize this further.

Since we saw that there weren’t that many conflicts on the array, what if we just protect the sampleCount increment, and do that with an Interlocked.Increment() method?

class HistogramInterlocked : Histogram
{
    public HistogramInterlocked(int range) : base(range) { }

    public override void AddSample(int value)
    {
        ++sampleValueCounts[value];
        Interlocked.Increment(ref this.sampleCount);
    }
}

Type: HistogramInterlocked

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:10.7659328

SampleCount: 100,000,000 (Error: 0.000%)

Sum: 99,957,294 (Error: -0.043%)

We may as well just use a lock and get 100% accuracy in both counts.

Approximation can be Good Enough

Want to eke out even more performance? To do this without a lock, you need to give up one attribute: correctness. If there are enough samples, then missing a some is fine. You just need to do two things:

Ensure your algorithm can work in the face of slightly incorrect data
Minimize the error as cheaply as possible.

Once you develop an algorithm that approximates a result, you should be able to measure the error rate to validate that the performance gain is worth the loss of accuracy.

Now let’s see if we can get an approximate SampleCount that’s “good enough.”

Instead of having all threads increment a single value, they can all increment their own value. We “bucketize” the sampleCount variable. When we need the real total, we can just add them up, and, in effect, transfer some of the CPU cost to a more rare operation.

class HistogramThreadBuckets : Histogram
{
    private long[] valueBuckets;

    public long[] ValueBuckets { get { return this.valueBuckets; } }

    public override long SampleCount
    {
        get
        {
            long sum = 0;
            for (int i = 0; i < this.valueBuckets.Length; ++i)
            {
                sum += this.valueBuckets[i];
            }
            return sum;
        }
    }

    public HistogramThreadBuckets(int range, int valueBuckets) : base(range) 
    {
        this.valueBuckets = new long[valueBuckets];
    }

    public override void AddSample(int value)
    {
        ++sampleValueCounts[value];

        int bucket = Thread.CurrentThread.ManagedThreadId % this.valueBuckets.Length;
        ++valueBuckets[bucket];            
    }
}

I run this algorithm with multiple bucket counts to see the difference:

Type: HistogramThreadBuckets

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:09.1901657

ValueCount Buckets: 4

SampleCount: 92,941,626 (Error: -7.058%)

Sum: 99,934,033 (Error: -0.066%)

Type: HistogramThreadBuckets

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:06.9367826

ValueCount Buckets: 8

SampleCount: 96,263,770 (Error: -3.736%)

Sum: 99,938,926 (Error: -0.061%)

Type: HistogramThreadBuckets

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:06.6882437

ValueCount Buckets: 16

SampleCount: 98,274,025 (Error: -1.726%)

Sum: 99,954,021 (Error: -0.046%)

Type: HistogramThreadBuckets

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:04.6969067

ValueCount Buckets: 32

SampleCount: 99,934,543 (Error: -0.065%)

Sum: 99,936,526 (Error: -0.063%)

Cool, we just cut our time in half! Something about this should seem weird, though. How can this be significantly faster than the naïve algorithm when neither of them use locks? There’s no contention anyway!

Well, actually, there is. These buckets likely exist in the various caches for multiple processors, and when you modify one, a certain amount of communication goes on between CPUs to resolve conflicts and invalidate the caches. The more variables there are, the less likely this will happen.

Know Your Cache Line Size

Can we do even better?

If we know that all our CPUs are going to need to coordinate among cache lines, can we ensure we have NO conflicts between our threads? If we know the size of a cache line and absolutely guaranteed that each thread went to a different cache line, then we could take complete advantage of the CPU caches.

To get the size of a cache line, use one of many CPU-information tools available on the Internet. I used one freeware utility called CPU-Z. In my case, the cache line is 64-bytes, so I just need to ensure that the buckets holding the values are at least 64 bytes long. Since a long is 8 bytes, I can just pad out a struct to do this (or just multiply the size of the array and use only every nth entry).

The other thing we need to do is ensure there are no collisions between threads on the buckets. In the previous code, I just hashed the ManagedThreadId into the buckets, but this isn’t reliable since we don’t know the thread ids that we’ll get. The solution is just to send in our own “thread id” that we can reliably map to a unique bucket.

class HistogramThreadBucketsCacheLine : Histogram
{
    public struct ValueBucket
    {
        private long _junk0;
        private long _junk1;
        private long _junk2;
        
        public long value;

        private long _junk4;
        private long _junk5;
        private long _junk6;
        private long _junk7;            
    };

    private ValueBucket[] valueBuckets;

    public ValueBucket[] ValueBuckets { get { return this.valueBuckets; } }

    public override long SampleCount
    {
        get
        {
            long sum = 0;
            for (int i = 0; i < this.valueBuckets.Length; ++i)
            {
                sum += this.valueBuckets[i].value;
            }
            return sum;
        }
    }

    public HistogramThreadBucketsCacheLine(int range, int valueBuckets)
        : base(range) 
    {
        this.valueBuckets = new ValueBucket[valueBuckets];
    }

    public override void AddSample(int value, int threadId)
    {
        ++sampleValueCounts[value];

        ++valueBuckets[threadId].value;
    }
}

And the output is:

Type: HistogramThreadBucketsCacheLine

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:02.0215371

SampleCount: 100,000,000 (Error: 0.000%)

Sum: 99,761,090 (Error: -0.239%)

We more than halved it again! Not too shabby! Smile

Summary

By being willing to accept some loss of accuracy, and taking advantage of how processors work, we were able to reduce the running time of a fairly simple and superficially efficient algorithm by 80%. As with all performance investigations, your mileage may vary, and lots depends on the data your manipulating and the hardware it runs on. Optimizations like these should be quite rare, and only in places you absolutely need them.

Download the sample code project.

Like this tip? Check out my book, C # 4.0 How-To, where you’ll finds hundreds of great tips like this one.

Philosophical Geek

Code and musings by Ben Watson

Concurrency, Performance, Arrays and when Dirty Writes are OK