Don’t Log Exception Stacks Unless You Can Afford It

A short, simple tip for this week: Don’t log exception stacks in managed applications unless you understand the performance of your system.

If you have a standard desktop client, or some other app where you can tolerate many-millisecond disruptions or a spike in CPU usage, then you don’t have to care about this.

If you are building a high-performance managed system, then you definitely do have to care about this.

The system I work on needs to handle exceptions coming from 3rd-party components. We can’t let the exceptions kill the process, and we can’t swallow them and ignore the component failure, so we need to log them. The question is—what information do we log?

For managed exceptions, there are three pieces of information that are generally the most useful: the type of the exception, the Message, and the StackTrace.

Getting the type and the Message is nearly free, but accessing StackTrace or calling ToString() on the exception object triggers a bunch of reflection to build up a user-friendly stack trace string. If you can do without it, go for it. It may be possible to augment the Message property of an exception to give some clues to the problem. Usually, however, this will not be possible.

Since getting the stack trace for an exception is relatively expensive, especially in a program that shouldn’t miss a beat and needs to continue running, many of the managed applications I work on have a configuration setting to turn off stack trace logging for exceptions. This lets us either run with it turned on until we see a performance problem, or keep it off until there is a reproducible problem, at which point we can turn it on selectively.
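
Here’s a minimal sketch of what that might look like; the ExceptionLogger class and its LogStackTraces flag are hypothetical stand-ins for whatever logging and configuration plumbing your system already has:

using System;

static class ExceptionLogger
{
    // Hypothetical switch; in a real system this would come from configuration.
    public static bool LogStackTraces { get; set; }

    public static void Log(Exception ex)
    {
        // The exception's type and Message are nearly free to retrieve.
        Console.WriteLine("{0}: {1}", ex.GetType().FullName, ex.Message);

        // Accessing StackTrace kicks off reflection to build the trace
        // string, so only pay that cost when it is explicitly enabled.
        if (LogStackTraces)
        {
            Console.WriteLine(ex.StackTrace);
        }
    }
}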

Like this tip? Check out my book, C# 4.0 How-To, where you’ll find hundreds of great tips like this one.

Concurrency, Performance, Arrays, and When Dirty Writes Are OK

This article will show how to increase the performance of a rather simple algorithm by up to 80%, by accepting a loss in absolute accuracy and by taking advantage of how processors work.

Introduction

When we talk about multiple threads using a shared resource, often the first thing we think of is how to synchronize access to that resource so that it is always in a valid, deterministic state. There are, however, many times when the exact opposite approach is desired. Sometimes it’s appropriate to sacrifice accuracy in favor of performance.

To demonstrate the examples in this article, I developed a small test driver that puts our various test classes through their paces on multiple threads and measures a couple of key factors: elapsed time and error rate. To see the full code for this driver, download the accompanying sample project (a rough sketch of it appears below).

Histogram

Recently, I needed to implement something that, reduced to its essentials, is a histogram. It needed an array of counts and a total tracking how many samples had been added. Here’s a simplified version of just the histogram functionality:

class Histogram
{
    protected long sampleCount;
    protected ushort[] sampleValueCounts;

    public virtual long SampleCount { get { return sampleCount; } }
    public ushort[] SampleValueCounts { get { return sampleValueCounts; } }

    public Histogram(int range)
    {
        // + 1 to allow 0 - range, inclusive
        this.sampleValueCounts = new ushort[range + 1];
    }

    public virtual void AddSample(int value)
    {
        ++sampleValueCounts[value];
        ++sampleCount;
    }
}

Pretty simple, right?
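
For reference, here is a rough sketch of what such a driver might look like; the per-thread Random seeding and output details here are my own assumptions, and the real driver in the sample project also computes the error percentages:

using System;
using System.Diagnostics;
using System.Threading;

static class TestDriver
{
    public static void Run(Histogram histogram, int range, long totalSamples, int threadCount)
    {
        var stopwatch = Stopwatch.StartNew();

        var threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; ++t)
        {
            threads[t] = new Thread(() =>
            {
                // One Random per thread; Random itself is not thread-safe.
                var random = new Random(Thread.CurrentThread.ManagedThreadId);
                for (long i = 0; i < totalSamples / threadCount; ++i)
                {
                    histogram.AddSample(random.Next(range + 1));
                }
            });
            threads[t].Start();
        }

        foreach (var thread in threads)
        {
            thread.Join();
        }

        stopwatch.Stop();
        Console.WriteLine("Elapsed: {0}", stopwatch.Elapsed);
        Console.WriteLine("SampleCount: {0:N0}", histogram.SampleCount);
    }
}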

Now let’s add the requirement that this needs to be accessed from multiple threads. How is that going to affect the implementation? How do we ensure that sampleCount and the sum of sampleValueCounts[] stay in sync?

The obvious solution is to add a lock in AddSample:

lock(sampleLock)
{
    ++sampleValueCounts[value];
    ++sampleCount;
}
    

Let’s add another requirement: performance. Does the locked version of AddSample perform well enough? Maybe—we should absolutely measure and find out. In my case, I knew some additional details about how the system worked, and I had enough past experience to suspect that I might want to reconsider the lock.

I knew that there would be many thousands of instances of the Histogram and that every second, hundreds to thousands of them would need to be updated. This could be very hot code.

AddSample is a very cheap function—two increments! By adding a lock, even one that most likely runs completely in user mode, I may have significantly increased the time it takes to execute this method. We’ll see an actual measurement below, and it’s not quite that bad in this case, but we can do better.

Lock First to Ensure Correctness

You should always try it first with a lock—make it correct first, then optimize. If your performance is fine with the lock, then you don’t need to go any further—just stop and work on something more important.

Before we add a lock, though, let’s see how the basic histogram performs with no locking and multiple threads:

Type: Histogram

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:10.0959830

SampleCount: 13,641,046 (Error: -86.359%)

Sum: 99,969,354 (Error: -0.031%)

On my 4-core machine, this took about 10 seconds. The sum of the histogram values was very close to the number of values we tried to add—definitely within our tolerance. Unfortunately, SampleCount is way off.

Intuitively, this makes sense. SampleCount is updated on every single sample, so between all the threads, there is ample opportunity to race and obliterate this value. But the array—not so much. Since I’m testing with a random distribution, there is very low likelihood that there will be a conflict in a particular slot.
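
The race on SampleCount exists because ++sampleCount is not atomic; it is a read-modify-write, roughly equivalent to this:

// What ++sampleCount really does; two threads can interleave these steps.
long temp = sampleCount;  // thread A and thread B both read, say, 41
temp = temp + 1;          // both compute 42
sampleCount = temp;       // both write 42; one increment is lost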

Let’s see what happens when we add a lock:

class HistogramLocked : Histogram
{
    private object addLock = new object();

    public HistogramLocked(int range) : base(range) { }

    public override void AddSample(int value)
    {
        lock (addLock)
        {
            ++sampleValueCounts[value];
            ++sampleCount;
        }
    }
}

Type: HistogramLocked

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:08.0839173

SampleCount: 100,000,000 (Error: 0.000%)

Sum: 100,000,000 (Error: 0.000%)

Whoa, our time decreased! Actually, on my home machine it decreased; on my work machine (which has more and faster processors), the time increased by about 10% (proving the point above: you know nothing until you measure).

In most cases, we should just stop here. We’ve ensured a correct system for very little cost. You would need to have a good reason to try to optimize this further.

Since we saw that there weren’t that many conflicts on the array, what if we just protect the sampleCount increment, and do that with an Interlocked.Increment() method?

class HistogramInterlocked : Histogram
{
    public HistogramInterlocked(int range) : base(range) { }

    public override void AddSample(int value)
    {
        ++sampleValueCounts[value];
        Interlocked.Increment(ref this.sampleCount);
    }
}

Type: HistogramInterlocked

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:10.7659328

SampleCount: 100,000,000 (Error: 0.000%)

Sum: 99,957,294 (Error: -0.043%)

That is actually slower than the simple lock. We may as well just use a lock and get 100% accuracy in both counts.

Approximation Can Be Good Enough

Want to eke out even more performance? To do this without a lock, you need to give up one attribute: correctness. If there are enough samples, then missing a few is fine. You just need to do two things:

  1. Ensure your algorithm can work in the face of slightly incorrect data
  2. Minimize the error as cheaply as possible.

Once you develop an algorithm that approximates a result, you should be able to measure the error rate to validate that the performance gain is worth the loss of accuracy.

Now let’s see if we can get an approximate SampleCount that’s “good enough.”

Instead of having all threads increment a single value, each thread can increment its own value. We “bucketize” the sampleCount variable. When we need the real total, we can just add the buckets up, transferring, in effect, some of the CPU cost to a much rarer operation.

class HistogramThreadBuckets : Histogram
{
    private long[] valueBuckets;

    public long[] ValueBuckets { get { return this.valueBuckets; } }

    public override long SampleCount
    {
        get
        {
            long sum = 0;
            for (int i = 0; i < this.valueBuckets.Length; ++i)
            {
                sum += this.valueBuckets[i];
            }
            return sum;
        }
    }

    public HistogramThreadBuckets(int range, int valueBuckets) : base(range) 
    {
        this.valueBuckets = new long[valueBuckets];
    }

    public override void AddSample(int value)
    {
        ++sampleValueCounts[value];

        int bucket = Thread.CurrentThread.ManagedThreadId % this.valueBuckets.Length;
        ++valueBuckets[bucket];            
    }
}

I ran this algorithm with multiple bucket counts to see the difference:

Type: HistogramThreadBuckets

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:09.1901657

ValueCount Buckets: 4

SampleCount: 92,941,626 (Error: -7.058%)

Sum: 99,934,033 (Error: -0.066%)

Type: HistogramThreadBuckets

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:06.9367826

ValueCount Buckets: 8

SampleCount: 96,263,770 (Error: -3.736%)

Sum: 99,938,926 (Error: -0.061%)

Type: HistogramThreadBuckets

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:06.6882437

ValueCount Buckets: 16

SampleCount: 98,274,025 (Error: -1.726%)

Sum: 99,954,021 (Error: -0.046%)

Type: HistogramThreadBuckets

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:04.6969067

ValueCount Buckets: 32

SampleCount: 99,934,543 (Error: -0.065%)

Sum: 99,936,526 (Error: -0.063%)

Cool, we just cut our time in half! Something about this should seem weird, though. How can this be significantly faster than the naïve algorithm when neither of them uses locks? There shouldn’t be any contention!

Well, actually, there is. These buckets live in the caches of multiple processors, and when a thread modifies one, a certain amount of communication goes on between CPUs to resolve conflicts and invalidate the caches. Worse, several adjacent buckets fit on a single cache line, so threads contend for the same line even when they touch different buckets (the classic false sharing problem). The more buckets there are, the less likely these conflicts become.

Know Your Cache Line Size

Can we do even better?

Since CPUs coordinate at the granularity of a cache line, can we ensure that our threads have NO conflicts at all? If we know the size of a cache line and can guarantee that each thread writes to a different cache line, we can take complete advantage of the CPU caches.

To get the size of a cache line, use one of the many CPU-information tools available on the Internet; I used a freeware utility called CPU-Z. In my case, the cache line is 64 bytes, so I just need to ensure that each bucket occupies at least 64 bytes. Since a long is 8 bytes, I can pad out a struct to do this (or just multiply the size of the array and use only every nth entry).
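
As a sketch of that second option, here is what the counter half might look like, assuming 64-byte cache lines and 8-byte longs (note that the CLR does not guarantee where in a cache line the array data starts, so the isolation is approximate):

class StridedBuckets
{
    private const int LongsPerCacheLine = 8; // 64-byte line / 8-byte long

    // Only every 8th slot is used, so consecutive logical buckets start
    // 64 bytes apart and never share a cache line with each other.
    private readonly long[] buckets;

    public StridedBuckets(int bucketCount)
    {
        this.buckets = new long[bucketCount * LongsPerCacheLine];
    }

    public void Increment(int threadId)
    {
        ++this.buckets[threadId * LongsPerCacheLine];
    }
}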

The other thing we need to do is ensure there are no collisions between threads on the buckets. In the previous code, I just hashed the ManagedThreadId into the buckets, but this isn’t reliable since we don’t know the thread ids that we’ll get. The solution is just to send in our own “thread id” that we can reliably map to a unique bucket.

class HistogramThreadBucketsCacheLine : Histogram
{
    public struct ValueBucket
    {
        // Pad to 64 bytes (8 longs at 8 bytes each) so that each bucket
        // occupies a full cache line by itself.
        private long _junk0;
        private long _junk1;
        private long _junk2;

        public long value;

        private long _junk4;
        private long _junk5;
        private long _junk6;
        private long _junk7;
    };

    private ValueBucket[] valueBuckets;

    public ValueBucket[] ValueBuckets { get { return this.valueBuckets; } }

    public override long SampleCount
    {
        get
        {
            long sum = 0;
            for (int i = 0; i < this.valueBuckets.Length; ++i)
            {
                sum += this.valueBuckets[i].value;
            }
            return sum;
        }
    }

    public HistogramThreadBucketsCacheLine(int range, int valueBuckets)
        : base(range) 
    {
        this.valueBuckets = new ValueBucket[valueBuckets];
    }

    // Note: this is a new overload rather than an override, since the
    // signature differs from the base class's AddSample(int). The test
    // driver calls it directly, passing a unique zero-based index per thread.
    public void AddSample(int value, int threadId)
    {
        ++sampleValueCounts[value];

        ++valueBuckets[threadId].value;
    }
}

And the output is:

Type: HistogramThreadBucketsCacheLine

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:02.0215371

SampleCount: 100,000,000 (Error: 0.000%)

Sum: 99,761,090 (Error: -0.239%)

We more than halved it again! Not too shabby!

Summary

By being willing to accept some loss of accuracy, and by taking advantage of how processors work, we were able to reduce the running time of a fairly simple and superficially efficient algorithm by 80%. As with all performance investigations, your mileage may vary, and a lot depends on the data you’re manipulating and the hardware it runs on. Optimizations like these should be quite rare, and applied only in the places you absolutely need them.

Download the sample code project.

Like this tip? Check out my book, C# 4.0 How-To, where you’ll find hundreds of great tips like this one.

Keep API Usage Simple

One lesson that I’m appreciating more every week is that old acronym, K.I.S.S. – Keep It Simple, Stupid. Simple code is best. The highest cost of most software is maintenance—and you don’t want maintenance to include a lot of “how the heck does this work?”

This simplicity does not apply just to your own code. Most API frameworks provide multiple ways of doing things, but if you don’t need a feature, don’t use it.

For example, I’ve been working a lot with the Task Parallel Library, and it’s great – it really is an awesome way of doing asynchronous programming. However, when you get to continuations, there are so many options that, if you’re not careful, your code can quickly become a bewildering mix of usage patterns.

You can have continuations called only when the task fails, or is canceled, or only when it succeeds, or only when it does not succeed. You can chain continuations, cancel them, and there are all sorts of scheduling options, and much more.

This API is powerful, and I’m grateful it’s all there, but when you need a large team to work on a single code base, it’s necessary to enforce some simplifying constraints. One thing we’ve done, specifically with the Task Parallel Library, is to require everyone to use just one of the many possible patterns. This way, no matter what part of the code we’re looking at, we understand exactly how Tasks work. It takes the guesswork out and reduces maintenance costs. This kind of discipline is especially critical for complex areas of the code, such as asynchronous programming.
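
As an illustration only (this is not the actual pattern from our code base), a team might standardize on a single continuation helper so that every task completion is handled the same way everywhere:

using System;
using System.Threading.Tasks;

static class TaskConventions
{
    // The one blessed continuation pattern: a single ContinueWith that
    // inspects the completion state, instead of separate OnlyOnFaulted /
    // OnlyOnCanceled / OnlyOnRanToCompletion continuations scattered
    // throughout the code base.
    public static Task WithStandardContinuation(
        this Task task, Action onSuccess, Action<Exception> onFailure)
    {
        return task.ContinueWith(completed =>
        {
            if (completed.IsFaulted)
            {
                onFailure(completed.Exception);
            }
            else if (!completed.IsCanceled)
            {
                onSuccess();
            }
        });
    }
}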

Takeaway: you don’t have to use every possible API feature or pattern—just pick one and enforce consistency and simplicity.

Activator.CreateInstance: Slow vs. Less-Slow

In the system I’m working on, there is a lot of runtime type resolution, e.g., lines like this:

var obj = Activator.CreateInstance(type, param1, param2, param3);

The params are arguments to the constructor of the type. The type itself is ultimately given through some configuration or otherwise-dynamic means.

Since we do a lot of this (many times per second), we’d like to ensure that instance creation is as fast as it can be. The method given above is not fast.

When this overload of CreateInstance is called, .NET has to do a lot of work to make the call succeed: find the right constructor, verify that all the argument types match up, etc. This can be painfully slow, especially when, in a closed system, you can do a lot of offline validation to ensure that types are correctly specified.

Fortunately, Activator has another overload that is quite a bit faster, if you can use it.

Activator.CreateInstance(type)

This will call the default constructor of the type. However, we still have the problem of passing in those arguments. There are a few ways we could do this, but one of the simplest is to ensure that all the types you’re doing this for implement a common initialization interface, as in this sample program:

using System;
using System.Diagnostics;

interface IPerson
{
    void Initialize(string name, DateTime dob, int weight);
}

class Employee : IPerson
{
    public Employee()
    {
    }

    public Employee(string name, DateTime dob, int weight)
    {
        Initialize(name, dob, weight);
    }

    public Employee(string name, DateTime dob)
        : this(name, dob, -1)
    {
    }

    public void Initialize(string name, DateTime dob, int weight)
    {
        this.name = name;
        this.dob = dob;
        this.weight = weight;
    }

    private string name;
    private DateTime dob;
    private int weight;
}

class Program
{
    static void Main(string[] args)
    {
        const int iterations = 1000000;

        Console.WriteLine("Timing Activator.CreateInstance(type, params)...");
        var employees = new IPerson[iterations];

        var stopwatch = Stopwatch.StartNew();

        for (int i = 0; i < iterations; ++i)
        {
            employees[i] = Activator.CreateInstance(typeof(Employee), 
                "Bob", DateTime.Now, 220) as IPerson;
        }

        stopwatch.Stop();

        Console.WriteLine("Total time: {0} for {1:N0} employees", 
            stopwatch.Elapsed.ToString(), iterations);
        Console.WriteLine();

        Console.WriteLine("Timing Activator.CreateInstance(type) + Initialization...");
        
        var employees2 = new IPerson[iterations];

        var stopwatch2 = Stopwatch.StartNew();            

        for (int i = 0; i < iterations; ++i)
        {
            employees2[i] = Activator.CreateInstance(typeof(Employee)) as IPerson;
            employees2[i].Initialize("Bob", DateTime.Now, 220);
        }

        stopwatch2.Stop();

        Console.WriteLine("Total time: {0} for {1:N0} employees", 
            stopwatch2.Elapsed.ToString(), iterations);
    }
}

What do you think the time difference in the two approaches is?

Timing Activator.CreateInstance(type, params)…

Total time: 00:00:05.4765664 for 1,000,000 employees

Timing Activator.CreateInstance(type) + Initialization…

Total time: 00:00:01.6007687 for 1,000,000 employees

About 3.5x faster. Pretty good. In practice, the improvement will vary. In some of our own timing experiments, we actually saw a much better speedup.

Like this tip? Check out my book, C# 4.0 How-To, where you’ll find hundreds of great tips like this one.