
Practical uses of WeakReference

In Part 1, I discussed the basics of WeakReference and WeakReference<T>. Part 2 introduced short and long weak references as well as the concept of resurrection. I also covered how to use the debugger to inspect your memory for the presence of weak references. This article will complete this miniseries with a discussion of when to use weak references at all and a small, practical example.

When to Use

Short answer: rarely. Most applications won’t require this.

Long answer: If all of the following criteria are met, then you may want to consider it:

  1. Memory use needs to be tightly restricted – this probably means mobile devices these days. If you’re running on Windows RT or Windows Phone, then your memory is restricted.
  2. Object lifetime is highly variable – if you can predict the lifetime of your objects well, then using WeakReference doesn’t really make sense. In that case, you should just control their lifetime directly.
  3. Objects are relatively large, but easy to create – WeakReference is really ideal for that large object that would be nice to have around, but if not, you could easily regenerate it as needed (or just do without).
  4. The object’s size is significantly more than the overhead of using WeakReference<T> – Using WeakReference<T> adds an additional object, which means more memory pressure and an extra dereference step. It would be a complete waste of time and memory to use WeakReference<T> to store an object that’s barely larger than WeakReference<T> itself. However, there are some caveats to this, below.

There is another scenario in which WeakReference may make sense. I call this the “secondary index” feature. Suppose you have an in-memory cache of objects, all indexed by some key. This could be as simple as Dictionary<string, Person>, for example. This is the primary index, and represents the most common lookup pattern, the master table, if you will.

However, you also want to look up these objects with another key, say a last name. Maybe you want a dozen other indexes. Using standard strong references, you could have additional indexes, such as Dictionary<DateTime, Person> for birthdays, etc. When it comes time to update the cache, you then have to modify all of these indexes to ensure that the Person object gets garbage collected when no longer needed.

Doing this on every update might be a pretty big performance hit. Instead, you could spread that cost around by having all of the secondary indexes hold weak references instead: Dictionary<DateTime, WeakReference<Person>>, or, if the index has non-unique keys (likely), Dictionary<DateTime, List<WeakReference<Person>>>.

By doing this, the cleanup process becomes much easier: you just update the master cache, which removes the only strong reference to the object. The next time a garbage collection runs (of the appropriate generation), the Person object will be cleaned up. If you ever access a secondary index looking for those objects, you’ll discover the object has been cleaned up, and you can clean up those indexes right then. This spreads out the cost of cleanup of the index overhead, while allowing the expensive cached objects to be cleaned up earlier.
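
As a rough sketch of this pattern (the Person shape and the field names here are placeholders I’m assuming, not code from any real project), a birthday index built on weak references might look like this:

class Person
{
    public string Id;
    public string LastName;
    public DateTime Birthday;
}

class PersonCache
{
    // Primary index: the only strong references to the cached objects.
    private readonly Dictionary<string, Person> byId =
        new Dictionary<string, Person>();

    // Secondary index: weak references only, so it never keeps a Person alive.
    private readonly Dictionary<DateTime, List<WeakReference<Person>>> byBirthday =
        new Dictionary<DateTime, List<WeakReference<Person>>>();

    public void Add(Person person)
    {
        byId[person.Id] = person;

        List<WeakReference<Person>> list;
        if (!byBirthday.TryGetValue(person.Birthday, out list))
        {
            list = new List<WeakReference<Person>>();
            byBirthday[person.Birthday] = list;
        }
        list.Add(new WeakReference<Person>(person));
    }

    public void Remove(string id)
    {
        // Only the master table needs updating; the secondary index is
        // cleaned up lazily the next time it is searched.
        byId.Remove(id);
    }

    public List<Person> FindByBirthday(DateTime birthday)
    {
        var results = new List<Person>();

        List<WeakReference<Person>> list;
        if (byBirthday.TryGetValue(birthday, out list))
        {
            // Walk backwards so dead entries can be pruned as we go.
            for (int i = list.Count - 1; i >= 0; --i)
            {
                Person person;
                if (list[i].TryGetTarget(out person))
                {
                    results.Add(person);
                }
                else
                {
                    // The Person was collected; clean up the index entry now.
                    list.RemoveAt(i);
                }
            }
        }

        return results;
    }
}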

Other Uses

This Stack Overflow thread has some additional thoughts, with some variations of the example below and other uses.

A rather famous and involved example is using WeakReferences to prevent the dangling event handler problem (where failure to unregister an event handler keeps objects in memory, despite them having no explicit references anywhere in your code).
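
To illustrate just the problem itself (the types below are made up for this sketch; this is not the weak-event solution, which is considerably more involved):

class Publisher
{
    public event EventHandler SomethingHappened;
}

class Subscriber
{
    // Expensive state that we would like to be collectible.
    private byte[] bigBuffer = new byte[10 * 1024 * 1024];

    public Subscriber(Publisher publisher)
    {
        // The publisher's event now holds a strong reference to this Subscriber
        // through the delegate. Until the handler is unregistered (or the
        // publisher itself becomes unreachable), this Subscriber and its buffer
        // cannot be garbage collected, even if nothing else references them.
        publisher.SomethingHappened += this.OnSomethingHappened;
    }

    private void OnSomethingHappened(object sender, EventArgs e) { }
}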

Practical Example

I had mentioned in Chapter 2 (Garbage Collection) of Writing High-Performance .NET Code that WeakReference could be used in a multilevel caching system to allow objects to gracefully fall out of memory when pressure increases. You can start with strong references and then demote them to weak references according to some criteria you choose.

That is the example I’ll show here. Note that this is not production-quality code. It’s only about 5% of the code you would actually need, even assuming this algorithm makes sense in your scenario. At a minimum, you would probably want to implement IDictionary<TKey, TValue> on it, tighten up some of the temporary memory allocations, and more.

This is a very simple implementation. When you add items to the cache, it adds them as strong references (removing any existing weak references for that key). When you attempt to read a value from the cache, it tries the strong references first, before attempting the weak references.

Objects are demoted from strong to weak references based simply on a maximum age. This is admittedly rather simplistic, but it gets the point across.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;

namespace WeakReferenceCache
{
    sealed class HybridCache<TKey, TValue>
        where TValue : class
    {
        class ValueContainer<T>
        {
            public T value;
            public long additionTime;
            public long demoteTime;
        }

        private readonly TimeSpan maxAgeBeforeDemotion;

        private readonly ConcurrentDictionary<TKey, ValueContainer<TValue>> strongReferences =
            new ConcurrentDictionary<TKey, ValueContainer<TValue>>();

        private readonly ConcurrentDictionary<TKey, WeakReference<ValueContainer<TValue>>> weakReferences =
            new ConcurrentDictionary<TKey, WeakReference<ValueContainer<TValue>>>();

        public int Count { get { return this.strongReferences.Count; } }

        public int WeakCount { get { return this.weakReferences.Count; } }

        public HybridCache(TimeSpan maxAgeBeforeDemotion)
        {
            this.maxAgeBeforeDemotion = maxAgeBeforeDemotion;
        }

        public void Add(TKey key, TValue value)
        {
            RemoveFromWeak(key);

            var container = new ValueContainer<TValue>();
            container.value = value;
            container.additionTime = Stopwatch.GetTimestamp();
            container.demoteTime = 0;

            this.strongReferences.AddOrUpdate(key, container, (k, existingValue) => container);
        }

        private void RemoveFromWeak(TKey key)
        {
            WeakReference<ValueContainer<TValue>> oldValue;
            weakReferences.TryRemove(key, out oldValue);
        }

        public bool TryGetValue(TKey key, out TValue value)
        {
            value = null;

            ValueContainer<TValue> container;
            if (this.strongReferences.TryGetValue(key, out container))
            {
                AttemptDemotion(key, container);
                value = container.value;
                return true;
            }

            WeakReference<ValueContainer<TValue>> weakRef;
            if (this.weakReferences.TryGetValue(key, out weakRef))
            {
                if (weakRef.TryGetTarget(out container))
                {
                    value = container.value;
                    return true;
                }
                else
                {
                    RemoveFromWeak(key);
                }
            }

            return false;
        }

        public void DemoteOldObjects()
        {
            var demotionList = new List<KeyValuePair<TKey, ValueContainer<TValue>>>();

            long now = Stopwatch.GetTimestamp();

            foreach (var kvp in this.strongReferences)
            {
                var age = CalculateTimeSpan(kvp.Value.additionTime, now);
                if (age > this.maxAgeBeforeDemotion)
                {
                    demotionList.Add(kvp);
                }
            }

            foreach (var kvp in demotionList)
            {
                Demote(kvp.Key, kvp.Value);
            }
        }

        private void AttemptDemotion(TKey key, ValueContainer<TValue> container)
        {
            long now = Stopwatch.GetTimestamp();
            var age = CalculateTimeSpan(container.additionTime, now);
            if (age > this.maxAgeBeforeDemotion)
            {
                Demote(key, container);
            }
        }

        private void Demote(TKey key, ValueContainer<TValue> container)
        {
            ValueContainer<TValue> oldContainer;
            this.strongReferences.TryRemove(key, out oldContainer);

            container.demoteTime = Stopwatch.GetTimestamp();
            var weakRef = new WeakReference<ValueContainer<TValue>>(container);
            this.weakReferences.AddOrUpdate(key, weakRef, (k, oldRef) => weakRef);
        }

        private TimeSpan CalculateTimeSpan(long offsetA, long offsetB)
        {
            long diff = offsetB - offsetA;
            double seconds = (double)diff / Stopwatch.Frequency;
            return TimeSpan.FromSeconds(seconds);
        }
    }
}
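
For illustration, basic usage of this cache might look something like the following (the CacheItem type, the expensiveReport value, and the Use and RegenerateReport helpers are hypothetical placeholders, not part of the class above):

var cache = new HybridCache<string, CacheItem>(TimeSpan.FromMinutes(5));

cache.Add("report-2014-06", expensiveReport);

// Periodically (on a timer, for example), demote anything older than the
// maximum age from a strong reference to a weak reference.
cache.DemoteOldObjects();

CacheItem item;
if (cache.TryGetValue("report-2014-06", out item))
{
    // Still cached: either a strong reference, or a weak reference whose
    // target has not been collected yet.
    Use(item);
}
else
{
    // Demoted and collected: regenerate the value and re-add it.
    item = RegenerateReport("report-2014-06");
    cache.Add("report-2014-06", item);
}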

That’s it for the series on Weak References–I hope you enjoyed it! You may never need them, but when you do, you should understand how they work in detail to make the smartest decisions.

Concurrency, Performance, Arrays and when Dirty Writes are OK

This article will show how to increase the performance of a rather simple algorithm by up to 80%, by accepting a small loss in absolute accuracy and taking advantage of how processors work.

Introduction

When we talk about multiple threads using a shared resource, often the first thing we think of is how to synchronize access to that resource so that it is always in a valid, deterministic state. There are, however, many times when the exact opposite approach is desired. Sometimes it’s appropriate to sacrifice accuracy in favor of performance.

To demonstrate the examples in this article, I developed a small test driver that will put our various test classes through their paces on multiple threads and measure a couple of key factors: elapsed time and error rate. To see the code for this driver, download the accompanying sample project.
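
The real driver is in the download; as a rough idea of what it does, here is my own simplified sketch (the value range, thread count, and formatting are arbitrary choices of mine, and the usual System.Threading, System.Diagnostics, and System.Collections.Generic usings are assumed), using the Histogram class shown below:

const int ThreadCount = 32;
const long TotalSamples = 100000000;
const long SamplesPerThread = TotalSamples / ThreadCount;
const int Range = 10000;

var histogram = new Histogram(Range);
var threads = new List<Thread>();
var watch = Stopwatch.StartNew();

for (int i = 0; i < ThreadCount; ++i)
{
    int seed = i;  // local copy so each thread gets a distinct random seed
    var thread = new Thread(() =>
    {
        var random = new Random(seed);
        for (long s = 0; s < SamplesPerThread; ++s)
        {
            // Values are uniformly distributed over 0..Range, inclusive.
            histogram.AddSample(random.Next(Range + 1));
        }
    });
    threads.Add(thread);
    thread.Start();
}

threads.ForEach(t => t.Join());
watch.Stop();

// Measure how far the counts drifted from the number of samples we added.
long sum = 0;
foreach (var count in histogram.SampleValueCounts)
{
    sum += count;
}

Console.WriteLine("Elapsed: {0}", watch.Elapsed);
Console.WriteLine("SampleCount: {0:N0} (Error: {1:P3})",
    histogram.SampleCount,
    (histogram.SampleCount - TotalSamples) / (double)TotalSamples);
Console.WriteLine("Sum: {0:N0} (Error: {1:P3})",
    sum,
    (sum - TotalSamples) / (double)TotalSamples);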

Histogram

Recently, I needed to implement something that, reduced to essentials, is a histogram. It needed an array of counts and an integer tracking how many samples had been added. Here’s a simplified version of just the histogram functionality:

class Histogram
{
    protected long sampleCount;
    protected ushort[] sampleValueCounts;

    public virtual long SampleCount { get { return sampleCount; } }
    public ushort[] SampleValueCounts { get { return sampleValueCounts; } }

    public Histogram(int range)
    {
        // + 1 to allow 0 - range, inclusive
        this.sampleValueCounts = new ushort[range + 1];
    }

    public virtual void AddSample(int value)
    {
        ++sampleValueCounts[value];
        ++sampleCount;
    }
}

Pretty simple, right?

Now let’s add the requirement that this needs to be accessed from multiple threads. How is that going to affect the implementation? How do we ensure that sampleCount and the sum of sampleValueCounts[] stay in sync?

The obvious solution is to add a lock in AddSample:

lock(sampleLock)
{
    ++sampleValueCounts[value];
    ++sampleCount;
}
    

Let’s add another requirement: performance. Does the locked version of AddSample perform well enough? Maybe—we should absolutely measure and find out. In my case, I knew some additional details in how the system worked and had past experience to think that I may want to reconsider the lock.

I knew that there would be many thousands of instances of the Histogram and that every second, hundreds to thousands of them would need to be updated. This could be very hot code.

AddSample is a very cheap function—two increments! By adding a lock, even one that most likely runs completely in user mode, I may have significantly increased the time it takes to execute this method. We’ll see an actual measurement below, and it’s not quite that bad in this case, but we can do better.

Lock first to Ensure Correctness

You should always try it first with a lock—make it correct first, then optimize. If your performance is fine with the lock, then you don’t need to go any further—just stop and work on something more important.

Before we add a lock, though, let’s see how the basic histogram performs with no locking and multiple threads:

Type: Histogram

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:10.0959830

SampleCount: 13,641,046 (Error: -86.359%)

Sum: 99,969,354 (Error: -0.031%)

On my 4-core machine, this took about 10 seconds. The sum of the histogram values was very close to the number of values we tried to add—definitely within our tolerance. Unfortunately, SampleCount is way off.

Intuitively, this makes sense. SampleCount is updated on every single sample, so between all the threads, there is ample opportunity to race and obliterate this value. But the array—not so much. Since I’m testing with a random distribution, there is very low likelihood that there will be a conflict in a particular slot.

Let’s see what happens when we add a lock:

class HistogramLocked : Histogram
{
    private object addLock = new object();

    public HistogramLocked(int range) : base(range) { }

    public override void AddSample(int value)
    {
        lock (addLock)
        {
            ++sampleValueCounts[value];
            ++sampleCount;
        }
    }
}

Type: HistogramLocked

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:08.0839173

SampleCount: 100,000,000 (Error: 0.000%)

Sum: 100,000,000 (Error: 0.000%)

Woah, our time decreased! Actually, on my home machine it decreased. On my work machine (which has faster and more processors) the time did actually increase by about 10% (proving the point above: you know nothing until you measure).

In most cases, we should just stop here. We’ve ensured a correct system for very little cost. You would need to have a good reason to try to optimize this further.

Since we saw that there weren’t that many conflicts on the array, what if we just protect the sampleCount increment, and do that with an Interlocked.Increment() method?

class HistogramInterlocked : Histogram
{
    public HistogramInterlocked(int range) : base(range) { }

    public override void AddSample(int value)
    {
        ++sampleValueCounts[value];
        Interlocked.Increment(ref this.sampleCount);
    }
}

Type: HistogramInterlocked

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:10.7659328

SampleCount: 100,000,000 (Error: 0.000%)

Sum: 99,957,294 (Error: -0.043%)

In this run that’s actually slower than the fully locked version, and the sum of the array counts is still slightly off—so we may as well just use a lock and get 100% accuracy in both counts.

Approximation can be Good Enough

Want to eke out even more performance? To do this without a lock, you need to give up one attribute: correctness. If there are enough samples, then missing a few is fine. You just need to do two things:

  1. Ensure your algorithm can work in the face of slightly incorrect data
  2. Minimize the error as cheaply as possible.

Once you develop an algorithm that approximates a result, you should be able to measure the error rate to validate that the performance gain is worth the loss of accuracy.

Now let’s see if we can get an approximate SampleCount that’s “good enough.”

Instead of having all threads increment a single value, they can each increment their own value. We “bucketize” the sampleCount variable. When we need the real total, we can just add the buckets up and, in effect, transfer some of the CPU cost to a much rarer operation.

class HistogramThreadBuckets : Histogram
{
    private long[] valueBuckets;

    public long[] ValueBuckets { get { return this.valueBuckets; } }

    public override long SampleCount
    {
        get
        {
            long sum = 0;
            for (int i = 0; i < this.valueBuckets.Length; ++i)
            {
                sum += this.valueBuckets[i];
            }
            return sum;
        }
    }

    public HistogramThreadBuckets(int range, int valueBuckets) : base(range) 
    {
        this.valueBuckets = new long[valueBuckets];
    }

    public override void AddSample(int value)
    {
        ++sampleValueCounts[value];

        int bucket = Thread.CurrentThread.ManagedThreadId % this.valueBuckets.Length;
        ++valueBuckets[bucket];            
    }
}

I ran this algorithm with multiple bucket counts to see the difference:

Type: HistogramThreadBuckets

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:09.1901657

ValueCount Buckets: 4

SampleCount: 92,941,626 (Error: -7.058%)

Sum: 99,934,033 (Error: -0.066%)

Type: HistogramThreadBuckets

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:06.9367826

ValueCount Buckets: 8

SampleCount: 96,263,770 (Error: -3.736%)

Sum: 99,938,926 (Error: -0.061%)

Type: HistogramThreadBuckets

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:06.6882437

ValueCount Buckets: 16

SampleCount: 98,274,025 (Error: -1.726%)

Sum: 99,954,021 (Error: -0.046%)

Type: HistogramThreadBuckets

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:04.6969067

ValueCount Buckets: 32

SampleCount: 99,934,543 (Error: -0.065%)

Sum: 99,936,526 (Error: -0.063%)

Cool, we just cut our time in half! Something about this should seem weird, though. How can this be significantly faster than the naïve algorithm when neither of them uses locks? There’s no contention, right?

Well, actually, there is. These buckets likely share cache lines across the various processors’ caches, and when one CPU modifies its bucket, the cache coherency protocol has to invalidate that line in every other CPU’s cache that holds it. The more buckets there are, the less likely it is that two threads’ counters will end up colliding like this.

Know Your Cache Line Size

Can we do even better?

If we know that our CPUs are going to coordinate on cache lines anyway, can we ensure we have NO conflicts between our threads? If we know the size of a cache line and can absolutely guarantee that each thread writes to a different cache line, then we could take complete advantage of the CPU caches.

To get the size of a cache line, use one of the many CPU-information tools available on the Internet. I used a freeware utility called CPU-Z. In my case, the cache line is 64 bytes, so I just need to ensure that each bucket holding a value is at least 64 bytes long. Since a long is 8 bytes, I can just pad out a struct to do this (or just multiply the size of the array and use only every nth entry, as sketched below).
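
As a rough sketch of that second option (the names are mine, and the 64-byte figure is just my machine’s cache line size, not a universal constant):

// Assumes a 64-byte cache line and 8-byte longs: 8 longs per cache line.
const int LongsPerCacheLine = 8;

// Allocate 8x as many slots as threads and use only every 8th one, so no two
// threads' counters can land on the same cache line.
long[] paddedBuckets = new long[threadCount * LongsPerCacheLine];

// In AddSample, thread i would then increment paddedBuckets[i * LongsPerCacheLine].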

The other thing we need to do is ensure there are no collisions between threads on the buckets. In the previous code, I just hashed the ManagedThreadId into the buckets, but this isn’t reliable since we don’t know the thread ids that we’ll get. The solution is just to send in our own “thread id” that we can reliably map to a unique bucket.

class HistogramThreadBucketsCacheLine : Histogram
{
    public struct ValueBucket
    {
        private long _junk0;
        private long _junk1;
        private long _junk2;
        
        public long value;

        private long _junk4;
        private long _junk5;
        private long _junk6;
        private long _junk7;            
    };

    private ValueBucket[] valueBuckets;

    public ValueBucket[] ValueBuckets { get { return this.valueBuckets; } }

    public override long SampleCount
    {
        get
        {
            long sum = 0;
            for (int i = 0; i < this.valueBuckets.Length; ++i)
            {
                sum += this.valueBuckets[i].value;
            }
            return sum;
        }
    }

    public HistogramThreadBucketsCacheLine(int range, int valueBuckets)
        : base(range) 
    {
        this.valueBuckets = new ValueBucket[valueBuckets];
    }

    // Takes a caller-supplied thread id; this is a new overload rather than an
    // override, since the base class's AddSample takes only the sample value.
    public void AddSample(int value, int threadId)
    {
        ++sampleValueCounts[value];

        ++valueBuckets[threadId].value;
    }
}

And the output is:

Type: HistogramThreadBucketsCacheLine

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:02.0215371

SampleCount: 100,000,000 (Error: 0.000%)

Sum: 99,761,090 (Error: -0.239%)

We more than halved it again! Not too shabby!

Summary

By being willing to accept some loss of accuracy, and by taking advantage of how processors work, we were able to reduce the running time of a fairly simple and superficially efficient algorithm by 80%. As with all performance investigations, your mileage may vary, and a lot depends on the data you’re manipulating and the hardware it runs on. Optimizations like these should be quite rare, and applied only in places where you absolutely need them.

Download the sample code project.

Like this tip? Check out my book, C# 4.0 How-To, where you’ll find hundreds of great tips like this one.