Tag Archives: threads

4 Essential Tips for High-Performance Garbage Collection on Servers

Garbage CollectionUpdate: If you find this useful, you can read a much more complete treatment of garbage collection and performance in my book Writing High-Performance .NET Code.

Update: Part 2 – How to Debug GC Issues with PerfView is now available.

On this blog, I’ve alluded to the fact that I work on high-performance server applications, most recently in .Net. Writing these in .Net is just as possible as it is in native code, but it does come with its own set of challenges. In particular, one of the biggest things you need to learn how to deal with is garbage collection.

There is a lot out there already written about the CLR’s garbage collector, so I’m not going to go over many of the details. If you need a primer on it, MSDN has some documentation:

Read that first. For the rest of this article series, I will assume that you understand how the GC basically works.

In this and future articles, I’ll cover a lot of the stuff I’ve learned to improve application performance in the face of garbage collection.

Tip 1: Use Server GC

There are two modes of garbage collection (GC): workstation and server. As long as you’re running multiple processors, you almost certainly want server mode collection. With workstation mode, a GC happens on the thread that makes the allocation that causes the GC. The collection happens at normal priority.

With server GC, a thread for every core is created just for doing GC. There is also a small object heap and a large object heap created for each GC thread. All of the program’s allocations are spread among these heaps (more on large object heaps later). When no GC is happening, these threads are blocked and do nothing. When a GC is triggered, all of the user threads get paused, and all the GC threads wake up at highest priority and do collection in parallel. All of these optimizations lead to server GC usually being much faster than workstation GC.

A word about concurrent collections: In workstation GC, concurrent collections are enabled by default. However, this only applies to generation 2 collections. Generations 0 and 1 are always blocking. However, given that it’s concurrent, that means that it will compete with your own threads that are trying to get actual work done. In a high-performance server scenario, that may not be acceptable. A better strategy is to ensure that generation 2 collections never (or extremely rarely) happen.

You enable server GC by putting this in app.config:

<configuration>
   <runtime>
      <gcServer enabled="true"/>
   </runtime>
</configuration>

 

Tip 2: Objects Live Briefly or Forever

A histogram of object lifetimes in your app should look essentially like this:

image

Object last either a vanishingly brief amount of time, or they last forever – it’s the stuff in the middle that will kill your performance.

This has everything to do with the generations of garbage collection and object survivorship. There are three generations: 0, 1, and 2. Generation 0 happens most often and is the fastest—ideally lasting only a couple of milliseconds, if that. Objects that didn’t get cleaned up in generation 0 are put into generation 1. Generation 1 collections are also very fast, usually as fast as generation 0. The problem, though, is that objects that make it to generation 1 have a fair chance of surviving this generation, and being put into generation 2.

Generation 2 is the problem. A generation 2 collection is much slower than 0 or 1—often on the order of hundreds of milliseconds or even seconds—that means your process is paused completely for that time. You do not want objects to survive to generation 2.

So how often do collections happen? There is no hard-and-fast rule: it all depends on your allocation rate, memory pressure, and patterns that you’ve trained into the GC. The GC will adapt over time, training itself on your memory usage patterns. All of this completely depends on your application and I’ll look at ways to measure all of this in a future article.

Tip 3: All Long-Lived Objects Must Be Pooled

It may be that you can’t ensure all objects for a given request are cleaned up in the first generation 0 collection that occurs. If requests are in memory longer than the time between collections, then you’re guaranteed to have survivorship.

For these types of objects, first see if you can factor them so that not all parts them have to live that long. Control object lifetime very closely and null out references once you’re done.

Once you’ve done that, hopefully there are only a handful of objects that really must last the entire length of a request. For those, create a pool of them with reinitialization semantics—effectively move them to the far end of that histogram above, where they live forever.

This works because of the adaptive nature of the garbage collector – it learns over time that if it does a collection and doesn’t free up much memory, it will schedule that generation of collection to happen less frequently. In my own case, at one point, our server had trained the GC to do a generation 2 collection less often than once per day, under a constant load. With enough work, we could probably get that to essentially never.

You may be able to get quite far without the need to implement object pooling. Or you may need to pool only a small number of objects, and the survivorship of the remaining objects is not enough to cause problematic garbage collections—only measurement and observation will tell you for sure.

Tip 4: All Large Objects Must Be Pooled

There is a way to cause an object to automatically be in generation 2: make it at least 85000 bytes in size. Anything at least that size gets put into a Large Object Heap. Only generation 2 collections service that type of heap.

Want to cause a generation 2 collection? Do this:

byte[] buffer = new byte[85000];

If you want high-performance, you absolutely cannot do this per request on a server. These types of buffers, or other large objects, must be pooled. There is no built-in pooling mechanism in .Net—you must write your own. There are usually not too many large objects you’ll need to pool: strings and byte buffers are the usual suspects, if you need to do much serialization/deserialization, but also look out for collections of any type.

If you want to know more about the Large Object Heap and why 85000 bytes is the threshold, read this great article: Large Object Heap Uncovered.

Pooling collection objects comes with its own set of challenges:

  • You can’t assume the full collection is valid (the difference between length and capacity). If you use pooled arrays, for example, you have to track the length separately, since only a small portion of the array may be valid. This can drastically affect the interfaces between components.
  • Pooled collections that can grow over time will cause your memory to rise indefinitely unless you put limits on the size of the pool and/or the size of collections within the pool.
  • Large Object Heaps are not compacted during collection, which means that you can fragment the heap such that it’s wasting a lot of memory. It all depends on your allocation and collection pattern. I may talk about heap fragmentation in another article.

Once you solve those, you’re good to go… no more generation 2 collections!

Next Time…

In my next article, I’ll cover tools you can use to measure garbage collection statistics, and how you can use that knowledge to improve your performance.


Check out my latest book, the essential, in-depth guide to performance for all .NET developers:

Writing High-Performance.NET Code, 2nd Edition by Ben Watson. Available for pre-order:

Concurrency, Performance, Arrays and when Dirty Writes are OK

This article will show how to increase the performance of a rather simple algorithm up to 80%, by showing how we can accept a loss in absolutely accuracy, and also taking advantage of how processors work.

Introduction

When we talk about multiple threads using a shared resource, often the first thing we think of is how to synchronize access to that resource so that it is always in a valid, deterministic state. There are, however, many times when the exact opposite approach is desired. Sometimes it’s appropriate to sacrifice accuracy in favor of performance.

To demonstrate the examples in this article, I developed a small test driver that will put our various test classes through their paces on multiple threads and measure a couple of key factors: elapsed time and error rate. To see the code for this driver, download the accompanying sample project.

Histogram

Recently, I needed to implement something that, reduced to essentials, is a histogram. It needed an array of counts and an integer tracking how many samples had been added. Here’s a simplified version of just the histogram functionality:

class Histogram
{
    protected long sampleCount;
    protected ushort[] sampleValueCounts;

    public virtual long SampleCount { get { return sampleCount; } }
    public ushort[] SampleValueCounts { get { return sampleValueCounts; } }

    public Histogram(int range)
    {
        // + 1 to allow 0 - range, inclusive
        this.sampleValueCounts = new ushort[range + 1];
    }

    public virtual void AddSample(int value)
    {
        ++sampleValueCounts[value];
        ++sampleCount;
    }
}

Pretty simple, right?

Now let’s add the requirement that this needs to be access from multiple threads. How is that going to affect the implementation? How do we ensure that sampleCount and the sum of sampleValueCounts[] stay in sync?

The obvious solution is to add a lock in AddSample:

lock(sampleLock)
{
    ++sampleValueCounts[value];
    ++sampleCount;
}
    

Let’s add another requirement: performance. Does the locked version of AddSample perform well enough? Maybe—we should absolutely measure and find out. In my case, I knew some additional details in how the system worked and had past experience to think that I may want to reconsider the lock.

I knew that there would be many thousands of instances of the Histogram and that every second, hundreds to thousands of them would need to be updated. This could be very hot code.

AddSample is a very cheap function—two increments! By adding a lock, even one that most likely runs completely in user mode, I may have significantly increased the time it takes to execute this method. We’ll see an actual measurement below, and it’s not quite that bad in this case, but we can do better.

Lock first to Ensure Correctness

You should always try it first with a lock—make it correct first, then optimize. If your performance is fine with the lock, then you don’t need to go any further—just stop and work on something more important.

Before we add a lock, though, let’s see how the basic histogram performs with no locking and multiple threads:

Type: Histogram

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:10.0959830

SampleCount: 13,641,046 (Error: -86.359%)

Sum: 99,969,354 (Error: -0.031%)

On my 4 core machine, this took about 10 seconds. The sum of the histogram values was very close to the number of values we tried to add—definitely within our tolerance. Unfortunately, SampleCount is way off.

Intuitively, this makes sense. SampleCount is updated on every single sample, so between all the threads, there is ample opportunity to race and obliterate this value. But the array—not so much. Since I’m testing with a random distribution, there is very low likelihood that there will be a conflict in a particular slot.

Let’s see what happens when we add a lock:

class HistogramLocked : Histogram
{
    private object addLock = new object();

    public HistogramLocked(int range) : base(range) { }

    public override void AddSample(int value)
    {
        lock (addLock)
        {
            ++sampleValueCounts[value];
            ++sampleCount;
        }
    }
}

Type: HistogramLocked

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:08.0839173

SampleCount: 100,000,000 (Error: 0.000%)

Sum: 100,000,000 (Error: 0.000%)

Woah, our time decreased! Actually, on my home machine it decreased. On my work machine (which has faster and more processors) the time did actually increase by about 10% (proving the point above: you know nothing until you measure).

In most cases, we should just stop here. We’ve ensured a correct system for very little cost. You would need to have a good reason to try to optimize this further.

Since we saw that there weren’t that many conflicts on the array, what if we just protect the sampleCount increment, and do that with an Interlocked.Increment() method?

class HistogramInterlocked : Histogram
{
    public HistogramInterlocked(int range) : base(range) { }

    public override void AddSample(int value)
    {
        ++sampleValueCounts[value];
        Interlocked.Increment(ref this.sampleCount);
    }
}

Type: HistogramInterlocked

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:10.7659328

SampleCount: 100,000,000 (Error: 0.000%)

Sum: 99,957,294 (Error: -0.043%)

We may as well just use a lock and get 100% accuracy in both counts.

Approximation can be Good Enough

Want to eke out even more performance? To do this without a lock, you need to give up one attribute: correctness. If there are enough samples, then missing a some is fine. You just need to do two things:

  1. Ensure your algorithm can work in the face of slightly incorrect data
  2. Minimize the error as cheaply as possible.

Once you develop an algorithm that approximates a result, you should be able to measure the error rate to validate that the performance gain is worth the loss of accuracy.

Now let’s see if we can get an approximate SampleCount that’s “good enough.”

Instead of having all threads increment a single value, they can all increment their own value. We “bucketize” the sampleCount variable. When we need the real total, we can just add them up, and, in effect, transfer some of the CPU cost to a more rare operation.

class HistogramThreadBuckets : Histogram
{
    private long[] valueBuckets;

    public long[] ValueBuckets { get { return this.valueBuckets; } }

    public override long SampleCount
    {
        get
        {
            long sum = 0;
            for (int i = 0; i < this.valueBuckets.Length; ++i)
            {
                sum += this.valueBuckets[i];
            }
            return sum;
        }
    }

    public HistogramThreadBuckets(int range, int valueBuckets) : base(range) 
    {
        this.valueBuckets = new long[valueBuckets];
    }

    public override void AddSample(int value)
    {
        ++sampleValueCounts[value];

        int bucket = Thread.CurrentThread.ManagedThreadId % this.valueBuckets.Length;
        ++valueBuckets[bucket];            
    }
}

I run this algorithm with multiple bucket counts to see the difference:

Type: HistogramThreadBuckets

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:09.1901657

ValueCount Buckets: 4

SampleCount: 92,941,626 (Error: -7.058%)

Sum: 99,934,033 (Error: -0.066%)

Type: HistogramThreadBuckets

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:06.9367826

ValueCount Buckets: 8

SampleCount: 96,263,770 (Error: -3.736%)

Sum: 99,938,926 (Error: -0.061%)

Type: HistogramThreadBuckets

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:06.6882437

ValueCount Buckets: 16

SampleCount: 98,274,025 (Error: -1.726%)

Sum: 99,954,021 (Error: -0.046%)

Type: HistogramThreadBuckets

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:04.6969067

ValueCount Buckets: 32

SampleCount: 99,934,543 (Error: -0.065%)

Sum: 99,936,526 (Error: -0.063%)

Cool, we just cut our time in half! Something about this should seem weird, though. How can this be significantly faster than the naïve algorithm when neither of them use locks? There’s no contention anyway!

Well, actually, there is. These buckets likely exist in the various caches for multiple processors, and when you modify one, a certain amount of communication goes on between CPUs to resolve conflicts and invalidate the caches. The more variables there are, the less likely this will happen.

Know Your Cache Line Size

Can we do even better?

If we know that all our CPUs are going to need to coordinate among cache lines, can we ensure we have NO conflicts between our threads? If we know the size of a cache line and absolutely guaranteed that each thread went to a different cache line, then we could take complete advantage of the CPU caches.

To get the size of a cache line, use one of many CPU-information tools available on the Internet. I used one freeware utility called CPU-Z. In my case, the cache line is 64-bytes, so I just need to ensure that the buckets holding the values are at least 64 bytes long. Since a long is 8 bytes, I can just pad out a struct to do this (or just multiply the size of the array and use only every nth entry).

The other thing we need to do is ensure there are no collisions between threads on the buckets. In the previous code, I just hashed the ManagedThreadId into the buckets, but this isn’t reliable since we don’t know the thread ids that we’ll get. The solution is just to send in our own “thread id” that we can reliably map to a unique bucket.

class HistogramThreadBucketsCacheLine : Histogram
{
    public struct ValueBucket
    {
        private long _junk0;
        private long _junk1;
        private long _junk2;
        
        public long value;

        private long _junk4;
        private long _junk5;
        private long _junk6;
        private long _junk7;            
    };

    private ValueBucket[] valueBuckets;

    public ValueBucket[] ValueBuckets { get { return this.valueBuckets; } }

    public override long SampleCount
    {
        get
        {
            long sum = 0;
            for (int i = 0; i < this.valueBuckets.Length; ++i)
            {
                sum += this.valueBuckets[i].value;
            }
            return sum;
        }
    }

    public HistogramThreadBucketsCacheLine(int range, int valueBuckets)
        : base(range) 
    {
        this.valueBuckets = new ValueBucket[valueBuckets];
    }

    public override void AddSample(int value, int threadId)
    {
        ++sampleValueCounts[value];

        ++valueBuckets[threadId].value;
    }
}

And the output is:

Type: HistogramThreadBucketsCacheLine

Adding 100,000,000 samples on 32 threads

Elapsed: 00:00:02.0215371

SampleCount: 100,000,000 (Error: 0.000%)

Sum: 99,761,090 (Error: -0.239%)

We more than halved it again! Not too shabby! Smile

Summary

By being willing to accept some loss of accuracy, and taking advantage of how processors work, we were able to reduce the running time of a fairly simple and superficially efficient algorithm by 80%. As with all performance investigations, your mileage may vary, and lots depends on the data your manipulating and the hardware it runs on. Optimizations like these should be quite rare, and only in places you absolutely need them.

Download the sample code project.

Like this tip? Check out my book, C # 4.0 How-To, where you’ll finds hundreds of great tips like this one.


Check out my latest book, the essential, in-depth guide to performance for all .NET developers:

Writing High-Performance.NET Code, 2nd Edition by Ben Watson. Available for pre-order:

Pausing a Thread Safely

In .Net you have the option of Thread.Suspend() and Thread.Abort(), but under normal circumstances these are horrible options to use because they’re likely to create locking problems (i.e., the thread has a locks on resource that other threads need).

A far better way is to signal to the thread that it should pause and wait for a signal to continue.  Sometimes it can be as simple as setting a boolean value where both threads can see it. Or you could wrap the boolean value in a property and protect it with a lock, but this is probably overkill for most applications.

To pause a thread safely requires you to use a blocking signal mechanism and .Net has ManualResetEvent for just this purpose.

You could then have some C# code like this:

ManualResetEvent pauseEvent = new ManualResetEvent();

//Thread A (Controlling thread)
void OnPause()
{
 pauseEvent.Reset();
}

void OnStart()
{
pauseEvent.Set();
}

//Thread B (the one to pause)
void ThreadFunction()
{
while(doWork == true)
 {
 //do work
 …
 …

 //wait until event is set
 pauseEvent.WaitOne();
 }
}

This way, you can safely pause work and continue it at a later time, safely avoiding the possibility of deadlocks*.

* Actually, it’s still possible to create a resource locking problem–make sure you release any shared resource locks before pausing.


Check out my latest book, the essential, in-depth guide to performance for all .NET developers:

Writing High-Performance.NET Code, 2nd Edition by Ben Watson. Available for pre-order:

Threads in MFC Part III: Exceptions, Suspense, Murder, and Safety

Exceptions

In the previous tutorial, I described the various synchronization objects you can use to control access to shared objects. In most cases, these will work fine, but consider the following situation:

[code lang=”cpp”]
UINT ThreadFunc(LPVOID lParam)
{
::criticalSection.Lock();
::globalData.DoSomething();
SomeFunctionThatThrowsException();
::criticalSection.Unlock();
return 0;
}[/code]

What’s going to happen when that exception gets thrown? The critical section will never be unlocked. If you start the thread again, it will again try to lock it, and finding it already locked, it will sit there forever waiting. Of course, a mutex will unlock when the thread exits, but a critical section won’t. So MFC has a couple of wrapper classes that can incorporate any of the other basic synchronization classes. These are called CSingleLock and CMultiLock.

Here is how they are used:
[code lang=”cpp”]
UINT ThreadFunc(LPVOID lParam)
{
CSingleLock lock(&(::criticalSection));
lock.Lock();
::globalData.DoSomething();
FunctionThatThrowsException();
lock.Unlock();
return 0;
}
[/code]

You merely pass the address of the “real” synchronization object. CSingleLock lock is created on ThreadFunc’s stack, so when an exception is thrown, and that function exits prematurely without a chance to nicely clean up, CSingleLock’s destructor is called, which unlocks the data. This would not happen to criticalSection because, being a global variable, it will not go out of scope and be destroyed when ThreadFunc exits.

CMultiLock

This class allows you to block, or wait, on up to 64 synchornization objects at once. You create an array of references to the objects, and pass this to the constructor. In addition, you can specify whether you want it to unblock when one object unlocks, or when all of them do.

[code lang=”cpp”]
//let’s pretend these are all global objects, or defined other than in the local function
\tCCriticalSection cs;
CMutex mu;
CEvent ev;
CSemaphore sem[3];

CSynObject* objects[6]={&cs, &mu, &ev,
&sem[0], &sem[1], &sem[2]};
CMultiLock mlock(objects,6);
int result=mlock.Lock(INFINITE, FALSE);

[/code]

Notice you can mix synchronization object types. The two parameters I specified (both optional) specify the time-out period and whether to wait for all the objects to unlock before continuing. I saved the return value of Lock() because that is the index into the array of objects of the one that unblocked, in case I want to do special processing.

Killing a Thread

Generally, murder is very messy. You have blood and guts everywhere that certainly don’t clean up after themselves. But sometimes, sadly, it is necessary (no one call the cops–my metaphor is about to end).

If you start a child thread, and for some reason it is just not exiting when you need it to, and you’ve fixed your code, double-checked all your signaling mechanisms, and then and only then you want to kill it, here’s how. When you create the thread, you need to get its handle and save it for later use in your
class.

[code lang=”cpp”]
HANDLE hThread;//handle to thread
[/code]

A handle is only valid while the thread is running. What if we create a thread, start it off running, and it exits immediately for some reason? Back in our main thread, even if the very next statement after creating the thread is to grab its handle, it could very possibly be too late.

So we create a thread suspended! We just don’t even let it get to first base before we allow ourselves to get to the handle. This is a piece of cake, simply change the last parameter we’ve been giving fxBeginThread() from 0 toCREATE_SUSPENDED:

[code lang=”cpp”]
CWinThread* pThread=AfxBeginThread(ThreadFunc,NULL, THREAD_PRIORITY_NORMAL, 0, CREATE_SUSPENDED);
::DuplicateHandle(GetCurrentProcess(), pThread->m_hThread, GetCurrentProcess(), &hThread, 0, FALSE, DUPLICATE_SAME_ACCESS);
pThread->;
ResumeThread();
[/code]

We start the thread suspended, use an API call to duplicate the thread’s handle, saving it to our class variable, and then resuming the thread.

Then, if we want to commit this heinous crime:

[code lang=”cpp”]::TerminateThread(hThread,0);
[/code]

Don’t say I didn’t warn you.

Thread-Safe Classes

Thread-safety refers to the possibility of calling member functions across thread-boundaries. Their are two types of safety: Class-level and Object-level. Class level means that I can create two CStringT objects called a and b, and access each of them in separate threads, but I cannot safely access just a in
two threads. Object safety means that it’s perfectly ok to access a in two or more threads simultaneously. Thread-safety at the object level generally means using synchronization objects to control access to all internal datamembers. So why not make all classes thread-safe at the object level? Because that would just about kill your performance. You can lock objects yourself outside of the actual object (as shown in Part II) to make it safe.

This is not to say that your program will always crash if you try to access a single object from two threads, but it most likely will. Also, you should not generally lock access to MFC member functions or public variables–you don’t know when the MFC framework is going to need access to them. There really isn’t need to lock on a CWnd* object anyway.

Etc

There are many, many details I have neglected to cover in these three tutorials. You can look in the SDK or .NET documentation for more information on such things as pausing/resuming, scheduling, masks in CMultiLock(), or any of the other member functions of the thread classes. If you want to learn about the internal details of Windows, threads and fibers, (plus a lot of other important subjects) check out Programming Applications for Microsoft Windows by Jeffrey Richter.

I have yet to cover so-called user-interface threads (internally, there is no difference–all threads are created equal). Perhaps in a future tutorial…

Threads are a very powerful tool, but they can quickly increase the complexity of your application by an order of magnitude. Use wisely. As always, it takes some experimentation to get the hang of how to go about it. So have fun!

©2004 Ben Watson


Check out my latest book, the essential, in-depth guide to performance for all .NET developers:

Writing High-Performance.NET Code, 2nd Edition by Ben Watson. Available for pre-order:

Threads in MFC II: Synchronization Objects

Introduction

In part I, I looked at getting threads communicating with each other. Now let’s look at how we can manage how multiple threads operate on single objects.

Let’s take an example. Suppose we have a global variable (or any variable that is accessible to two or more threads via scope, pointers, references, whatever). Let’s say this variable object is a CStringArray called stringArray
.

Now, let’s suppose our main thread wants to add something to the array. Fine enough. We can do that. Then, let’s throw in a second thread which can somehow access this object. It, too, wants to access stringArray . What would happen if both threads tried to simultaneously write to the first position in the array for example? Or even if one were just reading and the other writing? Well, if there is no synchronization between the two threads, you don’t know what would happen. The result is completely unpredicatable. One thread would write some bytes to memory, while another reads it, and you could have the correct answer or the wrong answer or a mix. Or it could crash. Who knows…

You can’t even assume safety when merely reading an object from two threads. Even if it seems like no bytes are changing, and both threads should get valid results, you have to think about a lower level: A single C++ statement compiles to many assembly or machine language instructions. These instructions directly access the processor, including the registers that keep track of where we are, what data we’re looking at. It’s possible to have one of those registers hold a pointer to the current character in the string, so if you have two threads that rely on that pointer in that register–they are obviously not both going to be correct except in a very rare circumstance.

OK, I think I’ve made the case. How do we control access to objects then?

Windows has a number of synchronization objects that you can use to effectively prevent accidents. MFC encapsulates these into CEvent , CCriticalSection, CMutex , and CSemaphore . To use these, include afxmt.hin your project.

CEvent

Let’s start with these so-called triggers. An event in this context is nothing more than a flag, a trigger. Imagine it as cocking a gun (Reset) and then firing it (Set). You can use events for setting of threads. Here’s how.
Remember how we created a structure that contained all the data we wanted to send the thread? Let’s add a new one. First create a CEvent object in the dialog (or any window or non-window) class called m_event . Now, in our [code lang=”cpp”]THREADINFOSTRUCT [/code], let’s add a pointer to an event:

[code lang=”cpp”]
typedef struct THREADINFOSTRUCT {

CEvent* pEvent;

} THREADINFOSTRUCT;
[/code]

When we initialize the structure, we must do the assignment:

[code lang=”cpp”]tis->pEvent=&m_event; [/code]

In our thread function, we call:

[code lang=”cpp”]tis->pEvent->Lock(); [/code]

This will “lock” on the event (the same event that is in our dialog class in the main thread). The thread will effectively stop. It will loop inside of CEvent::Lock() until that event is “Set.” Where do you set it? In the main thread. An event is initially reset–cocked. Create the thread. When you want the thread to unblock itself and continue, you call m_event.Set()–fire the gun.

So what are some practical examples? You could lock a thread before you access a global object. In your main thread, when you’re done using that object, you call Set(). You can also use an event to signal a thread to exit (such as if
you hit an abort button in the main thread). To see an example of this usage, look at the demo project I’ve uploaded to the code tool section.

There are two types of events: ones that automatically reset when you set them, and ones that don’t.

You can use a single event to trigger multiple threads, but the event had better be a manual-reset event or only one thread will be triggered at a time.

CCriticalSection

These are pretty simple to use. You simply surround every usage of the shared object by a lock and an unlock command:

[code lang=”cpp”]CCriticalSection cs;

cs.Lock();
stringArray.DoSomething();
cs.Unlock();
[/code]
Do that in every thread that uses that object. You must use the same critical section variable to lock the same object. If a thread tries to lock an object that’s already locked, it will just sit there waiting for it to unlock so it can safely access the object.
CMutex

A mutex works just like a critical section, but it can also work across different processes. But you don’t want to always use mutexes, because they are slower than critical sections.
You can declare a mutex like this:

[code lang=”cpp”]CMutex m_mutex(FALSE, “MyMutex”); [/code]
The first parameter specifies whether or not the mutex is initially locked or not. The second parameter is the identifier of the mutex so it can be accessed from two different processes.

If you lock a critical section in a thread and then the thread exits without unlocking it, then any other thread waiting on it will be forever blocked. Mutexes, however, will unlock automatically if the thread exits. Mutexes can
also have a time-out value (critical sections can too, but there are some doubts as to whether or not they work–perhaps the bugs are fixed in MFC 7.0).

Otherwise, it works the same:

[code lang=”cpp”]
m_muytex.Lock(60000);//time out in milliseconds
stringArray.DoSomething();
m_mutex.Unlock();
[/code]

CSemaphore

A semaphore is used to limit simultaneous access of a resource to a certain number of threads. Most commonly, this resource is a pool of a certain number of limited resources. If we had ten string arrays, we could set up a semaphore to guard them and let only ten threads at a time access them. Or COM ports, internet connections, or anything else.

It’s declared like this:

[code lang=”cpp”]CSemaphore m_semaphore(10,10); [/code]

The first argument is the initial reference count, while the second is the maximum reference count. Each time we lock the semaphore, it will decrement the reference count by 1, until it reaches zero. If another thread tries to lock the semaphore, then it will just go into a holding pattern until a thread unlocks it.

As with a mutex, you can pass it a time-out value.

It’s used with the same syntax:

[code lang=”cpp”]
m_semaphore.Lock(60000);
stringArray.DoSomething();
m_semaphore.Unlock();
[/code]

Conclusion

These two tutorials, along with the sample projects, should be enough to get you started using threads. There are a couple of other MFC objects and issues that I have yet to cover, so I’ll group all of these into Part III of this
tutorial. These topics include exception-handling and thread-safe classes. Make sure that you examine the documentation of all of these classes: there is more functionality than I could cover in this short tutorial. And if you really want to learn threads, get a good book that covers the Windows kernel (one called Programming Applications for MS Windows comes to mind, published by Microsoft).

The sample project for this tutorial has a time object that it shares between two threads. It’s protected by a critical section. There are also two events: for starting the thread and aborting it. The main thread uses a timer to add the current time to a list box every second, while the thread traces the current time to the debug window and sends a message to the main thread to remove the first time from the list. The thread is only started after the time
hits an even ten-second boundary.

©2004 Ben Watson


Check out my latest book, the essential, in-depth guide to performance for all .NET developers:

Writing High-Performance.NET Code, 2nd Edition by Ben Watson. Available for pre-order:

Threads in MFC I: Worker Threads

There are two types of threads in MFC. Worker and User Interface. Here, I will discuss how to use a worker thread.First, let’s discuss some multi-threading basics. Each application has what we call a process. Usually, an application has only one process. This process defines all the code and memory space for the application. You can use the Window Task Mananger to view running processes.

You could possibly view a thread as a process within a process. It is an independently (mostly) running sub-process, that the CPU can task and switch to like any other process on the machine.

Under 16-bit Windows, you could have mulitple processes (i.e., many programs running: multi-tasking). However, each application was limited to its one main process. It was multi-tasked, not multi-threaded. With 32-bit Windows, applications could spawn their own threads or sub-processes.
 

Threads have priority levels. The explanation of exactly how Windows manages these in determining how much processor time each receives is a topic you can find in the MSDN literature. Basically, higher priorities receive more time.

When your Win32 program creates a thread, you specify its priority level. By default, it has the same priority as the calling thread.

There are two main issues you must deal with when using threads: 1) Inter-thread communication, and 2) inter-thread object access.

I’ll leave object access for part II of this tutorial.

Communication

The easiest way to communicate among threads in your application is with messages. Since this tutorial deals with worker threads, we’ll restrict this to having the worker thread post messages to the main application thread.
ThreadFunc()

So, now let’s walk through creating a simple worker thread that does nothing but update the progress control in a dialog box.

I’m going to assume you know how to create a dialog box, with a progress control, bound to a member variable in the dialog class. Do that now. You could also create a button that starts the thread.

OK, the first thing you need to do is create the thread’s controlling function. This can either be global or a class member, but I prefer to make it global because this separates the thread from the main process in my mind.

[code lang=”cpp”]UINT MyThreadFunc(LPVOID lParam); [/code]

All MFC thread “controllers” must be declared like that.

To call this function in a thread, we use the following code:

[code lang=”cpp”]
CWinThread *pThread = AfxBeginThread(MyThreadFunc, NULL, THREAD_PRIORITY_NORMAL, 0, 0); [/code]

This creates a separate thread using the MyThreadFunc function, passes a NULL for its one parameter, sets the priority to normal, gives it the same stack size as the calling thread, and starts the thread immediately. If the last parameter here is CREATE_SUSPENDED instead of 0, then the thread is created, but it does not start running until you explicitly tell it to.

This will successfully create a thread, which will run until MyThreadFunc returns. However, the calling thread will not know when that thread is done. Somehow, we have to pass the thread some information about the calling program.
Passing Information To a Thread
A thread controller can only have one parameter–the LPVOID argument. Therefore, it is often convenient to wrap up all the information we want to send the thread into a single struct:
[code lang=”cpp”]
typedef struct THREADINFOSTRUCT {
HWND hWnd;
CString someData;
} THREADINFOSTRUCT; [/code]
We can put any data we want in that structure, but one that should always be in there is a handle to the thread’s parent window. This will allow us to communicate with it.

Now, before we start the thread, let’s allocate some space for this structure. If we merely declare it on the stack with

[code lang=”cpp”]THREADINFOSTRUCT tis; [/code]

then as soon as this data goes out of scope, it will be destroyed. So let’s put it on the heap:

[code lang=”cpp”]THREADINFOSTRUCT *tis=new THREADINFOSTRUCT;
tis->hWnd=m_hWnd;
tis->someData=”This is in a thread.”; [/code]

And now we call the same function as before, passing tis:

[code lang=”cpp”]CWinThread *pThread = AfxBeginThread(MyThreadFunc,tis,
THREAD_PRIORITY_NORMAL,0,0); [/code]

OK, now we can pass some information to the thread. How do we let the thread tell the main process what’s going on?

Communicating with Threads
We can communicate to the calling window via a windows messages. First we have to define our own custom messages in our dialog class’s header file:

[code lang=”cpp”]#define WM_USER_THREAD_FINISHED (WM_USER+0x101)
#define WM_USER_THREAD_UPDATE_PROGRESS (WM_USER+0x102) [/code]
We also have to provide handlers for these messages in our dialog class:

[code lang=”cpp”]
afx_msg LRESULT OnThreadFinished(WPARAM wParam, LPARAM lParam);
afx_msg LRESULT OnThreadUpdateProgress(WPARAM wParam, LPARAM lParam); [/code]

All custom message handlers must follow that generic template. But we can interpret the parameters any way we want.

We must also manually update the message map with these two lines:

[code lang=”cpp”]
ON_MESSAGE(WM_USER_THREAD_FINISHED, OnThreadFinished)
ON_MESSAGE(WM_USER_THREAD_UPDATE_PROGRESS, OnThreadUpdateProgress) [/code]

And now we add the function definitions somewhere in our source file:

[code lang=”cpp”]
LRESULT CMyDialog::OnThreadFinished(WPARAM wParam, LPARAM lParam)
{
AfxMessageBox(“Thread has exited”);
return 0;
}
LRESULT CMyDialog::OnThreadUpdateProgress(WPARAM wParam, LPARAM lParam)
{
m_progress.SetPos(100*(int)wParam/(int)lParam);
return 0;
}
[/code]
So what should our thread do? In this example, not much:

[code lang=”cpp”]
UINT MyThreadFunc(LPVOID lParam)
{
THREADINFOSTRUCT* tis=(THREADINFOSTRUCT*)lParam;
for (int i=0;i<100;i++) {
PostMessage(tis->hWnd,WM_USER_THREAD_UPDATE_PROGRESS,i,100);
Sleep(100);
}
PostMessage(tis->hWnd,WM_USER_THREAD_FINISHED,0,0);
delete tis;
return 0;
} [/code]

Let’s analyze this. First we typecast the function’s argument into the structure type we passed. Then we just run through a simple loop that sends a message to the main thread to update the progress bar. We sleep for 100 ms just so it doesn’t go too fast that we don’t see it.

Next we send a message saying that our thread is finished.

Finally we delete the pointer to tis; Wait a second! Didn’t we define that in the main thread??? Yes, and it’s perfectly fine to allocate memory in one thread and free it in another. As long as we the programmer keep track of where things are happening. Alternatively, we could have set a class variable to hold that structure, and delete it in the [code lang=”cpp”]OnThreadFinished[/code] functioned. Either way is acceptable.

The function then returns, and the thread ends.

That’s all! It’s so easy! To see a working example project, look in the code tools section.

Of course, we can easily make it more complicated. Part II will talk about some synchronization methods used to control simultaneous access to objects from multiple threads. Now things can start becoming fun…
©2004 Ben Watson


Check out my latest book, the essential, in-depth guide to performance for all .NET developers:

Writing High-Performance.NET Code, 2nd Edition by Ben Watson. Available for pre-order: