Tag Archives: performance

Announcing Writing High-Performance .NET Code, 2nd Edition!

The new book will be available for sale on or around May 1, 2018. You can pre-order right now for Kindle in many Amazon markets, and for print as well in the US. 

The book is out! Get it in the US, or go here for more markets and options.

The new book is a mix of greatly expanded coverage of old topics, as well as new material throughout. The fundamentals of .NET and how the GC and JIT work have not changed, not even in the face of .NET Core, but there have been many improvements, new APIs, and new ways of doing things that necessitated some updating and coverage of new topics.

It was also a great opportunity to incorporate a lot of the feedback I’ve received on the first edition.

Here’s a visualization of the book’s structure. I wrote the book in Scrivener and marked each section with a label depending on whether it was unchanged (green), modified (yellow), or completely new (purple).

This is a good overview, but it doesn’t really convey what exactly is “new”–the newness is spread throughout the book in many forms. Here are some highlights:

Meta-content:

  • 50% increase in content overall (from 68K words to over 100K words, or over 200 new pages).
  • Fixed all known errata (quite a bit more than on that list, honestly).
  • Incorporated feedback from hundreds of readers.
  • A new foreword by Vance Morrison, .NET Performance Architect at Microsoft.
  • Better-looking graphics.
  • Completely new typesetting system to give a more professional look. (You’ll only notice this in print and PDF formats.)
  • Automated build process that can produce new revisions of all formats much more easily than before, allowing me to publish updates as needed.
  • Cut the Windows Phone chapter. (No one will miss this, right?)
  • Updated cover.

Content:

  • List of CLR performance improvements over time.
  • Significant increase in examples of using Visual Studio to diagnose CPU usage, heap state, memory allocations, and more, throughout the entire book.
  • Introduced a new tool for analyzing .NET processes: Microsoft.Diagnostics.Runtime (“CLR MD”). It’s kind of like having programmatic access to a debugger, and it’s very powerful. I have many examples of its usage throughout the book, tied to specific diagnostic scenarios.
  • Benchmarking! I include actual benchmark results in many sections. Many code samples use a benchmarking library to show the differences in various approaches.
  • More on garbage collection, including advanced configuration options; reworked and more detailed explanations of the GC process including less-known information; discussion of recent libraries to help with pooling; examples of using weak references; object resurrection; stackalloc; more on finalization; more ways to diagnose problems; and much, much more. This was already the biggest chapter, and it got bigger.
  • More details about JIT, how to do custom warmup, how to figure out which code is jitted, and many more improvements.
  • Greatly expanded coverage of TPL, including the TPL Dataflow library, how to use tasks effectively, avoiding lock convoys, and many other smaller improvements to existing topics.
  • More guidance on struct and class design, such as immutability and thread safety; when to use tuples and the new ValueTuple type.
  • Ref returns and ref locals, and when to use them.
  • Much more content on collections, including effectively taking advantage of initial capacity; sorting; key comparisons; jagged vs. multi-dimensional arrays; analyzing the performance of various collection types; and more.
  • Discussion of SIMD commands, including examples and benchmarks.
  • A detailed analysis of LINQ and why it’s more expensive than you think.
  • More ways to consume and process ETW events.
  • A detailed example of building a Roslyn-style Code Analyzer and automatic fixer (FxCop replacement).
  • A new appendix with high-level tips for ADO.NET, ASP.NET, and WPF.
  • And a lot more…

See the book’s web-site for up-to-date details on where to buy, and other news.

Digging Into .NET Loop Performance, Bounds-checking, Iteration, and Unrolling

Introduction

In this article I am going to go into detail about how .NET treats loops involving array and collection access and what kinds of optimizations you can expect. I had mentioned some of these in a previous article, but in this article, we’re going to go deep and see many more examples. We’re going to dig into the IL that gets generated as well as the x64 assembly code that actually executes.

Will these optimizations matter to your program? Only if your program is CPU bound and collection iteration is a core part of your processing. As you’ll see, there are ways you can hurt your performance if you’re not careful, but that only matters if it was a significant part of your program to begin with. My philosophy is this: most people never need to know this, but if you do, then understanding every layer of the system is important so that you can make intelligent choices. That’s the reason I wrote my book, Writing High-Performance .NET Code, and why I write these articles—to give you the information about what’s actually going on when you compile and execute a .NET program.

Some of this information you can read elsewhere, but I want to show you how to get at this yourself so that you don’t have to rely on someone publicly coming out with this information, especially as new versions of the JIT are released.

Tools I used: ILSpy to view IL, Visual Studio 2013 to debug and view assembly with .NET 4.5. I compiled the code in Release mode, targeting x64 explicitly. When debugging, don’t launch the program under the Visual Studio debugger–that will not give you an accurate picture of your code. Instead, start the program separately and attach the debugger. I accomplish this easily by putting a call to Debugger.Launch() in the beginning of Main(), and then executing the program with Ctrl-F5.
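
For example, the top of Main might look like this (a minimal sketch of the attach pattern described above, not one of the measured samples):

using System.Diagnostics;

class Program
{
    static void Main(string[] args)
    {
        // Prompt to attach a debugger before any interesting code runs.
        // Launch the program with Ctrl+F5 so Visual Studio doesn't start it under
        // the debugger, then pick a debugger instance from the prompt.
        Debugger.Launch();

        // ...call the methods you want to analyze here...
    }
}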

A note on assembly: I chose x64 because it should be the default for development these days. For the register names mentioned below, understand that names like eax and rax refer to the same physical register–eax is just the lower 32 bits of the 64-bit rax. The same goes for edx/rdx, ecx/rcx, etc.

Bounds Checks

Bounds checks exist in .NET because the CLR must ensure memory safety at all times in managed code. It simply needs to be impossible for managed code to corrupt memory outside of the defined data structures. This means that code like the following cannot be allowed to execute:

int[] array = new int[100];
array[100] = 1;

Attempting to execute this should result in an exception rather than allow the memory corruption to occur. That means that the CLR has to inject bounds checking into array access. This can make repeated array access twice as slow if some optimizations aren’t done. This knowledge makes some people leery of using .NET for high-performance code, but we’ll see that the JIT compiler does indeed make these optimizations in some cases (and falls short in others). However, remember what I said in the introduction: these kinds of optimizations only matter if you’re CPU bound and this code is in your innermost, most called-upon code. That said, even if you don’t have to worry about performance on this level, knowing how to investigate these types of details is instructive.

Baseline Loop

To start, let’s take a look at the most basic of loops:

[MethodImpl(MethodImplOptions.NoInlining)]
private static long GetSum(int[] array)
{ 
    long sum = 0; 
    for (int i = 0; i < array.Length; i++)
    { 
        sum += array[i]; 
    } 
    return sum; 
}

It simply loops through the entire array and sums all the elements. I’m explicitly disabling inlining because that can muddy the analysis a bit, and we want to focus strictly on loop optimization, not other improvements that the JIT compiler can and will do here.

In IL, this method looks like this (with my own annotations):

.method private hidebysig static 
	int64 GetSum (
		int32[] 'array'
	) cil managed noinlining
{
	// Method begins at RVA 0x20a4
	// Code size 26 (0x1a)
	.maxstack 3
	.locals init (
		[0] int64 sum,
		[1] int32 i
	)

	IL_0000: ldc.i4.0    // Push 0 onto evaluation stack as a 4-byte integer
	IL_0001: stloc.0     // Pop from evaluation stack and store into local variable at location 0, sum.
	IL_0002: ldc.i4.0    // Push 0 to evaluation stack
	IL_0003: stloc.1     // Pop and store in local variable i.
	IL_0004: br.s IL_0010     //  Jump to IL_0010, the loop condition check (unconditionally)
	// loop start (head: IL_0010)
		IL_0006: ldloc.0    // Push sum onto stack
		IL_0007: ldarg.0    // Push array onto stack
		IL_0008: ldloc.1    // Push i onto stack
		IL_0009: ldelem.i4  // Pop array and i off the stack and push array[i] onto the stack
		IL_000a: add        // Pop array[i] and sum off the stack, add, and push result to stack
		IL_000b: stloc.0    // Pop result off stack and store in sum
		IL_000c: ldloc.1    // Push i onto stack
		IL_000d: ldc.i4.1   // Push 1 onto stack
		IL_000e: add        // Pop i and 1 off of stack, add, and push result to stack
		IL_000f: stloc.1    // Pop result off stack and store in i

		IL_0010: ldloc.1    // Push i onto stack
		IL_0011: ldarg.0    // Push array onto stack
		IL_0012: ldlen      // Pop array off of stack, get length, and push length onto stack
		IL_0013: conv.i4    // Convert length to 4-byte integer
		IL_0014: blt.s IL_0006    // Pop i and length off of stack. If i < length, jump to IL_0006
	// end loop

	IL_0016: ldloc.0    // Push sum onto stack
	IL_0017: ret        // Return
} // end of method Program::GetSum

That looks like an awful lot of work to execute a loop. IL is a simple, stack-based language that ends up being quite verbose, but this gets translated by the JIT compiler into real machine code that takes advantage of efficient CPU instructions and registers. The resulting assembly is this:

00007FFADFCC02D0  mov         rdx,rcx                 // Copy array address to rdx
00007FFADFCC02D3  xor         ecx,ecx                 // Set ecx (sum) to 0
00007FFADFCC02D5  mov         r10,qword ptr [rdx+8]   // Copy array length to r10
00007FFADFCC02D9  test        r10d,r10d               // See if length is 0
00007FFADFCC02DC  jle         00007FFADFCC0303        // If so, skip the loop
00007FFADFCC02DE  mov         r8d,ecx                 // Copy 0 to r8 (source of ecx is not relevant)
00007FFADFCC02E1  xor         r9d,r9d                 // Set r9 to 0 (memory offset into array)
00007FFADFCC02E4  nop         word ptr [rax+rax]      // NOP
00007FFADFCC02F0  mov         eax,dword ptr [rdx+r9+10h]  // Copy array[i] to eax
00007FFADFCC02F5  add         ecx,eax                 // Add value to sum
00007FFADFCC02F7  inc         r8d                     // i++
00007FFADFCC02FA  add         r9,4                    // Increment memory address by 4 bytes
00007FFADFCC02FE  cmp         r8d,r10d                // i < length ?
00007FFADFCC0301  jl          00007FFADFCC02F0        // If so, loop back a few statements for another iteration
00007FFADFCC0303  mov         eax,ecx                 // Copy sum to return register
00007FFADFCC0305  ret                                 // Return

There are a few things to notice here:

  1. The code is tracking both the index value i as well as the direct memory address of the array values, which is far more efficient than doing a multiplication and offset on every loop iteration.
  2. There are no array bounds checks! This is sometimes a worry for people new to .NET—how can we maintain memory safety without bounds checks? In this case, the JIT compiler determined that the checks weren’t necessary because it knew all of the constraints of this loop.

Bringing Back the Bounds Check

Now let’s introduce some variability. If instead, we pass in some boundaries to iterate through, does this change the situation?

[MethodImpl(MethodImplOptions.NoInlining)]
private static int GetSum(int[] array, int start, int end)
{
    int sum = 0;
    for (int i = start; i < end; i++)
    {
        sum += array[i];
    }
    return sum;
}
.
.
.
// Called with:
GetSum(array, 25, 75);

The IL differences aren’t very interesting so I won’t show them, but the machine instructions definitely change:

00007FFADFCC0320  sub         rsp,28h               // Adjust stack pointer
00007FFADFCC0324  mov         r9,rcx                // Copy array address to r9
00007FFADFCC0327  xor         ecx,ecx               // Set ecx (sum) to 0
00007FFADFCC0329  cmp         edx,r8d               // edx has i in it, initialized with start (25). Compare that to r8, which has the end value (75)
00007FFADFCC032C  jge         00007FFADFCC0348      // If start >= end, skip loop.
00007FFADFCC032E  mov         r10,qword ptr [r9+8]  // Copy array length to r10
00007FFADFCC0332  movsxd      rax,edx               // Copy start to rax
00007FFADFCC0335  cmp         rax,r10               // Compare i to array length
00007FFADFCC0338  jae         00007FFADFCC0350      // If i >= length, jump to this address (below)
00007FFADFCC033A  mov         eax,dword ptr [r9+rax*4+10h]  // Copy array[i] to eax
00007FFADFCC033F  add         ecx,eax               // sum += array[i]
00007FFADFCC0341  inc         edx                   // i++
00007FFADFCC0343  cmp         edx,r8d               // Compare i to end value
00007FFADFCC0346  jl          00007FFADFCC0332      // If i < end, jump to beginning of loop (above)
00007FFADFCC0348  mov         eax,ecx               // Copy sum to return register
00007FFADFCC034A  add         rsp,28h               // Fix up stack pointer
00007FFADFCC034E  ret                               // Return

00007FFADFCC0350  call        00007FFB3F7E6590      // Throw an exception!

The most significant difference is that now on every single iteration of the loop, it’s comparing the array index to the length of the array, and if it goes over we’ll get an exception. With this change, we’ve doubled the running time of the loop. The bounds check is there.
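
If you want to measure this yourself, a crude Stopwatch harness along these lines is enough to show the gap between the two overloads (my sketch, not from the original measurements; run it in Release mode without a debugger attached, and expect the exact ratio to vary by machine and JIT version):

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

class BoundsCheckTiming
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static long GetSum(int[] array)
    {
        long sum = 0;
        for (int i = 0; i < array.Length; i++)
        {
            sum += array[i];
        }
        return sum;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int GetSum(int[] array, int start, int end)
    {
        int sum = 0;
        for (int i = start; i < end; i++)
        {
            sum += array[i];
        }
        return sum;
    }

    static void Main()
    {
        int[] array = new int[10000];
        const int Iterations = 100000;

        // Warm up both methods so JIT time isn't part of the measurement.
        GetSum(array);
        GetSum(array, 0, array.Length);

        var sw = Stopwatch.StartNew();
        for (int n = 0; n < Iterations; n++) GetSum(array);
        Console.WriteLine("Length-based loop (no bounds check): {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        for (int n = 0; n < Iterations; n++) GetSum(array, 0, array.Length);
        Console.WriteLine("Start/end loop (bounds check):       {0} ms", sw.ElapsedMilliseconds);
    }
}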

So why does it do this now? Is it because we’re doing a subarray? That’s easy to test. You can modify the original function at the top of this article to iterate from 25 to 75 and you’ll see that the compiler does not insert a boundary check.
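
That modification would look something like this (my sketch of the change being described):

[MethodImpl(MethodImplOptions.NoInlining)]
private static long GetSum(int[] array)
{
    long sum = 0;
    // Constant bounds instead of start/end arguments.
    for (int i = 25; i < 75; i++)
    {
        sum += array[i];
    }
    return sum;
}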

No, the reason for the boundary check in this case is because the compiler can’t prove that the input values are 100% safe.

Ok, then. What about checking the start and end variables before the loop? I suspect that quickly turns into a general program-analysis problem. As soon as the constraints are not constant, you have to deal with variable aliasing, side effects, and tracking the usage of each value—and that’s not really practical for a JIT compiler. Sure, in my simple method, maybe it would be easy, but the general case might be extremely difficult if not impossible.

Other Ways of Getting Bounds Checks

Even simple modifications to the indexing variable that you can logically prove cannot exceed the array boundaries may cause the bounds check to appear:

[MethodImpl(MethodImplOptions.NoInlining)]
private static void Manipulate(int[] array)
{
    for (int i = 0; i < array.Length; i++)
    {
        array[i] = array[i / 2];
    }
}

There is no way i / 2 can exceed array.Length, but the bounds check is still there:

00007FFADFCE03B0  sub         rsp,28h              // Adjust stack pointer  
00007FFADFCE03B4  mov         r8,qword ptr [rcx+8] // Copy array length to r8
00007FFADFCE03B8  test        r8d,r8d              // Test if length is 0
00007FFADFCE03BB  jle         00007FFADFCE03E8     // If 0, skip loop
00007FFADFCE03BD  xor         r9d,r9d              // Set i to 0
00007FFADFCE03C0  xor         r10d,r10d            // Set destination index to 0
00007FFADFCE03C3  mov         eax,r9d              // Set eax (source array index) to 0
00007FFADFCE03C6  cdq                              // sign extends eax into edx (needed for division)
00007FFADFCE03C7  sub         eax,edx              // eax = eax - edx 
00007FFADFCE03C9  sar         eax,1                // Shift right (divide by two)
00007FFADFCE03CB  movsxd      rax,eax              // copy 32-bit eax into 64-bit rax and sign extend
00007FFADFCE03CE  cmp         rax,r8               // Compare source index against array length
00007FFADFCE03D1  jae         00007FFADFCE03F0     // If greater than or equal, jump and throw exception
00007FFADFCE03D3  mov         eax,dword ptr [rcx+rax*4+10h]  // Copy value from array[i/2] to eax
00007FFADFCE03D7  mov         dword ptr [rcx+r10+10h],eax    // Copy value from eax to array[i]
            
00007FFADFCE03DC  inc         r9d                  // Increment i
00007FFADFCE03DF  add         r10,4                // Increment destination address by 4 bytes
00007FFADFCE03E3  cmp         r9d,r8d              // Compare i against array length
00007FFADFCE03E6  jl          00007FFADFCE03C3     // If less, loop again
00007FFADFCE03E8  add         rsp,28h              // Adjust stack
00007FFADFCE03EC  ret                              // return

00007FFADFCE03F0  call        00007FFB3F7E6590     // Throw exception

So even though we can prove that it’s safe, the compiler can’t, and the check stays.

There’s another way that’s a bit more annoying. Remember the very first, baseline example, the one that did NOT have a bounds check. What if we had that exact code, but it operated on a static array field rather than an argument?
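
The setup I’m describing would look roughly like this (a sketch; the field and method names are mine):

private static int[] s_array = new int[100];

[MethodImpl(MethodImplOptions.NoInlining)]
private static int GetSumStatic()
{
    int sum = 0;
    for (int i = 0; i < s_array.Length; i++)
    {
        sum += s_array[i];
    }
    return sum;
}

The loop body of that method then compiles to this: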

00007FFADFCE0500  mov         rax,0A42A235780h  
00007FFADFCE050A  mov         rax,qword ptr [rax]  
00007FFADFCE050D  movsxd      r9,ecx  
00007FFADFCE0510  mov         r8,qword ptr [rax+8]  
00007FFADFCE0514  cmp         r9,r8                 // bounds check!
00007FFADFCE0517  jae         00007FFADFCE0530  
00007FFADFCE0519  mov         eax,dword ptr [rax+r9*4+10h]  
00007FFADFCE051E  add         edx,eax  
00007FFADFCE0520  inc         ecx  
00007FFADFCE0522  cmp         ecx,r8d  
00007FFADFCE0525  jl          00007FFADFCE0500  
00007FFADFCE0527  mov         eax,edx  
00007FFADFCE0529  add         rsp,28h  
00007FFADFCE052D  ret  
                            
00007FFADFCE0530  call        00007FFB3F7E6590  

The bounds check is there.

Ok, so what if we’re clever and we pass the static array to our baseline method, which doesn’t know it’s static? There is no bounds check in this case–it’s exactly the same as the baseline case.

Another case where the JIT fails to optimize where it could is iterating an array backwards:

[MethodImpl(MethodImplOptions.NoInlining)]
private static int GetBackwardsSum(int[] array)
{
    int sum = 0;
    for (int i = array.Length - 1; i >= 0; i--)
    {
        sum += array[i];
    }
    return sum;
}

Sad, but true–as much as this looks like the baseline example, it does not have the optimization.

Finally, consider the case of repeated access:

for (int i = 2; i < array.Length - 1; i++)
{
    int x = array[i-2];
    int y = array[i-2];
}

In this case, the compiler will emit a single bounds check that suffices for both accesses.

Using fixed to Eliminate Bounds Checks

Another way to eliminate bounds checks is to use the fixed keyword. I don’t really recommend it because it has safety and security implications that you need to weigh carefully. However, if you really have a need for optimal array access with some nontrivial access pattern that would otherwise trigger a bounds check, this may work for you. In order to use fixed, you must enable unsafe code for your project.

Here are two options for using fixed to iterate through an array:

[MethodImpl(MethodImplOptions.NoInlining)]
unsafe private static int GetBackwardsSumFixedSubscript(int[] array)
{
    int sum = 0;
    fixed (int* p = array)
    {
        for (int i = array.Length - 1; i >= 0; i--)
        {
            sum += p[i];
        }
    }
    return sum;
}

[MethodImpl(MethodImplOptions.NoInlining)]
unsafe private static int GetBackwardsSumFixedPointer(int[] array)
{
    int sum = 0;
    fixed (int* p = &array[array.Length - 1])
    {
        int* v = p;
        for (int i = array.Length - 1; i >= 0; i--)
        {
            sum += *v;
            v--;
        }
    }
    return sum;
}

Either one will work, but keep in mind that unsafe really means unsafe! You can trample your memory in this method. I recommend segregating all unsafe code into its own module, class, or otherwise limited scope, and giving it extra scrutiny during code reviews. You may also have to deal with organizational security requirements around deploying unsafe code. If you can eliminate the bounds checking with pure managed code, strongly prefer that approach instead.

Foreach vs. For

Now, let’s shift our focus up one level of abstraction. In .NET, you can iterate over most collections with the foreach keyword. foreach is in most cases syntactic sugar for calling methods like GetEnumerator(), MoveNext(), and so forth. In this respect, it would be fair to assume that foreach has worse performance overall than a simple for or while loop.
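
Conceptually, the compiler expands a foreach over a general collection into something like this (a simplified sketch; the real expansion pattern-matches the collection’s enumerator type and handles disposal more precisely, and for List<T> it uses the struct List<T>.Enumerator directly, as the IL later in this article shows):

// using System.Collections.Generic;
private static int GetSumEnumerator(IEnumerable<int> collection)
{
    // Roughly what "foreach (var val in collection) { sum += val; }" becomes.
    int sum = 0;
    using (IEnumerator<int> e = collection.GetEnumerator())
    {
        while (e.MoveNext())
        {
            int val = e.Current;
            sum += val;
        }
    }
    return sum;
}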

However, that’s not quite the case. In some instances, the compiler will recognize that a foreach over an array is really just a simple for loop, regardless of the syntax used.

Foreach on Arrays

If we take our baseline loop from above and convert it to foreach, like this:

private static int GetSumForeach(int[] array)
{
    int sum = 0;
    foreach(var val in array)
    {
        sum += val;
    }
    return sum;
}

What is the generated IL in this case?

.method private hidebysig static 
	int32 GetSumForeach (
		int32[] 'array'
	) cil managed 
{
	// Method begins at RVA 0x2098
	// Code size 28 (0x1c)
	.maxstack 2
	.locals init (
		[0] int32 sum,
		[1] int32 val,
		[2] int32[] CS$6$0000,
		[3] int32 CS$7$0001
	)

	IL_0000: ldc.i4.0
	IL_0001: stloc.0
	IL_0002: ldarg.0
	IL_0003: stloc.2
	IL_0004: ldc.i4.0
	IL_0005: stloc.3
	IL_0006: br.s IL_0014
	// loop start (head: IL_0014)
		IL_0008: ldloc.2
		IL_0009: ldloc.3
		IL_000a: ldelem.i4
		IL_000b: stloc.1
		IL_000c: ldloc.0
		IL_000d: ldloc.1
		IL_000e: add
		IL_000f: stloc.0
		IL_0010: ldloc.3
		IL_0011: ldc.i4.1
		IL_0012: add
		IL_0013: stloc.3

		IL_0014: ldloc.3
		IL_0015: ldloc.2
		IL_0016: ldlen
		IL_0017: conv.i4
		IL_0018: blt.s IL_0008
	// end loop

	IL_001a: ldloc.0
	IL_001b: ret
} // end of method Program::GetSumForeach

Compare that to the IL from the baseline loop. The only difference is the presence of some additional, compiler-generated local variables to store the array and index. Otherwise, it looks exactly like the baseline for loop. And, in fact, the assembly code also looks exactly the same (including elimination of the bounds check):

00007FFADFCB01F0  mov         rdx,rcx  
00007FFADFCB01F3  xor         ecx,ecx  
00007FFADFCB01F5  mov         r10,qword ptr [rdx+8]  
00007FFADFCB01F9  test        r10d,r10d  
00007FFADFCB01FC  jle         00007FFADFCB0223  
00007FFADFCB01FE  mov         r8d,ecx  
00007FFADFCB0201  xor         r9d,r9d  
00007FFADFCB0204  nop         word ptr [rax+rax]  
00007FFADFCB0210  mov         eax,dword ptr [rdx+r9+10h]  
00007FFADFCB0215  add         ecx,eax  
00007FFADFCB0217  inc         r8d  
00007FFADFCB021A  add         r9,4  
00007FFADFCB021E  cmp         r8d,r10d  
00007FFADFCB0221  jl          00007FFADFCB0210  
00007FFADFCB0223  mov         eax,ecx  
00007FFADFCB0225  ret  

For loop on List<T>

Before seeing foreach operate on List<T>, we should see what for looks like:

private static int GetSumFor(List<int> array)
{
    int sum = 0;
    for (int i = 0; i < array.Count; i++)
    {
        sum += array[i];
    }
    return sum;
}

And the corresponding IL:

.method private hidebysig static 
	int32 GetSumFor (
		class [mscorlib]System.Collections.Generic.List`1 'array'
	) cil managed 
{
	// Method begins at RVA 0x20a0
	// Code size 31 (0x1f)
	.maxstack 3
	.locals init (
		[0] int32 sum,
		[1] int32 i
	)

	IL_0000: ldc.i4.0
	IL_0001: stloc.0
	IL_0002: ldc.i4.0
	IL_0003: stloc.1
	IL_0004: br.s IL_0014
	// loop start (head: IL_0014)
		IL_0006: ldloc.0
		IL_0007: ldarg.0
		IL_0008: ldloc.1
		IL_0009: callvirt instance !0 class [mscorlib]System.Collections.Generic.List`1::get_Item(int32)
		IL_000e: add
		IL_000f: stloc.0
		IL_0010: ldloc.1
		IL_0011: ldc.i4.1
		IL_0012: add
		IL_0013: stloc.1

		IL_0014: ldloc.1
		IL_0015: ldarg.0
		IL_0016: callvirt instance int32 class [mscorlib]System.Collections.Generic.List`1::get_Count()
		IL_001b: blt.s IL_0006
	// end loop

	IL_001d: ldloc.0
	IL_001e: ret
} // end of method Program::GetSumFor

It looks pretty much like the for loop on an array, but with some method calls instead of direct memory accesses to retrieve the values and the length (Count).

However, the assembly code generated from this is more interesting and shows some subtle and important differences:

00007FFADFCD0BE0  push        rbx  	// Stack frame setup
00007FFADFCD0BE1  push        rsi  
00007FFADFCD0BE2  push        rdi  
00007FFADFCD0BE3  sub         rsp,20h  
00007FFADFCD0BE7  mov         rdx,rcx    // Copy array address to rdx
00007FFADFCD0BEA  xor         ecx,ecx    // Set i to 0. i is used only for the loop condition.
00007FFADFCD0BEC  xor         r11d,r11d  // Set j to 0. j is used as the iterator on the internal array inside List
00007FFADFCD0BEF  mov         r8d,ecx    // Set sum to 0
00007FFADFCD0BF2  mov         ebx,dword ptr [rdx+18h]    // Copy length to ebx
// LOOP START
00007FFADFCD0BF5  cmp         ecx,ebx                    // Compare i to length
00007FFADFCD0BF7  jl          00007FFADFCD0C00           // if less, jump below
00007FFADFCD0BF9  mov         eax,r8d                    // else copy sum to return value
00007FFADFCD0BFC  jmp         00007FFADFCD0C26           // jump to end of function
00007FFADFCD0BFE  xchg        ax,ax                      // No-op

00007FFADFCD0C00  mov         eax,dword ptr [rdx+18h]    // Copy list's length to eax
00007FFADFCD0C03  cmp         ecx,eax                    // Compare i to length
00007FFADFCD0C05  jae         00007FFADFCD0C35           // if greater than or equal, jump and throw exception
00007FFADFCD0C07  mov         r9,qword ptr [rdx+8]       // Copy List's array address to r9
00007FFADFCD0C0B  movsxd      r10,r11d                   // Copy j  to r10 and sign extend.
00007FFADFCD0C0E  mov         rax,qword ptr [r9+8]       // Copy array's length to rax
00007FFADFCD0C12  cmp         r10,rax                    // Compare j to the array's length
00007FFADFCD0C15  jae         00007FFADFCD0C30           // If greater than or equal, jump and throw exception
00007FFADFCD0C17  mov         eax,dword ptr [r9+r10*4+10h]  // copy array[j] (list[i]) to eax
00007FFADFCD0C1C  add         r8d,eax                    // sum += array[j]
00007FFADFCD0C1F  inc         ecx                        // i++
00007FFADFCD0C21  inc         r11d                       // j++
00007FFADFCD0C24  jmp         00007FFADFCD0BF5           // Jump to beginning of loop 
00007FFADFCD0C26  add         rsp,20h                    // Clean up stack
00007FFADFCD0C2A  pop         rdi  
00007FFADFCD0C2B  pop         rsi  
00007FFADFCD0C2C  pop         rbx  
00007FFADFCD0C2D  ret                                    // return

00007FFADFCD0C30  call        00007FFB3F7E6590           // Throw exception

00007FFADFCD0C35  mov         ecx,0Dh                    
00007FFADFCD0C3A  call        00007FFB3CD990F8           // Throw exception

Here’s what’s going on:

  • The method calls on List<int> have been inlined away—we get direct memory access to the underlying array that List<int> maintains.
  • There is an index i that is compared against the length (Count) of the List<int>.
  • There is a separate index (which I call j in the annotations) that is compared against the length of the internal array that List<int> wraps.
    • Failing either check will cause an exception to be thrown.

The duplicate index checks don’t really cause the loop to be twice as slow as a pure array loop would be—there is still only one memory access per iteration and everything else operates on registers, which is extremely efficient. But if you have the choice between arrays and List<T>, well, it’s hard to beat an array for pure efficiency.

Foreach on List<T>

Finally, let’s turn our attention to foreach on List<T>. Will the compiler (or the JIT) turn this into a straightforward for loop?
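
For reference, the source presumably looks like this (my reconstruction to mirror the earlier examples; only the IL appears below):

private static int GetSumForeach(List<int> list)
{
    int sum = 0;
    foreach (var val in list)
    {
        sum += val;
    }
    return sum;
}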

No. Here’s the IL:

.method private hidebysig static 
	int32 GetSumForeach (
		class [mscorlib]System.Collections.Generic.List`1 list
	) cil managed 
{
	// Method begins at RVA 0x20f4
	// Code size 50 (0x32)
	.maxstack 2
	.locals init (
		[0] int32 sum,
		[1] int32 val,
		[2] valuetype [mscorlib]System.Collections.Generic.List`1/Enumerator CS$5$0000
	)

	IL_0000: ldc.i4.0
	IL_0001: stloc.0
	IL_0002: ldarg.0
	IL_0003: callvirt instance valuetype [mscorlib]System.Collections.Generic.List`1/Enumerator<!0> class [mscorlib]System.Collections.Generic.List`1::GetEnumerator()
	IL_0008: stloc.2
	.try
	{
		IL_0009: br.s IL_0017
		// loop start (head: IL_0017)
			IL_000b: ldloca.s CS$5$0000
			IL_000d: call instance !0 valuetype [mscorlib]System.Collections.Generic.List`1/Enumerator::get_Current()
			IL_0012: stloc.1
			IL_0013: ldloc.0
			IL_0014: ldloc.1
			IL_0015: add
			IL_0016: stloc.0

			IL_0017: ldloca.s CS$5$0000
			IL_0019: call instance bool valuetype [mscorlib]System.Collections.Generic.List`1/Enumerator::MoveNext()
			IL_001e: brtrue.s IL_000b
		// end loop

		IL_0020: leave.s IL_0030
	} // end .try
	finally
	{
		IL_0022: ldloca.s CS$5$0000
		IL_0024: constrained. valuetype [mscorlib]System.Collections.Generic.List`1/Enumerator
		IL_002a: callvirt instance void [mscorlib]System.IDisposable::Dispose()
		IL_002f: endfinally
	} // end handler

	IL_0030: ldloc.0
	IL_0031: ret
} // end of method Program::GetSumForeach

I won’t show the assembly code for this, but suffice it to say, it’s also quite different.

Foreach on (IEnumerable<T>)array

What would you expect to happen with code like this?

private static int GetSumForeach(int[] array)
{
    var collection = array as IEnumerable<int>;
    int sum = 0;
    foreach (var val in collection)
    {
        sum += val;
    }
    return sum;
}

In this case, the array optimization will be lost and you’ll get the same thing you did with foreach over List<T>.

Loop Unrolling

Loop unrolling is the practice of reducing the overhead of a loop by causing fewer iterations, but more work per iteration. For loop-heavy code, it can make a big difference.

You can do this unrolling yourself, such as with the following example:

private static int GetSum(int[] array)
{
    int sum = 0;
    for (int i = 0; i <= array.Length - 4; i += 4)  // any leftover elements (length % 4) still need separate handling
    {
        sum += array[i];
        sum += array[i+1];
        sum += array[i+2];
        sum += array[i+3];
    }
    return sum;
}

In some cases, the compiler can do this kind of optimization for you, but for loop unrolling to take place, there is one big constraint that must be obeyed: The loop constraints must be statically determinable so that the compiler can know how to modify the loop.

This is why we don’t see any loop unrolling in the assembly code for any of our previous examples. The length of the loop was always variable. What if, instead, we have something like this:

const int ArrayLength = 100;

private static int GetSum(int[] array)
{
    int sum = 0;
    for (int i = 0; i < ArrayLength; i++)
    {
        sum += array[i];
    }
    return sum;
}

The IL for this method looks about the same as before, but the assembly shows the optimization:

00007FFADFCE01C0  sub         rsp,28h              // Setup stack frame
00007FFADFCE01C4  mov         rdx,rcx              // Copy array address to rdx
00007FFADFCE01C7  xor         ecx,ecx              // Set sum to 0
00007FFADFCE01C9  xor         r8d,r8d              // Set address offset to 0
00007FFADFCE01CC  mov         rax,qword ptr [rdx+8]// Copy array length to rax
00007FFADFCE01D0  mov         r9d,60h              // Copy value 96 to r9
00007FFADFCE01D6  cmp         r9,rax               // Compare 96 to length
00007FFADFCE01D9  jae         00007FFADFCE0230     // If greater than or equal, jump and throw exception
00007FFADFCE01DB  mov         r9d,61h              // Copy value 97 to r9
00007FFADFCE01E1  cmp         r9,rax               // Compare to length
00007FFADFCE01E4  jae         00007FFADFCE0230     // If greater than or equal, jump and throw exception
00007FFADFCE01E6  mov         r9d,62h              // Copy value 98 to r9
00007FFADFCE01EC  cmp         r9,rax               // Compare to length
00007FFADFCE01EF  jae         00007FFADFCE0230     // If greater than or equal, jump and throw exception
00007FFADFCE01F1  mov         r9d,63h              // Copy value 99 to r9
00007FFADFCE01F7  cmp         r9,rax               // Compare to length
00007FFADFCE01FA  jae         00007FFADFCE0230     // If greater than or equal, jump and throw exception
00007FFADFCE01FC  nop         dword ptr [rax]      // Nop
// LOOP START
00007FFADFCE0200  mov         eax,dword ptr [rdx+r8+10h]  // Copy array[i] to eax
00007FFADFCE0205  add         ecx,eax                     //sum += array[i]
00007FFADFCE0207  mov         eax,dword ptr [rdx+r8+14h]  // Copy array[i+1] to eax
00007FFADFCE020C  add         ecx,eax                     // sum += array[i+1]
00007FFADFCE020E  mov         eax,dword ptr [rdx+r8+18h]  // Copy array[i+2] to eax
00007FFADFCE0213  add         ecx,eax                     // sum += array[i+2]
00007FFADFCE0215  mov         eax,dword ptr [rdx+r8+1Ch]  // Copy array[i+3] to eax
00007FFADFCE021A  add         ecx,eax                     // sum += array[i+3]
00007FFADFCE021C  add         r8,10h                      // Increment address offset by 16 bytes (4 integers)
00007FFADFCE0220  cmp         r8,190h                     // Compare address offset to 400
00007FFADFCE0227  jl          00007FFADFCE0200            // If less, loop around
00007FFADFCE0229  mov         eax,ecx                     // Copy sum to return value
00007FFADFCE022B  add         rsp,28h                     // Fix up stack
00007FFADFCE022F  ret                                     // return

00007FFADFCE0230  call        00007FFB3F7E6590            // throw exception

The basic operation of this is:

  1. Check that we can safely read N values at a time without overflowing the array (N being 4 in this specific case).
  2. Do N operations per loop body.

As soon as you deviate from the static determinism, however, you will not be able to rely on this optimization. However, if you have special knowledge of your arrays (such as that their length is always a multiple of 4), then you can do the loop unrolling yourself in the source code.

As far as the current JIT compiler goes, you will lose this optimization if you change the array size in this example to 101. Some compilers will split this into two parts–an unrolled loop, with the leftovers handled at the end. The JIT compiler doesn’t do this at this time (remember, it’s under a severe time constraint).
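
If you unroll by hand and can’t guarantee the length, you have to handle the leftovers yourself, along these lines (a sketch I’m adding, not code from the original article):

private static int GetSumUnrolled(int[] array)
{
    int sum = 0;
    int i = 0;
    // Process four elements per iteration while at least four remain.
    for (; i <= array.Length - 4; i += 4)
    {
        sum += array[i];
        sum += array[i + 1];
        sum += array[i + 2];
        sum += array[i + 3];
    }
    // Handle the 0-3 leftover elements.
    for (; i < array.Length; i++)
    {
        sum += array[i];
    }
    return sum;
}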

The ultimate loop unrolling of course is just eliminating the loop altogether and listing out every operation individually. For small loops, this could be advisable–it’s just a tradeoff between code maintenance and performance that you have to make.

Summary

Understanding how small differences in code translate into more or less efficient machine code is important if you need to get the last small measure of performance on your more performance-critical components, especially if you are CPU bound.

Of all of the collections, arrays are the most efficient to access, but List<T> is very close. When iterating over arrays, you want to try to eliminate the bounds check if possible. When deciding between for and foreach, keep in mind that foreach can involve method calls for anything except pure arrays.

Understand the conditions that allow for loop unrolling. If you can formulate your code to allow it, and it makes a difference, go for it, or manually unroll loops if you get a performance gain out of it.

Most of all, figure out how to answer these types of questions for yourself so that the next time deep performance questions arise, you know how to quickly determine the answer.

Book Review: High-Performance Windows Store Apps

I recently read Brian Rasmussen’s book High-Performance Windows Store Apps, and I think it’s an excellent companion to my own Writing High-Performance .NET Code.

It’s a good companion because while my book is all about the nitty-gritty details of .NET and how they affect performance, I don’t go too much into things at the UI layer. My perspective is much more systems and servers based, while Brian is coming at this from a UI/XAML focus. Brian covers a bunch of topics in a very good way.

The title specifically mentions “Windows Store” apps, but the principles and techniques described in here apply to any desktop/UI application, not just the types you see typically on tablets or phones. It’s worthwhile even if you develop just desktop apps.

Particular things I liked:

  • Chapter 3 – Designing for Performance – It presents a number of interesting performance scenarios, many of which aren’t obvious if you’re new to the new world of multi-device, cloud-connected apps we’re in. The discussion on resource management and prioritization was good too.
  • Chapter 4 – Instrumentation – A very good walkthrough of Event Tracing for Windows, which is what every developer needs to know these days, for tracking app events for performance, debugging, or just plain logging. Lots of good stuff here, and critical for all performance analyses these days.
  • Chapter 5 – Performance testing – Lots of good stuff in here I didn’t know about, particularly around building an automated performance testing environment for UI apps. Automation is critical, and this gives some good examples to get you on your way.

The book used the Windows Performance Recorder for most of its examples. It’s an excellent tutorial on how to make this tool useful to you, without getting bogged down into the complexities.

Overall, highly recommended. There are not very many guides on how to do performance engineering for XAML apps, and Brian writes a great one.

You can follow Brian on Twitter @kodehoved. Check out High-Performance Windows Store Apps at Amazon.

Appearance on .NET Rocks! Podcast

Carl and Richard put together a great podcast. .NET Rocks! has existed for years now and it’s amazing how many episodes they’ve published.

A couple of weeks ago, I had the privilege of recording their latest episode with them, #1041. We talked about a ton of interesting things like the importance of memory management, precise measurement, using the correct tools, not being afraid of the debugger, a little bit about Microsoft culture, and even LEGO!

Here is their description:

Carl and Richard talk to Ben Watson about his work around writing high performance .NET code. Ben talks about how the Bing team decided to use .NET code internally, which seems like an obvious choice for a Microsoft group, but it isn’t really – when milliseconds count, does .NET make sense? Ben says it does, and he’s done the work to prove it. Ben’s book “Writing High Performance .NET Code” focuses not only on coding techniques, but also the larger practice of having a deep understanding of how .NET works, and the processes that take place to turn .NET code into machine code. The conversation also digs deeply into the need for performance measurement, especially Event Tracing for Windows. .NET can be fast when you do it right!

Give a listen. Subscribe in iTunes or listen on the web. Let me know what you think!

Digging Into .NET Object Allocation Fundamentals

[Note: this article also appeared on CodeProject]

Introduction

While understanding garbage collection fundamentals is vital to working with .NET, it is also important to understand how object allocation works. It shows you just how simple and performant it is, especially compared to the potentially blocking nature of native heap allocations. In a large, native, multi-threaded application, heap allocations can be a major performance bottleneck, which forces you to adopt all sorts of custom heap management techniques. It’s also harder to measure when this is happening because many of those details are hidden behind the OS’s allocation APIs. More importantly, understanding this will give you clues to how you can mess up and make object allocation far less efficient.

In this article, I want to go through an example taken from Chapter 2 of Writing High-Performance .NET Code and then take it further with some additional examples that weren’t covered in the book.

Viewing Object Allocation in a Debugger

Let’s start with a simple object definition: completely empty.

class MyObject 
{
}

static void Main(string[] args)
{
    var x = new MyObject();
}

In order to examine what happens during allocation, we need to use a “real” debugger, like Windbg. Don’t be afraid of this. If you need a quick primer on how to get started, look at the free sample chapter on this page, which will get you up and running in no time. It’s not nearly as bad as you think.

Build the above program in Release mode for x86 (you can do x64 if you’d like, but the samples below are x86).

In Windbg, follow these steps to start and debug the program:

  1. Ctrl+E to execute a program. Navigate to and open the built executable file.
  2. Run command: sxe ld clrjit (this tells the debugger to break on loading any module with clrjit in the name, which you need loaded before the next steps)
  3. Run command: g (continues execution)
  4. When it breaks, run command: .loadby sos clr (loads .NET debugging tools)
  5. Run command: !bpmd ObjectAllocationFundamentals Program.Main (Sets a breakpoint at the beginning of a method. The first argument is the name of the assembly. The second is the name of the method, including the class it is in.)
  6. Run command: g

Execution will break at the beginning of the Main method, right before new() is called. Open the Disassembly window to see the code.

Here is the Main method’s code, annotated for clarity:

; Copy method table pointer for the class into
; ecx as argument to new()
; You can use !dumpmt to examine this value.
mov ecx,006f3864h
; Call new
call 006e2100 
; Copy return value (address of object) into a register
mov edi,eax

Note that the actual addresses will be different each time you execute the program. Step over (F10, or toolbar) a few times until call 006e2100 (or your equivalent) is highlighted. Then Step Into that (F11). Now you will see the primary allocation mechanism in .NET. It’s extremely simple. Essentially, at the end of the current gen0 segment, there is a reserved bit of space which I will call the allocation buffer. If the allocation we’re attempting can fit in there, we can update a couple of values and return immediately without more complicated work.

If I were to outline this in pseudocode, it would look like this:

if (object fits in current allocation buffer)
{
   Increment a pointer, return address;
}
else
{
   call JIT_New to do more complicated work in CLR
}

The actual assembly looks like this:

; Set eax to value 0x0c, the size of the object to
; allocate, which comes from the method table
006e2100 8b4104          mov     eax,dword ptr [ecx+4] ds:002b:006f3868=0000000c
; Put allocation buffer information into edx
006e2103 648b15300e0000  mov     edx,dword ptr fs:[0E30h]
; edx+40 contains the address of the next available byte
; for allocation. Add that value to the desired size.
006e210a 034240          add     eax,dword ptr [edx+40h]
; Compare the intended allocation against the
; end of the allocation buffer.
006e210d 3b4244          cmp     eax,dword ptr [edx+44h]
; If we spill over the allocation buffer,
; jump to the slow path
006e2110 7709            ja      006e211b
; update the pointer to the next free
; byte (0x0c bytes past old value)
006e2112 894240          mov     dword ptr [edx+40h],eax
; Subtract the object size from the pointer to
; get to the start of the new obj
006e2115 2b4104          sub     eax,dword ptr [ecx+4]
; Put the method table pointer into the
; first 4 bytes of the object.
; eax now points to new object
006e2118 8908            mov     dword ptr [eax],ecx
; Return to caller
006e211a c3              ret
; Slow Path - call into CLR method
006e211b e914145f71      jmp     clr!JIT_New (71cd3534)

In the fast path, there are only 9 instructions, including the return. That’s incredibly efficient, especially compared to something like malloc. Yes, that complexity is traded for time at the end of object lifetime, but so far, this is looking pretty good!

What happens in the slow path? The short answer is a lot. The following could all happen:

  • A free slot somewhere in gen0 needs to be located
  • A gen0 GC is triggered
  • A full GC is triggered
  • A new memory segment needs to be allocated from the operating system and assigned to the GC heap
  • Objects with finalizers need extra bookkeeping
  • Possibly more…

Another thing to notice is the size of the object: 0x0c (12 decimal) bytes. As covered elsewhere, this is the minimum size for an object in a 32-bit process, even if there are no fields.

Now let’s do the same experiment with an object that has a single int field.

class MyObjectWithInt { int x; }

Follow the same steps as above to get into the allocation code.

The first line of the allocator on my run is:

00882100 8b4104          mov     eax,dword ptr [ecx+4] ds:002b:00893874=0000000c

The only interesting thing is that the size of the object (0x0c) is exactly the same as before. The new int field fit into the minimum size. You can see this by examining the object with the !DumpObject command (or the abbreviated version: !do). To get the address of the object after it has been allocated, step over instructions until you get to the ret instruction. The address of the object is now in the eax register, so open up the Registers view and see the value. On my computer, it has a value of 2372770. Now execute the command: !do 2372770

You should see similar output to this:

0:000> !do 2372770
Name:        ConsoleApplication1.MyObjectWithInt
MethodTable: 00893870
EEClass:     008913dc
Size:        12(0xc) bytes
File:        D:\Ben\My Documents\Visual Studio 2013\Projects\ConsoleApplication1\ConsoleApplication1\bin\Release\ConsoleApplication1.exe
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
70f63b04  4000001        4         System.Int32  1 instance        0 x

This is curious. The field is at offset 4 (and an int has a length of 4), so that only accounts for 8 bytes (range 0-7). Offset 0 (i.e., the object’s address) contains the method table pointer, so where are the other 4 bytes? They are the sync block index, which actually sits at offset -4, just before the object’s address. Together, these account for the 12 bytes.

Try it with a long.

class MyObjectWithLong { long x; }

The first line of the allocator is now:

00f22100 8b4104          mov     eax,dword ptr [ecx+4] ds:002b:00f33874=00000010

This shows a size of 0x10 (16 bytes decimal), which is what we would expect now: the 12-byte minimum object size leaves only 4 bytes of field space, so the 8-byte long requires an extra 4 bytes. An examination of the allocated object shows an object size of 16 bytes as well.

0:000> !do 2932770
Name:        ConsoleApplication1.MyObjectWithLong
MethodTable: 00f33870
EEClass:     00f313dc
Size:        16(0x10) bytes
File:        D:\Ben\My Documents\Visual Studio 2013\Projects\ConsoleApplication1\ConsoleApplication1\bin\Release\ConsoleApplication1.exe
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
70f5b524  4000002        4         System.Int64  1 instance 0 x

If you put an object reference into the test class, you’ll see the same thing as you did with the int.

Finalizers

Now let’s make it more interesting. What happens if the object has a finalizer? You may have heard that objects with finalizers have more overhead during GC. This is true–they will survive longer, require more CPU cycles, and generally cause things to be less efficient. But do finalizers also affect object allocation?

Recall that our Main method above looked like this:

mov ecx,006f3864h
call 006e2100 
mov edi,eax

If the object has a finalizer, however, it looks like this:

mov     ecx,119386Ch
call    clr!JIT_New (71cd3534)
mov     esi,eax

We’ve lost our nifty allocation helper! We now have to jump directly to JIT_New. Allocating an object that has a finalizer is a LOT slower than allocating a normal object. More internal CLR structures need to be modified to track this object’s lifetime. The cost isn’t just at the end of the object’s lifetime.

How much slower is it? In my own testing, it appears to be about 8-10x worse than the fast path of allocating a normal object. If you allocate a lot of objects, this difference is considerable. For this and other reasons, just don’t add a finalizer unless it really is required.
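
If you want to get a rough feel for the difference yourself, a crude harness like this is enough (my sketch; note that the finalizable loop also pays for the extra GC and finalization work those objects cause, so treat the numbers as illustrative only):

using System;
using System.Diagnostics;

class FinalizerAllocationCost
{
    class Plain { public int X; }
    class Finalizable { public int X; ~Finalizable() { } }

    static object s_keep;   // Keep a reference so the allocations aren't trivially dead.

    static void Main()
    {
        const int Count = 1000000;

        // Warm up both paths.
        s_keep = new Plain();
        s_keep = new Finalizable();

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < Count; i++) s_keep = new Plain();
        Console.WriteLine("Plain objects:       {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        for (int i = 0; i < Count; i++) s_keep = new Finalizable();
        Console.WriteLine("Finalizable objects: {0} ms", sw.ElapsedMilliseconds);
    }
}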

Calling the Constructor

If you are particularly eagle-eyed, you may have noticed that there was no call to a constructor to initialize the object once allocated. The allocator is changing some pointers, returning you an object, and there is no further function call on that object. This is because memory for class fields is always pre-initialized to 0 for you, and these objects had no further initialization requirements. Let’s see what happens if we change to the following definition:

class MyObjectWithInt { int x = 13; }

Now the Main function looks like this:

mov     ecx,0A43834h
; Allocate memory
call    00a32100
; Copy object address to esi
mov     esi,eax
; Set object + 4 to value 0x0D (13 decimal)
mov     dword ptr [esi+4],0Dh

The field initialization was inlined into the caller!

Note that this code is exactly equivalent:

class MyObjectWithInt { int x; public MyObjectWithInt() { this.x = 13; } }

But what if we do this?

class MyObjectWithInt 
{ 
    int x; 

    [MethodImpl(MethodImplOptions.NoInlining)]  
    public MyObjectWithInt() 
    { 
        this.x = 13; 
    } 
}

This explicitly disables inlining for the object constructor. There are other ways of preventing inlining, but this is the most direct.

Now we can see the call to the constructor happening after the memory allocation:

mov     ecx,0F43834h
call    00f32100
mov     esi,eax
mov     ecx,esi
call    dword ptr ds:[0F43854h]

Exercise for the Reader

Can you get the allocator shown above to jump to the slow path? How big does the allocation request have to be to trigger this? (Hint: Try allocating arrays of various sizes.) Can you figure this out by examining the registers and other values from the running code?
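
If you want a concrete starting point, something like this gives you a range of allocation sizes to step through in the debugger (a hypothetical harness; the sizes are chosen only to bracket interesting thresholds, such as the documented 85,000-byte large object heap cutoff):

static void Main(string[] args)
{
    // Step into each allocation and note which path the allocator takes.
    byte[] tiny = new byte[100];
    byte[] medium = new byte[10 * 1024];
    byte[] large = new byte[100 * 1024];   // Over 85,000 bytes: allocated on the large object heap.
    Console.WriteLine(tiny.Length + medium.Length + large.Length);
}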

Summary

You can see that in most cases, allocation of objects in .NET is extremely fast and efficient, requiring no calls into the CLR and no complicated algorithms in the simple case. Avoid finalizers unless absolutely needed. Not only are they less efficient during cleanup in a garbage collection, but they are slower to allocate as well.

Play around with the sample code in the debugger to get a feel for this yourself. If you wish to learn more about .NET memory handling, especially garbage collection, take a look at the book Writing High-Performance .NET Code.

Announcing Writing High-Performance .NET Code

This blog has been silent for far too long. That’s because I’ve been heads-down on a side project for the last 10 months. I’d like to announce my latest technical book:

Writing High-Performance .NET Code

[Image: book cover]

If you write managed code, you want this book. If you have friends who write managed code, they want this, even if they don’t know it yet.

Do you want your .NET code to have the absolute best performance it can? This book demystifies the CLR, teaching you how and why to write code with optimum performance. Learn critical lessons from a person who helped design and build one of the largest high-performance .NET systems in the world.

This book does not just teach you how the CLR works—it teaches you exactly what you need to do now to obtain the best performance today. It will expertly guide you through the nuts and bolts of extreme performance optimization in .NET, complete with in-depth examinations of CLR functionality, free tool recommendations and tutorials, useful anecdotes, and step-by-step guides to measure and improve performance.

Among the topics you will learn are how to:

  • Choose what to measure and why
  • Use many amazing tools, freely available, to solve problems quickly
  • Understand the .NET garbage collector and its effect on your application
  • Use effective coding patterns that lead to optimal garbage collection performance
  • Diagnose common GC-related issues
  • Reduce costs of JITting
  • Use multiple threads sanely and effectively, avoiding synchronization problems
  • Know which .NET features and APIs to use and which to avoid
  • Use code generation to avoid performance problems
  • Measure everything and expose hidden performance issues
  • Instrument your program with performance counters and ETW events
  • Use the latest and greatest .NET features
  • Ensure your code can run on mobile devices without problems
  • Build a performance-minded team

…and much more.

See http://www.writinghighperf.net for up-to-date information about the book. You can also like the Facebook page or subscribe to this blog to see updates.

The book is currently available via Amazon and Kobo. Barnes and Noble is pending. More retailers and formats will follow. See the Buy page to check for current availability.

I will also be posting some blog entries with topics that were inspired by the book but weren’t quite a good fit for it.

How To Debug GC Issues Using PerfView

Update: If you find this article useful, you can find a lot more information about garbage collection, debugging, PerfView, and .NET performance in my book Writing High-Performance .NET Code.

In my previous article, I discussed 4 ways to optimize your server application for good garbage collection performance. An essential part of that process is being able to analyze your GC performance to know where to focus your efforts. One of the first tools I always turn to is a little utility that has been publicly released by Microsoft.

PerfView Overview

PerfView is a stand-alone, no-install utility that can help you debug CPU and memory problems. It’s so light-weight and non-intrusive that it can be used to diagnose production applications with minimal impact.

I’ve never used it for CPU performance, so I can’t comment on that aspect of it, but that is the primary use for it (which is helpful to keep in mind when trying to grok the “quirky” UI).

PerfView collects data in two ways (as far as memory analysis is concerned):

  1. ETW tracing – This is the heart and soul of PerfView. It’s primarily an event analyzer with advanced grouping abilities to show you only the important things. If you want to know more about ETW, see this series at the ntdebugging blog.
  2. Heap dump – PerfView can dump the heap of your process and apply the same analysis and views that it does for events.

The basic view of the utility is a spreadsheet-like UI with function names and associated inclusive/exclusive costs – just like you would expect to see in a typical CPU profiler. The same paradigm is useful for memory analysis as well.

There are other views that summarize the collected events for you in easy-to-understand reports. We’ll take a quick look at all of this.

In this article, I’ll use PerfView to show you how to see the following:

  • How frequently garbage collections occur and how long they take.
  • The cause for Gen2 collections.
  • The source of large-object allocations.
  • The roots of all the memory in the heap to see who’s holding on to it.
  • A diff of the heap to see what’s changing most frequently.

Test Program

When using a new utility like this, it’s often extremely helpful to create your own test programs with known behavior to ensure that you can use the utility to see what you expect. I’ve created a very simple one, here:

class Program
{
    private static List<int[]> arrays = new List<int[]>();
    private static Random rand = new Random();

    static void Main(string[] args)
    {            
        Console.WriteLine("Press any key to exit...");
        while (!Console.KeyAvailable)
        {
            int size = rand.Next(1024, 100000);
            int[] newArray = new int[size];
            arrays.Add(newArray);
            System.Threading.Thread.Sleep(10);
        }
        Console.WriteLine("Done, exiting");
    }
}

This program “leaks” memory by continually creating arrays and storing them in a list that never gets cleared.

I also make it use server GC, to match what I discussed in the first article.

You can download the sample solution here.

Taking a Trace

When you start up PerfView, you’ll see a window like this:

[Screenshot: PerfView main window]

The manual is completely integrated into the program and can be accessed using the links in the menu bar. It’s a fairly dense information dump, but you can learn quite a bit about how to really get the most out of this utility.

First, start the test program and let it run in the background until we’re done taking the trace.

In PerfView, open the Collect menu and select the Collect command. A collection dialog will appear. Don’t change any settings for the moment; just hit Start Collection. You’ll see some status indicating the size and duration of the data collected. Let it go for at least 30 seconds. Note that you don’t specify which process you’re interested in at this stage – PerfView collects events for the machine as a whole.

[Screenshot: collection status dialog]

When you’re done, click Stop Collection. PerfView will process the collected events for a few seconds or minutes, and then a window will pop up asking you to select a process. Just cancel this (it wants to show you a CPU profile, which we’re not interested in right now) to get back to the main screen.

You’ll now see a file show up: PerfViewData.etl (unmerged). Click on the little arrow next to this and you’ll see:

[Screenshot: expanded list of views under PerfViewData.etl]

From this, we’ll find all the data we’re interested in.

Get GC Stats (pause times and more)

The first thing to do is get an overall picture of GC performance for your app. There is a view for just that. Double-click the GCStats report, and that will bring up a window with tables for each process. Find MemoryLeak.exe.

My test run yields this summary table:

[Screenshot: GCStats summary table for MemoryLeak.exe]

Every garbage collection was a generation 2 collection (that’s generally a bad thing), but at least they were fast (to be expected in such a simple program).

Reason for Gen 2 Collection

Gen 2 GCs can happen for two reasons—surviving a gen 1 collection, or allocating on the large object heap. This view will also tell us, further down, which of these is the reason:

[Screenshot: GCStats table showing the reason for each Gen 2 collection]

The collections happened because of large object allocation. You can also see that the second GC happened about 14 seconds after the first, and the next about 32 seconds after that. There are tons of other stats in this view, so look around and see what you can divine about the program’s behavior from this.

Get Source of Large Allocations

From the main PerfView screen, open the GC Heap Alloc Stacks view and find the correct process. This shows you a list of objects which represent the tops of allocation stacks.

[Screenshot: GC Heap Alloc Stacks view]

PerfView has helpfully organized all large-object allocations under the LargeObject entry. Double-click this to see all such allocations:

[Screenshot: allocation stacks under the LargeObject entry]

Important: If you see entries like this:

OTHER <<clr?>>

Then right-click on the list and click Lookup Symbols. Follow the instructions to get the symbol server set up so you can see CLR and Windows function names.

From the above entry view, it’s apparent that the vast majority of large objects are arrays being allocated in Main()—exactly what we expect given our predictable leaky program.

A note on the strange column names: remember how I said this program is designed for CPU profiling? These are typical columns for showing % of CPU time in various parts of a stack, repurposed for memory analysis. Inc % is the percent of bytes allocated on this object compared to all recorded allocations, Inc is the number of bytes allocated, and Inc Ct is the number of objects allocated.

In the above example, this reads: Allocated 6589 arrays for a total of 3.9 GB, accounting for 98% of the memory allocated in the process.

By the way, these are not 100% accurate numbers. There is some sampling going on because of how the events work, but it should be fairly close in most applications.

Who’s Referencing Leaking Memory?

One of the few ways to “leak” memory in C# is to hold onto it unknowingly. By taking a heap dump, we can see the path of object references for who’s holding onto memory.

We’ll need to do a different type of collection. In the main PerfView window, go to the Memory menu and click Take Heap Snapshot.

[Screenshot: Memory menu with Take Heap Snapshot]

Find your process and click Dump GC Heap. This performs a live heap walk (that is, the application continues running, so it’s possible the view is slightly inconsistent—not usually an issue), samples what it finds, and presents the results in the same type of view as before:

[Screenshot: heap dump view]

Right away you can see that the static variable MemoryLeak.Program.arrays is holding onto 100% of memory in our application. The stack to the root isn’t that interesting in this case because all static variables are rooted directly, but if this were a member field, you would see the hierarchy of objects that are holding onto these references.

Use Two Heap Dumps to see What’s Increasing Fastest

While the test program is still running, take another heap dump, ensuring you save it to a different file. Open both dump views; in the second one, go to the Diff menu, where there will be an option to use the other file as a baseline for the diff. This will bring up another window showing you the changes between the two dump files—extremely helpful for narrowing down the most likely areas for leaks.

Important: If you want to analyze the perf trace on a different computer than the one you took it on, you must tell PerfView to merge the file—this will cause all the different files it generated to be combined and symbols reconciled. Just right-click on the ETL file and select Merge. You can also optionally Zip the file (which implies a Merge).

Next Time

Next time, we’ll look at some more drastic measures for protecting yourself against expensive GCs—for when all else fails.

Resources

  • Download the sample test program here.
  • Get PerfView here.

4 Essential Tips for High-Performance Garbage Collection on Servers

Update: If you find this useful, you can read a much more complete treatment of garbage collection and performance in my book Writing High-Performance .NET Code.

Update: Part 2 – How to Debug GC Issues with PerfView is now available.

On this blog, I’ve alluded to the fact that I work on high-performance server applications, most recently in .Net. Writing these in .Net is just as possible as it is in native code, but it does come with its own set of challenges. In particular, one of the biggest things you need to learn how to deal with is garbage collection.

There is a lot out there already written about the CLR’s garbage collector, so I’m not going to go over many of the details. If you need a primer on it, MSDN has some documentation.

Read that first. For the rest of this article series, I will assume that you understand how the GC basically works.

In this and future articles, I’ll cover a lot of the stuff I’ve learned to improve application performance in the face of garbage collection.

Tip 1: Use Server GC

There are two modes of garbage collection (GC): workstation and server. As long as you’re running on multiple processors, you almost certainly want server mode collection. With workstation mode, a GC happens on the thread that makes the allocation that causes the GC. The collection happens at normal priority.

With server GC, a thread for every core is created just for doing GC. There is also a small object heap and a large object heap created for each GC thread. All of the program’s allocations are spread among these heaps (more on large object heaps later). When no GC is happening, these threads are blocked and do nothing. When a GC is triggered, all of the user threads get paused, and all the GC threads wake up at highest priority and do collection in parallel. All of these optimizations lead to server GC usually being much faster than workstation GC.

A word about concurrent collections: in workstation GC, concurrent collections are enabled by default, but they apply only to generation 2 collections; generations 0 and 1 are always blocking. And because the collection runs concurrently, it competes with your own threads that are trying to get actual work done. In a high-performance server scenario, that may not be acceptable. A better strategy is to ensure that generation 2 collections never (or extremely rarely) happen.
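
If you measure and find that concurrent collection is hurting more than it helps, it can be turned off with the standard gcConcurrent runtime setting in app.config (shown here for reference; measure before and after you change it):

<configuration>
   <runtime>
      <gcConcurrent enabled="false"/>
   </runtime>
</configuration>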

You enable server GC by putting this in app.config:

<configuration>
   <runtime>
      <gcServer enabled="true"/>
   </runtime>
</configuration>

 

Tip 2: Objects Live Briefly or Forever

A histogram of object lifetimes in your app should look essentially like this:

[Image: histogram of object lifetimes, with peaks at very short-lived and forever]

Objects last either a vanishingly brief amount of time, or they last forever – it’s the stuff in the middle that will kill your performance.

This has everything to do with the generations of garbage collection and object survivorship. There are three generations: 0, 1, and 2. Generation 0 happens most often and is the fastest—ideally lasting only a couple of milliseconds, if that. Objects that didn’t get cleaned up in generation 0 are put into generation 1. Generation 1 collections are also very fast, usually as fast as generation 0. The problem, though, is that objects that make it to generation 1 have a fair chance of surviving this generation, and being put into generation 2.

Generation 2 is the problem. A generation 2 collection is much slower than 0 or 1—often on the order of hundreds of milliseconds or even seconds—that means your process is paused completely for that time. You do not want objects to survive to generation 2.

So how often do collections happen? There is no hard-and-fast rule: it all depends on your allocation rate, memory pressure, and patterns that you’ve trained into the GC. The GC will adapt over time, training itself on your memory usage patterns. All of this completely depends on your application and I’ll look at ways to measure all of this in a future article.
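
In the meantime, a quick-and-dirty way to get a feel for collection frequency from inside your own process (not a substitute for real profiling tools) is to poll GC.CollectionCount for each generation. A minimal sketch:

using System;
using System.Threading;

class GCCounter
{
    static void Main()
    {
        while (!Console.KeyAvailable)
        {
            // CollectionCount returns how many collections of that generation
            // have happened since the process started.
            Console.WriteLine("Gen 0: {0}  Gen 1: {1}  Gen 2: {2}",
                GC.CollectionCount(0),
                GC.CollectionCount(1),
                GC.CollectionCount(2));
            Thread.Sleep(1000);
        }
    }
}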

Tip 3: All Long-Lived Objects Must Be Pooled

It may be that you can’t ensure all objects for a given request are cleaned up in the first generation 0 collection that occurs. If requests are in memory longer than the time between collections, then you’re guaranteed to have survivorship.

For these types of objects, first see if you can factor them so that not all parts of them have to live that long. Control object lifetime very closely and null out references once you’re done.

Once you’ve done that, hopefully there are only a handful of objects that really must last the entire length of a request. For those, create a pool of them with reinitialization semantics—effectively move them to the far end of that histogram above, where they live forever.
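
A minimal sketch of such a pool (assuming the pooled type can safely be reused once its state is reset; a real pool also needs a cap on how many instances it retains):

using System.Collections.Concurrent;

class ObjectPool<T> where T : new()
{
    private readonly ConcurrentBag<T> items = new ConcurrentBag<T>();

    public T Get()
    {
        T item;
        // Reuse a previously returned instance if one is available;
        // otherwise allocate a new one.
        return items.TryTake(out item) ? item : new T();
    }

    public void Return(T item)
    {
        // The caller must reinitialize the object's state before it is
        // handed out again.
        items.Add(item);
    }
}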

This works because of the adaptive nature of the garbage collector – if it learns over time that collections of a given generation don’t free up much memory, it will schedule that generation’s collections to happen less frequently. In my own case, at one point, our server had trained the GC to do a generation 2 collection less often than once per day, under a constant load. With enough work, we could probably get that to essentially never.

You may be able to get quite far without the need to implement object pooling. Or you may need to pool only a small number of objects, and the survivorship of the remaining objects is not enough to cause problematic garbage collections—only measurement and observation will tell you for sure.

Tip 4: All Large Objects Must Be Pooled

There is a way to cause an object to automatically be in generation 2: make it at least 85000 bytes in size. Anything at least that size gets put into a Large Object Heap. Only generation 2 collections service that type of heap.

Want to cause a generation 2 collection? Do this:

byte[] buffer = new byte[85000];

If you want high-performance, you absolutely cannot do this per request on a server. These types of buffers, or other large objects, must be pooled. There is no built-in pooling mechanism in .Net—you must write your own. There are usually not too many large objects you’ll need to pool: strings and byte buffers are the usual suspects, if you need to do much serialization/deserialization, but also look out for collections of any type.
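
Here is a bare-bones sketch of the idea for byte buffers (the Rent/Return names are just for illustration; a real pool needs a cap on how many buffers it retains and a policy for callers that need other sizes):

using System.Collections.Concurrent;

class BufferPool
{
    // 85000 bytes is at the large object threshold, so these buffers land on
    // the Large Object Heap and are worth reusing instead of reallocating.
    private const int BufferSize = 85000;
    private readonly ConcurrentQueue<byte[]> buffers = new ConcurrentQueue<byte[]>();

    public byte[] Rent()
    {
        byte[] buffer;
        return buffers.TryDequeue(out buffer) ? buffer : new byte[BufferSize];
    }

    public void Return(byte[] buffer)
    {
        buffers.Enqueue(buffer);
    }
}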

If you want to know more about the Large Object Heap and why 85000 bytes is the threshold, read this great article: Large Object Heap Uncovered.

Pooling collection objects comes with its own set of challenges:

  • You can’t assume the full collection is valid (the difference between length and capacity). If you use pooled arrays, for example, you have to track the valid length separately, since only a small portion of the array may contain real data. This can drastically affect the interfaces between components (see the sketch after this list).
  • Pooled collections that can grow over time will cause your memory to rise indefinitely unless you put limits on the size of the pool and/or the size of collections within the pool.
  • Large Object Heaps are not compacted during collection, which means that you can fragment the heap such that it’s wasting a lot of memory. It all depends on your allocation and collection pattern. I may talk about heap fragmentation in another article.
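
To make the first point concrete, one simple option is to pass the pooled array together with its valid length, for example via the built-in ArraySegment<T> (a small custom struct works just as well). A sketch, with hypothetical buffer contents:

using System;

class SegmentExample
{
    // The consumer only ever sees the valid portion of the pooled array.
    static int Sum(ArraySegment<byte> data)
    {
        int sum = 0;
        for (int i = data.Offset; i < data.Offset + data.Count; i++)
        {
            sum += data.Array[i];
        }
        return sum;
    }

    static void Main()
    {
        byte[] buffer = new byte[85000];  // imagine this came from a pool
        int validLength = 10;             // only part of the buffer holds real data
        for (int i = 0; i < validLength; i++)
        {
            buffer[i] = (byte)i;
        }
        Console.WriteLine(Sum(new ArraySegment<byte>(buffer, 0, validLength)));
    }
}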

Once you solve those, you’re good to go… no more generation 2 collections!

Next Time…

In my next article, I’ll cover tools you can use to measure garbage collection statistics, and how you can use that knowledge to improve your performance.

Use Appropriate Collections

.Net makes using fancy collections very easy. In fact, it can be almost too easy. It is so simple to just throw in a List<T> or a ConcurrentDictionary<K, T> that it’s tempting to do it at every opportunity.

Today’s tip is to stop and think critically about the type of collections you need.

Some examples:

  • I was doing a code review recently and saw that this person was using Dictionary<string, bool> where every value was true. This is not a dictionary—it’s a set with complicated accessors. So use HashSet<T> – it has a simpler API, which will lead to your own code being simpler and more correct semantically. Dictionary<K, T> has a meaning, and if you’re abusing that meaning, then your code is likely incorrect, or in the best case, misleading (and thus a maintenance problem, and thus incorrect!).
  • Performance characteristics are often non-intuitive. For example, if you’re doing a lot of searching for values, which would you use? Dictionary<K, T> or List<T>? Most people would say the Dictionary, and perhaps in 95% of cases, they’re right. On the other hand, if you only have a few values, you may get better performance because of cache locality with a binary search over a List (or even a linear search), which is usually implemented as an array.
  • Speaking of arrays, if you have a read-only vector of a known size, use arrays instead of Lists. It’s semantically more correct, simpler, and usually more performant.
  • Another code review I was doing had usage of ConcurrentDictionary<K, T>. This sounds like a great type to use when you need to modify and/or read a dictionary with multiple threads, but the usage of this type is not that straightforward, and in fact the official documentation is unclear on some things. In this case, it was better to redesign at a higher level to avoid use of this type.
  • I’ve seen this type of code in reviews a few times:
var list = new List<MyObject>();
for (int i = 0; i < source.Count; ++i)
{
    list.Add(source.Get(i));
}

        In most apps, maybe this doesn’t matter. If you care about performance, then you should care that there are going to be an unknown number of pointless memory allocations, plus multiple copies of the old data into the new arrays as the list grows. In a high-performance system, this matters. Use the constructor of the collection that takes an initial size, as shown in the sketch below. (Guess how many items are in the default List<T> internal item array: none.)
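
For example (reusing the hypothetical source and MyObject from the snippet above), either of these avoids the repeated growth and copying of the list’s internal array:

// Pre-size the list so its internal array is allocated once.
var list = new List<MyObject>(source.Count);
for (int i = 0; i < source.Count; ++i)
{
    list.Add(source.Get(i));
}

// Or, if source implements IEnumerable<MyObject>, copy it in one call.
var copy = new List<MyObject>(source);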

Some questions you can ask yourself:

  • Is the collection semantically correct? Or am I abusing it because it can do what I want? Do I have restrictions on the use of the collection that are not obeyed by the API of the collection? Is there a more appropriate data structure I can use? If not, should I wrap this collection into an API with suitable restrictions?
  • Am I using the simplest collection possible?
  • Is this collection type performant for how it will be used? How can I measure to make sure? Am I making assumptions that may be unfounded because of usage patterns or hardware optimizations? Am I initializing the data structure correctly, to avoid unnecessary memory reallocations?
  • Is the collection type I’m considering too complex to use effectively? Would it be better to redesign something at a higher level to avoid needing to use this collection type?

Collections are usually the core of any application (if you don’t have data, what are you acting upon?). Getting these right means simpler code, better performance, higher readability, and long-term maintainability. It doesn’t take any more work (usually) – it just takes a few moments to think about what you’re really doing.

Like this tip? Check out my book, C# 4.0 How-To, where you’ll find hundreds of great tips like this one.