Tag Archives: gc

Announcing Microsoft.IO.RecycableMemoryStream

It is with great pleasure that I announce the latest open source release from Microsoft. This time it’s coming from Bing.

Before explaining what it is and how it works, I have to mention that nearly all of the work for actually getting this setup on GitHub and preparing it for public release was done by by Chip Locke [Twitter | Blog], one of our superstars.

What It Is

Microsoft.IO.RecyclableMemoryStream is a MemoryStream replacement that offers superior behavior for performance-critical systems. In particular it is optimized to do the following:

Eliminate Large Object Heap allocations by using pooled buffers
Incur far fewer gen 2 GCs, and spend far less time paused due to GC
Avoid memory leaks by having a bounded pool size
Avoid memory fragmentation
Provide excellent debuggability
Provide metrics for performance tracking

In my book Writing High-Performance .NET Code, I had this anecdote:

In one application that suffered from too many LOH allocations, we discovered that if we pooled a single type of object, we could eliminate 99% of all problems with the LOH. This was MemoryStream, which we used for serialization and transmitting bits over the network. The actual implementation is more complex than just keeping a queue of MemoryStream objects because of the need to avoid fragmentation, but conceptually, that is exactly what it is. Every time a MemoryStream object was disposed, it was put back in the pool for reuse.

-Writing High-Performance .NET Code, p. 65

The exact code that I’m talking about is what is being released.

How It Works

Here are some more details about the features:

A drop-in replacement for System.IO.MemoryStream. It has exactly the same semantics, as close as possible.
Rather than pooling the streams themselves, the underlying buffers are pooled. This allows you to use the simple Dispose pattern to release the buffers back to the pool, as well as detect invalid usage patterns (such as reusing a stream after it’s been disposed).
Completely thread-safe. That is, the MemoryManager is thread safe. Streams themselves are inherently NOT thread safe.
Each stream can be tagged with an identifying string that is used in logging. This can help you find bugs and memory leaks in your code relating to incorrect pool use.
Debug features like recording the call stack of the stream allocation to track down pool leaks
Maximum free pool size to handle spikes in usage without using too much memory.
Flexible and adjustable limits to the pooling algorithm.
Metrics tracking and events so that you can see the impact on the system.
Multiple internal pools: a default “small” buffer (default of 128 KB) and additional, “large” pools (default: in 1 MB chunks). The pools look kind of like this:

In normal operation, only the small pool is used. The stream abstracts away the use of multiple buffers for you. This makes the memory use extremely efficient (much better than MemoryStream’s default doubling of capacity).

The large pool is only used when you need a contiguous byte[] buffer, via a call to GetBuffer or (let’s hope not) ToArray. When this happens, the buffers belonging to the small pool are released and replaced with a single buffer at least as large as what was requested. The size of the objects in the large pool are completely configurable, but if a buffer greater than the maximum size is requested then one will be created (it just won’t be pooled upon Dispose).

Examples

You can jump right in with no fuss by just doing a simple replacement of MemoryStream with something like this:

var sourceBuffer = new byte[]{0,1,2,3,4,5,6,7}; 
var manager = new RecyclableMemoryStreamManager(); 
using (var stream = manager.GetStream()) 
{ 
    stream.Write(sourceBuffer, 0, sourceBuffer.Length); 
}

Note that RecyclableMemoryStreamManager should be declared once and it will live for the entire process–this is the pool. It is perfectly fine to use multiple pools if you desire.

To facilitate easier debugging, you can optionally provide a string tag, which serves as a human-readable identifier for the stream. In practice, I’ve usually used something like “ClassName.MethodName” for this, but it can be whatever you want. Each stream also has a GUID to provide absolute identity if needed, but the tag is usually sufficient.

using (var stream = manager.GetStream("Program.Main"))
{
    stream.Write(sourceBuffer, 0, sourceBuffer.Length);
}

You can also provide an existing buffer. It’s important to note that this buffer will be copied into the pooled buffer:

var stream = manager.GetStream("Program.Main", sourceBuffer, 
                                    0, sourceBuffer.Length);

You can also change the parameters of the pool itself:

int blockSize = 1024;
int largeBufferMultiple = 1024 * 1024;
int maxBufferSize = 16 * largeBufferMultiple;

var manager = new RecyclableMemoryStreamManager(blockSize, 
                                                largeBufferMultiple, 
                                                maxBufferSize);

manager.GenerateCallStacks = true;
manager.AggressiveBufferReturn = true;
manager.MaximumFreeLargePoolBytes = maxBufferSize * 4;
manager.MaximumFreeSmallPoolBytes = 100 * blockSize;

Is this library for everybody? No, definitely not. This library was designed with some specific performance characteristics in mind. Most applications probably don’t need those. However, if they do, then this library can absolutely help reduce the impact of GC on your software.

Let us know what you think! If you find bugs or want to improve it in some way, then dive right into the code on GitHub.

Digging Into .NET Object Allocation Fundamentals

[Note: this article also appeared on CodeProject]

Introduction

While understanding garbage collection fundamentals is vital to working with .NET, it is also important to understand how object allocation works. It shows you just how simple and performant it is, especially compared to the potentially blocking nature of native heap allocations. In a large, native, multi-threaded application, heap allocations can be major performance bottleneck which requires you to perform all sorts of custom heap management techniques. It’s also harder to measure when this is happening because many of those details are hidden behind the OS’s allocation APIs. More importantly, understanding this will give you clues to how you can mess up and make object allocation far less efficient.

In this article, I want to go through an example taken from Chapter 2 of Writing High-Performance .NET Code and then take it further with some additional examples that weren’t covered in the book.

Viewing Object Allocation in a Debugger

Let’s start with a simple object definition: completely empty.

class MyObject 
{
}

static void Main(string[] args)
{
    var x = new MyObject();
}

In order to examine what happens during allocation, we need to use a “real” debugger, like Windbg. Don’t be afraid of this. If you need a quick primer on how to get started, look at the free sample chapter on this page, which will get you up and running in no time. It’s not nearly as bad you think.

Build the above program in Release mode for x86 (you can do x64 if you’d like, but the samples below are x86).

In Windbg, follow these steps to start and debug the program:

Ctrl+E to execute a program. Navigate to and open the built executable file.
Run command: sxe ld clrjit (this tells the debugger to break on loading any assembly with clrjit in the name, which you need loaded before the next steps)
Run command: g (continues execution)
When it breaks, run command: .loadby sos clr (loads .NET debugging tools)
Run command: !bpmd ObjectAllocationFundamentals Program.Main (Sets a breakpoint at the beginning of a method. The first argument is the name of the assembly. The second is the name of the method, including the class it is in.)
Run command: g

Execution will break at the beginning of the Main method, right before new() is called. Open the Disassembly window to see the code.

Here is the Main method’s code, annotated for clarity:

; Copy method table pointer for the class into
; ecx as argument to new()
; You can use !dumpmt to examine this value.
mov ecx,006f3864h
; Call new
call 006e2100 
; Copy return value (address of object) into a register
mov edi,eax

Note that the actual addresses will be different each time you execute the program. Step over (F10, or toolbar) a few times until call 006e2100 (or your equivalent) is highlighted. Then Step Into that (F11). Now you will see the primary allocation mechanism in .NET. It’s extremely simple. Essentially, at the end of the current gen0 segment, there is a reserved bit of space which I will call the allocation buffer. If the allocation we’re attempting can fit in there, we can update a couple of values and return immediately without more complicated work.

If I were to outline this in pseudocode, it would look like this:

if (object fits in current allocation buffer)
{
   Increment a pointer, return address;
}
else
{
   call JIT_New to do more complicated work in CLR
}

The actual assembly looks like this:

; Set eax to value 0x0c, the size of the object to
; allocate, which comes from the method table
006e2100 8b4104          mov     eax,dword ptr [ecx+4] ds:002b:006f3868=0000000c
; Put allocation buffer information into edx
006e2103 648b15300e0000  mov     edx,dword ptr fs:[0E30h]
; edx+40 contains the address of the next available byte
; for allocation. Add that value to the desired size.
006e210a 034240          add     eax,dword ptr [edx+40h]
; Compare the intended allocation against the
; end of the allocation buffer.
006e210d 3b4244          cmp     eax,dword ptr [edx+44h]
; If we spill over the allocation buffer,
; jump to the slow path
006e2110 7709            ja      006e211b
; update the pointer to the next free
; byte (0x0c bytes past old value)
006e2112 894240          mov     dword ptr [edx+40h],eax
; Subtract the object size from the pointer to
; get to the start of the new obj
006e2115 2b4104          sub     eax,dword ptr [ecx+4]
; Put the method table pointer into the
; first 4 bytes of the object.
; eax now points to new object
006e2118 8908            mov     dword ptr [eax],ecx
; Return to caller
006e211a c3              ret
; Slow Path - call into CLR method
006e211b e914145f71      jmp     clr!JIT_New (71cd3534)

In the fast path, there are only 9 instructions, including the return. That’s incredibly efficient, especially compared to something like malloc. Yes, that complexity is traded for time at the end of object lifetime, but so far, this is looking pretty good!

What happens in the slow path? The short answer is a lot. The following could all happen:

A free slot somewhere in gen0 needs to be located
A gen0 GC is triggered
A full GC is triggered
A new memory segment needs to be allocated from the operating system and assigned to the GC heap
Objects with finalizers need extra bookkeeping
Possibly more…

Another thing to notice is the size of the object: 0x0c (12 decimal) bytes. As covered elsewhere, this is the minimum size for an object in a 32-bit process, even if there are no fields.

Now let’s do the same experiment with an object that has a single int field.

class MyObjectWithInt { int x; }

Follow the same steps as above to get into the allocation code.

The first line of the allocator on my run is:

00882100 8b4104          mov     eax,dword ptr [ecx+4] ds:002b:00893874=0000000c

The only interesting thing is that the size of the object (0x0c) is exactly the same as before. The new int field fit into the minimum size. You can see this by examining the object with the !DumpObject command (or the abbreviated version: !do). To get the address of the object after it has been allocated, step over instructions until you get to the ret instruction. The address of the object is now in the eax register, so open up the Registers view and see the value. On my computer, it has a value of 2372770. Now execute the command: !do 2372770

You should see similar output to this:

0:000> !do 2372770
Name:        ConsoleApplication1.MyObjectWithInt
MethodTable: 00893870
EEClass:     008913dc
Size:        12(0xc) bytes
File:        D:\Ben\My Documents\Visual Studio 2013\Projects\ConsoleApplication1\ConsoleApplication1\bin\Release\ConsoleApplication1.exe
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
70f63b04  4000001        4         System.Int32  1 instance        0 x

This is curious. The field is at offset 4 (and an int has a length of 4), so that only accounts for 8 bytes (range 0-7). Offset 0 (i.e., the object’s address) contains the method table pointer, so where are the other 4 bytes? This is the sync block and they are actually at offset -4 bytes, before the object’s address. These are the 12 bytes.

Try it with a long.

class MyObjectWithLong { long x; }

The first line of the allocator is now:

00f22100 8b4104          mov     eax,dword ptr [ecx+4] ds:002b:00f33874=00000010

Showing a size of 0x10 (decimal 16 bytes), which we would expect now. 12 byte minimum object size, but 4 already in the overhead, so an extra 4 bytes for the 8 byte long. And an examination of the allocated object shows an object size of 16 bytes as well.

0:000> !do 2932770
Name:        ConsoleApplication1.MyObjectWithLong
MethodTable: 00f33870
EEClass:     00f313dc
Size:        16(0x10) bytes
File:        D:\Ben\My Documents\Visual Studio 2013\Projects\ConsoleApplication1\ConsoleApplication1\bin\Release\ConsoleApplication1.exe
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
70f5b524  4000002        4         System.Int64  1 instance 0 x

If you put an object reference into the test class, you’ll see the same thing as you did with the int.

Finalizers

Now let’s make it more interesting. What happens if the object has a finalizer? You may have heard that objects with finalizers have more overhead during GC. This is true–they will survive longer, require more CPU cycles, and generally cause things to be less efficient. But do finalizers also affect object allocation?

Recall that our Main method above looked like this:

mov ecx,006f3864h
call 006e2100 
mov edi,eax

If the object has a finalizer, however, it looks like this:

mov     ecx,119386Ch
call    clr!JIT_New (71cd3534)
mov     esi,eax

We’ve lost our nifty allocation helper! We have to now jump directly to JIT_New. Allocating an object that has a finalizer is a LOT slower than a normal object. More internal CLR structures need to be modified to track this object’s lifetime. The cost isn’t just at the end of object lifetime.

How much slower is it? In my own testing, it appears to be about 8-10x worse than the fast path of allocating a normal object. If you allocate a lot of objects, this difference is considerable. For this, and other reasons, just don’t add a finalizer unless it really is required.

Calling the Constructor

If you are particularly eagle-eyed, you may have noticed that there was no call to a constructor to initialize the object once allocated. The allocator is changing some pointers, returning you an object, and there is no further function call on that object. This is because memory that belongs to a class field is always pre-initialized to 0 for you and these objects had no further initialization requirements. Let’s see what happens if we change to the following definition:

class MyObjectWithInt { int x = 13; }

Now the Main function looks like this:

mov     ecx,0A43834h
; Allocate memory
call    00a32100
; Copy object address to esi
mov     esi,eax
; Set object + 4 to value 0x0D (13 decimal)
mov     dword ptr [esi+4],0Dh

The field initialization was inlined into the caller!

Note that this code is exactly equivalent:

class MyObjectWithInt { int x; public MyObjectWithInt() { this.x = 13; } }

But what if we do this?

class MyObjectWithInt 
{ 
    int x; 

    [MethodImpl(MethodImplOptions.NoInlining)]  
    public MyObjectWithInt() 
    { 
        this.x = 13; 
    } 
}

This explicitly disables inlining for the object constructor. There are other ways of preventing inlining, but this is the most direct.

Now we can see the call to the constructor happening after the memory allocation:

mov     ecx,0F43834h
call    00f32100
mov     esi,eax
mov     ecx,esi
call    dword ptr ds:[0F43854h]

Exercise for the Reader

Can you get the allocator shown above to jump to the slow path? How big does the allocation request have to be to trigger this? (Hint: Try allocating arrays of various sizes.) Can you figure this out by examining the registers and other values from the running code?

Summary

You can see that in most cases, allocation of objects in .NET is extremely fast and efficient, requiring no calls into the CLR and no complicated algorithms in the simple case. Avoid finalizers unless absolutely needed. Not only are they less efficient during cleanup in a garbage collection, but they are slower to allocate as well.

Play around with the sample code in the debugger to get a feel for this yourself. If you wish to learn more about .NET memory handling, especially garbage collection, take a look at the book Writing High-Performance .NET Code.

How To Debug GC Issues Using PerfView

Update: If you find this article useful, you can find a lot more information about garbage collection, debugging, PerfView, and .NET performance in my book Writing High-Performance .NET Code.

In my previous artlcle, I discussed 4 ways to optimize your server application for good garbage collection performance. An essential part of that process is being able to analyze your GC performance to know where to focus your efforts. One of the first tools I always turn to is a little utility that has been publically released by Microsoft.

PerfView Overview

PerfView is a stand-alone, no-install utility that can help you debug CPU and memory problems. It’s so light-weight and non-intrusive that it can be used to diagnose production applications with minimal impact.

I’ve never used it for CPU performance, so I can’t comment on that aspect of it, but that is the primary use for it (which is helpful to keep in mind when trying to grok the “quirky” UI).

PerfView collects data in two ways (as far as memory analysis is concerned):

ETW tracing – This is the heart and soul of PerfView. It’s primarily an event analyzer with advanced grouping abilities to show you only the important things. If you want to know more about ETW, see this series at the ntdebugging blog.
Heap dump – PerfView can dump the heap of your process and apply the same analysis and views that it does for events.

The basic view of the utility is a spreadsheet-like UI with function names and associated inclusive/exclusive costs – just like you would expect to see in a typical CPU profiler. The same paradigm is useful for memory analysis as well.

There are other views that summarize the collected events for you in easy-to-understand reports. We’ll take a quick look at all of this.

In this article, I’ll use PerfView to show you how to see the following:

How frequently garbage collections occur and how long they take.
The cause for Gen2 collections.
The source of large-object allocations.
The roots of all the memory in the heap to see who’s holding on to it.
A diff of the heap to see what’s changing most frequently.

Test Program

When using a new utility like this, it’s often extremely helpful to create your own test programs with known behavior to ensure that you can use the utility to see what you expect. I’ve created a very simple one, here:

class Program
{
    private static List<int[]> arrays = new List<int[]>();
    private static Random rand = new Random();

    static void Main(string[] args)
    {            
        Console.WriteLine("Press any key to exit...");
        while (!Console.KeyAvailable)
        {
            int size = rand.Next(1024, 100000);
            int[] newArray = new int[size];
            arrays.Add(newArray);
            System.Threading.Thread.Sleep(10);
        }
        Console.WriteLine("Done, exiting");
    }
}

This program “leaks” memory by continually creating arrays and storing them in a list that never gets cleared.

I also make it use server GC, to match what I discussed in the first article.

You can download the sample solution here.

Taking a Trace

When you startup PerfView, you’ll see a window like this:

The manual is completely integrated into the program and can be accessed using the links in the menu bar. It’s a fairly dense information dump, but you can learn quite a bit about how to really get the most of out this utility.

First, start the test program and let it run in the background until we’re done taking the trace.

In PerfView, open the Collect menu and select the Collect command. A collection dialog will appear. Don’t change any setting for the moment and just hit Start Collection.You’ll see some status indicating the size and duration of the data collected. Let it go for at least 30 seconds. Note that you don’t specify which process you’re interested in at this stage – PerfView collects events for the machine as a whole.

When you’re done click Stop Collection. PerfView will process the collected events for a few seconds or minutes, and then a window will pop up asking you to select a process. Just cancel this (it wants to show you a CPU profile, which we’re not interested in right now) to get back to main screen.

You’ll now see a file show up: PerfViewData.etl (unmerged). Click on the little arrow next to this and you’ll see:

From this, we’ll find all the data we’re interested in.

Get GC Stats (pause times and more)

The first place to start is just to get an overall picture of GC performance for your app. There is a view for just that. Double-click the GCStats report, and that will bring up a window with tables for each app. Find MemoryLeak.exe

My test run yields this summary table:

Every garbage collection was a generation 2 collection (that’s generally a bad thing), but at least they were fast (to be expected in such a simple program).

Reason for Gen 2 Collection

Gen 2 GCs can happen for two reasons—surviving a gen 1 collection, or allocating on the large object heap. This view will also tell us, further down, which of these is the reason:

The collections happened because of large object allocation. You can also see that the second GC happened about 14 seconds after the first, and the next about 32 seconds after that. There are tons of other stats in this view, so look around and see what you can divine about the program’s behavior from this.

Get Source of Large Allocations

From the main PerfView screen, open the GC Heap Alloc Stacks view and find the correct process. This shows you a list of objects which represent the tops of allocation stacks.

PerfView has helpfully organized all large-object allocations under the LargeObject entry. Double click this to see all such all allocations:

Important: If you see entries like this:

OTHER <<clr?>>

Then right-click on the list and click on Lookup Symbols. Follow the instructions to get the symbol server setup so you can see CLR and Windows function names.

From the above entry view, it’s apparent that the vast majority of large objects are arrays being allocated in Main()—exactly what we expect given our predictable leaky program.

A note on the strange column names: remember how I said this program is designed for CPU profiling? These are typical columns for showing% of CPU time in various parts of a stack, repurposed for memory analysis. Inc % is the percent of bytes allocated on this object compared to all recorded allocations, Inc is the number of bytes allocated, and Inc Ct is the number of objects allocated.

In the above example, this reads: Allocated 6589 arrays for a total of 3.9 GB, accounting for 98% of the memory allocated in the process.

By the way, these are not 100% accurate numbers. There is some sampling going on because of how the events work, but it should be fairly close in most applications.

Who’s Referencing Leaking Memory?

One of the few ways to “leak” memory in C# is to hold onto it unknowingly. By taking a heap dump, we can see the path of object references for who’s holding onto memory.

We’ll need to do a different type of collection. In the main PerfView window. Go to the Memory menu and click Take Heap Snapshot.

Find your process and click Dump GC Heap. This performs a live heap walk (that is, the application continues running, so it’s possible the view is slightly inconsistent—not usually an issue), sampling what it finds, and presenting the results in the same type of view as before:

Right away you can see that the static variable MemoryLeak.Program.arrays is holding onto 100% of memory in our application. The stack to the root isn’t that interesting in this case because all static variables are rooted directly, but if this were a member field, you would see the hierarchy of objects that are holding onto these references.

Use Two Heap Dumps to see What’s Increasing Fastest

While the test program is still running, take another heap dump, ensuring you save it to a different file. Open both dump views and in the second one, go to the Diff menu and there will be an option to use the other file as a baseline for the diff. This will bring up another window showing you the changes between the two dump files—extremely helpful for narrowing down the most likely areas for leaks.

Important: If you want to analyze the perf trace on a different computer than the one you took it on, you must tell PerfView to merge the file—this will cause all the different files it generated to be combined and symbols reconciled. Just right-click on the ETL file and select Merge. You can also optionally Zip the file (which implies a Merge).

Next Time

Next time, we’ll look at some more drastic measures for protecting yourself against expensive GCs—for when all else fails.

Resources

Download the sample test program here.
Get PerfView here.

Philosophical Geek

Code and musings by Ben Watson