Tag Archives: Code

Never make assumptions about performance

The importance of measuring performance changes is a topic that has been covered by others smarter and more experienced than me, but I have a recent simple tale.

I’ve simplified the code quite a bit in order to demonstrate the issue. Suppose I have a wrapper around an image (it has many more attributes):

   1: class Picture
   2: {
   3:     Image _image;
   4:     string _path;
   5: 
   6:     public Image Photo
   7:     {
   8:         get
   9:         {
  10:             if (_image==null && !string.IsNullOrEmpty(_path))
  11:             {
  12:                 _image = Bitmap.FromFile(_path);
  13:             }
  14:             return _image;
  15:         }
  16:     }
  17: 
  18: }

I had this and a view that loading about 2,700 of these into a customized ListView control at program startup. On a cold start (where none of the pictures were in the disk’s cache), it would take 27 seconds. Unacceptable.

What to do?

My first thought was to load the pictures asynchronously. I wrapped Bitmap.FromFile() into a function and called it asynchronously. When it was done, it fired an event that percolated up to the top.

Well, I spent about 30 minutes implementing that and ran it–horrible. The list showed up immediately, but it was unusable. The problem? Dumping 2,700 items into the ThreadPool queue is a problem. It doesn’t create 2,700 threads, but it causes enough problems to not be a viable option.

Asynchronicity is still the answer, though. But it’s at a different level. Instead of loading the individual images asynchronously, I skipped loading the images when creating the list control and instead launched a thread to load all the images and update them when done. The list loads in under a second, and the pictures show up little by little after that.

Measure, measure, measure. And pay attention.

Technorati Tags: ,,


Check out my latest book, the essential, in-depth guide to performance for all .NET developers:

Writing High-Performance.NET Code, 2nd Edition by Ben Watson. Available for pre-order:

6 Ways to Increase Your Confidence As You Code

One of the key requirements for being able to reliably update software is the confidence that the changes you are making are safe. The amount of confidence required increases with the complexity of the system.

In my day job I work on a real-time messaging system that can have very, very little downtime. As the service grows and sees more traffic, the amount of downtime shrinks. We start to worry now if upgrades take longer than 5 minutes. (It’s almost to the point where we’ll need redundant systems in order to do maintenance).

To upgrade this software, I have to have an awful lot of confidence in the code changes made. Sometimes that confidence varies.

What to do to gain confidence:

1. Consistent process

Having a set of rules to follow is a fundamental requirement of good software engineering. I’m not going to discuss what the process should be, but you should have one that works well.

Why is this important?

Programmers like order. We like well-defined problems where we can see the end from the beginning. We don’t like haziness, indeterminism, or too many choices.

A process nails down the unknowns–it tells you very specifically what the next thing to do is. A good process leaves no room for doubt.

A good development, testing, and deployment process is the first step to building confidence in what you’re doing.

For my messaging system example, here’s a short summary of what our upgrade process is. We didn’t just come up with it–it evolved as our business grew and the requirements grew with it.

  1. All unit and integration tests pass (we have about 2200 automated tests for this that can run in about 4 minutes)
  2. We’ve run it on staging server for a few weeks with no issues.
  3. All new features have been tested on the staging server
  4. Formal change document has been submitted and approved.
  5. Formal test results documented.
  6. Previous version backed up.
  7. Perform upgrade
  8. Monitor system (including custom automated monitoring tools)

At any point, I know where we are in the process and what needs to be done next. Sure, there may be details within these steps that require thought and creativity, but the process guides it all and makes us more confident that we’re not performing ad-hoc operations.

2. Unit Testing

There are other types of testing, but it all starts at the unit level, with simple tests that exercise your code line by line, function by function, feature by feature. I recently wrote a few thoughts about unit testing. Unit testing is where you can see the overall wellness of your code–you want that green bar!

Without unit testing, how do you know the code you’re writing is doing what it should? do you just run it and push it through its paces? This is highly inefficient for most types of code. You’ll run out of steam before you start getting close to edge cases.

The fact is that automated unit tests are a baseline for confidence in your code. You need to be able demonstrate time and again that your code performs well.

This all presupposes that you are writing good unit tests. If you’re not sure, start studying. I don’t buy the arguments about lulling developers into a false sense of security–sure, that can happen, but having good developers who understand this is a prerequisite.

If you’re not unit testing–what is your basis for confidence in your code?

3. Code Coverage

Code coverage goes hand-in-hand with unit testing as a good way to automatically discover what areas of your program are in need of more testing. I’ve found that one of the biggest barriers to unit testing a large C++ application we have is that we have no way of easily measuring test coverage. If we had time, we could definitely to the analysis ourselves, or we could spend a lot of money to get a C++ instrumentation profiler, but these are slow and very tedious to use in my experience.

In .Net, use the tools to your advantage.

The psychological benefits of seeing 75-, 90-, 95-, even 100-percent coverage are immense. You know that every line of the program has at least been touched.

Of course, most code coverage tools analyze line coverage, not path coverage. Combine  complexity analysis with code coverage to determine which functionality should probably have better testing. There are plenty of free and commercial tools that will give you cyclomatic complexity, among other metrics.

Use other analysis tools like FxCop to make sure your other ducks are in a row. It can find easy-to-overlook problems like not validating arguments of public methods, which can then lead to more unit tests and more coverage to achieve.

4. Automation

Take yourself out of the equation as much as possible. The point of a process is to be repeatable–it’s like automating yourself. Not only should unit testing be automated (thankfully, most testing frameworks handle this easily), but so should coverage and quality analyses.

What about deployment? Automate it. Documentation generation? CD master creation? Web upload? E-mail notification? Automate them all. Production builds should be invoked with a single command.

Working on boring, repeatable code? Automate it with code-gen.

The bottom line is: Don’t waste your brain cells on stuff that is highly repeatable, especially when it is prone to mistakes.

5. Code Review

Last week, a rather serious bug was discovered in some of our software (not released yet, thankfully, but close). The bug was mine, and I knew exactly what the problem was, but instead of designing a solution by myself, I brought a co-worker into the discussion just to bounce ideas off of. He had great suggestions, and made me think of things I might not necessarily have thought of on my own. We both went over the code and came to a solution that was simple and acceptable to both of us. The confidence level was much higher with this than it would have been otherwise.

This story is repeated daily by programmers throughout the world. Code review is a practice based on the simple notion that there is no one person smart enough to get it correct the first time.

Even if you’re working alone, which I often do, it pays huge dividends to regularly review your code with an eye for finding trouble. If you see any weakness at all, don’t ignore it–fix it. If you’re reviewing your own code, it’s a good idea to wait a bit after the time you wrote it. This gives your brain a chance to forget a little bit about it. Then, if you find you can’t understand it anymore, it’s either too complicated, or (if it fundamentally really is complicated) you need better comments.

Reviewing with other people has more benefit, however. Not everybody thinks the same way about problems. People have different experience, different expertise and focus, and you can’t take advantage of that if you don’t let them teach you. Even if the other people have less expertise than you, it is still beneficial (assuming they have some basic competency that they can bring to the discussion).

Once you let other people tear into your code (nicely, I hope), your confidence can be higher because you can add the confidence other people have in it (once your problems are corrected, of course!)

6. Repeatable Experiences

In the end, one of the best ways to increase your confidence in yourself, your code, and your practices is to have the evidence of repeated experiences behind you. You’re always learning, and that learning contributes to improvements in processes, testing, and your personal coding practices. Once you learn what works, especially during tricky upgrades, you can go into the next trial with increased confidence that you’re doing something right.

Have any other ideas on increasing confidence? Leave them in the comments!

Technorati Tags: ,,


Check out my latest book, the essential, in-depth guide to performance for all .NET developers:

Writing High-Performance.NET Code, 2nd Edition by Ben Watson. Available for pre-order:

Easily Unit Testing Event Handlers

In C#, If you need to unit test a class that fires an event in certain circumstances (perhaps even asynchronously), you need to handle a little more than just running some code and doing the assertion. You have to make sure your unit test waits for the event to be fired. Here’s one naive way of doing it, a WRONG way:

   1: private bool statsUpdated = false;
   2: private ManualResetEvent statsUpdatedEvent = new ManualResetEvent(false);
   3:
   4: [Test]
   5: public void CheckStats()
   6: {
   7:     BrickDatabase db = new BrickDatabase(tempFolder, maxCacheAge);
   8:
   9:     statsUpdated = false;
  10:     statsUpdatedEvent.Reset();
  11:
  12:     db.InventoryStatsUpdated += new EventHandler(db_InventoryStatsUpdated);
  13:     db.DoSomethingThatFiresEvent();
  14:
  15:     statsUpdatedEvent.WaitOne();
  16:
  17:     Assert.IsTrue(statsUpdated);
  18: }
  19:
  20: void db_InventoryStatsUpdated(object sender, EventArgs e)
  21: {
  22:     statsUpdated = true;
  23:     statsUpdatedEvent.Set();
  24: }

There are a number of things wrong with this:

  1. The class variables. More complex unit test class. Have to coordinate these variables across multiple functions.
  2. Since they are class variables, you will want to reuse them, but you’d better remember to reset the event and the boolean every time!
  3. Have to have two functions to do something really, really simple.
  4. The WaitOne() does not have a timeout, so if the wait is ever satisfied then statsUpdated is guaranteed to be true.

Here’s a better way of doing it, using anonymous methods in C# 2.0:

   1: [Test]
   2: public void CheckStats()
   3: {
   4:     BrickDatabase db = new BrickDatabase(tempFolder, maxCacheAge);
   5:     bool statsUpdated = false;
   6:     ManualResetEvent statsUpdatedEvent = new ManualResetEvent(false);
   7:
   8:     db.InventoryStatsUpdated += delegate
   9:     {
  10:         statsUpdated = true;
  11:         statsUpdatedEvent.Set();
  12:     };
  13:
  14:     db.DoSomethingThatFiresEvent();
  15:
  16:     statsUpdatedEvent.WaitOne(5000,false);
  17:
  18:     Assert.IsTrue(statsUpdated);
  19: }

Improvements?

  1. The event is just part of the method. Since the event handler is an anonymous delegate, it can access the enclosing method’s local variables.
  2. Added 5,000ms timeout to the WaitOne() function to prevent hanging of unit tests.

Check out my latest book, the essential, in-depth guide to performance for all .NET developers:

Writing High-Performance.NET Code, 2nd Edition by Ben Watson. Available for pre-order:

Don’t delay your merging

Another one of those lessons learned posts. I know I’m supposed to merge changes across branches often to minimize the pain, but I didn’t do it.

Here’s the scenario: We’ve got 3 development branches: 6.3, 6.4, and 7.0 (the trunk). 6.3 and 6.4 are technically maintenance branches because we didn’t anticipate needing to them, but we are. Here’s where it gets funny. 6.4 is actually a branch off of 7.0 with some new features removed. 6.4 is the code base converted to handle unicode. 6.3 is the current development version, and it’s features need to be merged into 6.4 These two branches are rather divergent in places. It’s been months since the branches were synchronized. In addition to 20 conflicted files, the 13 localized resource DLLs can’t automatically be merged because their location changed. Yeesh…

So now I’m spending all day using DiffMerge to do these files. Not a fun day…

Lesson learned. Do frequent merges.

At least it’s Friday. Tomorrow, my wife and I are heading down to North Carolina to visit my grandmother before she heads to California for Christmas.

By the way, still no Internet at home. Comcast says it could be 28 days before the local construction company gets around to installing the new drop to our home. I’ve been getting a lot of piano practice and reading in.


Check out my latest book, the essential, in-depth guide to performance for all .NET developers:

Writing High-Performance.NET Code, 2nd Edition by Ben Watson. Available for pre-order:

Difference between ConfigurationSettings and ConfigurationManager

If you upgraded a project from .Net 1.0/1.1 to .Net 2.0, and it used application configuration files, you will soon come across the compiler warning message

‘System.Configuration.ConfigurationSettings.AppSettings’ is obsolete: ‘This method is obsolete, it has been replaced by System.Configuration!System.Configuration.ConfigurationManager.AppSettings’   

You have to add a reference to the System.Configuration assembly, but once you do, you can just do a search and replace on ConfigurationSettings.AppSettings and replace it with ConfigurationManager.AppSettings and the functionality will remain the same.

The advantage of using ConfigurationManager is laid out in the MSDN docs, but briefly:

  1. Access sections other than appConfig, including connectionStrings, and your own custom sections
  2. Can read and write the configuration.
  3. It can have separate settings for the application and the users
  4. It’s extensible

Check out my latest book, the essential, in-depth guide to performance for all .NET developers:

Writing High-Performance.NET Code, 2nd Edition by Ben Watson. Available for pre-order:

Instant Searching and Filtering in .Net – Part 4

This is the final part of my series on instant searching and filtering using C#. The only further issue that I wanted to cover was efficiently using a ListView when the items  will change so often.

ListViews already have the concept of a virtual mode, where the consumer of the class must supply the items that are displayed.

When the VirtualMode property is set to true, you also need to set the VirtualListSize property to tell the control how many items it needs to think it has.

With this simple starting point, we can have something like the following:

 

   1: _listControl.VirtualMode = true;
   2: _listControl.RetrieveVirtualItem += 
   3:     new RetrieveVirtualItemEventHandler(list_RetrieveVirtualItem);

and list_RetrieveVirtualItem could look like this:

 

   1: void list_RetrieveVirtualItem(object sender, RetrieveVirtualItemEventArgs e)
   2: {
   3:     e.Item = new ListViewItem();
   4:     e.Item.Text = "Item Text Here";
   5: }

Using this, it would be fairly easy to hook up our filtered results and away we go. But we have a few problems:

  • We don’t want to generate a new ListViewItem every time the list control asks us for one. This is extremely wasteful–thousands of objects would be springing to life and dying constantly, which is a very inefficient use of the garbage collector.
  • On the other hand, we don’t want to cache a ListViewItem object for every single line of text that was indexed. Only a few of them will be on the screen at a time, and we’ve already cached the text, so we should use that.

Therefore, we need a way to efficiently cache a limited number of ListViewItem objects and use them as necessary, updating their Text and Tag properties as necessary.

With that introduction, let’s get into the code.

Our own ListView

First things first: there’s a bug in the .Net framework that impacts virtual mode on ListView. So let’s derive our own to resolve it:

 

   1: class QuickList : System.Windows.Forms.ListView
   2: {
   3:     /// <summary>
   4:     /// This is needed to fix a bug in the framework
   5:     /// </summary>
   6:     public new int VirtualListSize
   7:     {
   8:         get
   9:         {
  10:             return base.VirtualListSize;
  11:         }
  12:         set
  13:         {
  14:             if (Items.Count > 0)
  15:             {
  16:                 EnsureVisible(0);
  17:             }
  18:             base.VirtualListSize = value;
  19:         }
  20:     }
  21: }
 

That’s all there is to that. If you would like more information about the bug, Shahar Prish wrote up an excellent description and workarounds.

With that out of the way, most of the action we’re concerned with will take place in a separate class, which I’ll call…

QuickListController

This controller class is responsible for coordinating the filter text, filtered results, and their display in the list control. In my own use of these concepts, I’ve further factored out the filtering mechanism to be outside of this controller, but in this sample it’s better to combine it all for ease of understanding.

So how will this work? We want a cache of only the items that will be displayed. Unfortunately, the ListView control is going to ask us for item 30,775 for example, if we’re scrolled down a huge list. We don’t want to cache 30,775 items when only the surrounding 20 items are displayed. Fortunately, it’s easy to track a minimal number of items if we know the first visible index

Here are the fields and properties we’ll need:

 

   1: class QuickListController
   2: {
   3:     private int _topIndex = -1;
   4:     private int _bottomIndex = -1;
   5:     private IIndexer<string> _indexer;
   6:     private IList<string> _allLines;
   7:     private IList<string> _currentResults;
   8:     private QuickList _listControl;
   9:     List<ListViewItem> _itemCache = new List<ListViewItem>();
  10:     private string _filter;
  11:     
  12:     #region Properties
  13:     public IList<string> CurrentResults
  14:     {
  15:         get
  16:         {
  17:             return _currentResults;
  18:         }
  19:     }
  20:     public string FilterText
  21:     {
  22:         get
  23:         {
  24:             return _filter;
  25:         }
  26:         set
  27:         {
  28:             _filter = value;
  29:         }
  30:     }
  31:     public IIndexer<string> Indexer
  32:     {
  33:         get
  34:         {
  35:             return _indexer;
  36:         }
  37:         set
  38:         {
  39:             _indexer = value;
  40:         }
  41:     }
  42:     public IList<string> AllLines
  43:     {
  44:         get
  45:         {
  46:             return _allLines;
  47:         }
  48:         set
  49:         {
  50:             _allLines = value;
  51:         }
  52:     }
  53:     #endregion
  54:     //methods, etc.
  55: }

Ok, lots of boring code. We track the top and bottom indexes of the visible items, we store our indexer, a list of all the results (for when no text is entered–no point in searching for nothing). _currentResults will point to either _allLines or the actual results. _itemCache is a list of ListViewItem objects that will serve as our cache. And, of course, we need the text we want to filter on.

You’ll notice that the filtered results are now strings instead of bools. In previous articles, we didn’t care about the actual results–just a count of them. Now, we need to display them, so the filtered items are the same as the keys–the lines of the text in a file.

Constructor

Construction is likewise straightforward:

   1: public QuickListController(QuickList list)
   2: {
   3:     _listControl = list;
   4:     _listControl.VirtualMode = true;
   5:     _listControl.CacheVirtualItems += 
   6:         new CacheVirtualItemsEventHandler(list_CacheVirtualItems);
   7:     _listControl.RetrieveVirtualItem += 
   8:         new RetrieveVirtualItemEventHandler(list_RetrieveVirtualItem);
   9: }

Our controller requires an instance of a QuickList object, which it then sets to virtual mode and listens for two events. The first, CacheVirtualItems, is triggered when the ListView needs a new range of objects. It gives us an opportunity to see what the new top and bottom will be and update our cache accordingly.

The RetrieveVirtualItem event is called when the ListView needs to get the item to display. As we’ll see below, handling both of these events is fairly straightforward.

CacheVirtualItems

 

   1: void list_CacheVirtualItems(object sender, CacheVirtualItemsEventArgs e)
   2: {
   3:     _topIndex = e.StartIndex;
   4:     _bottomIndex = e.EndIndex;
   5:  
   6:     int sizeNeeded = _bottomIndex - _topIndex + 2;//extra 2
   7:     int toCreate = sizeNeeded - _itemCache.Count;
   8:     if (toCreate > 0)
   9:     {
  10:         if (_itemCache.Capacity < sizeNeeded)
  11:         {
  12:             _itemCache.Capacity += toCreate;
  13:         }
  14:         for (int i = 0; i < toCreate; i++)
  15:         {
  16:             _itemCache.Add(new ListViewItem());
  17:         }
  18:     }
  19: }

We save the top and bottom indexes (we actually never use the bottom index), and we then calculate how many total items we need, how many new ones we need to create, and then create new ones if necessary.

You’ll notice that there is no mechanism for removing items from the cache. As we know, another name for cache with a bad (no) removal policy is a memory leak, but since the size of this cache is limited by screen characteristics, and in the most dire circumstance shouldn’t be more than a few hundred, we’ll let it slide (just this once).

RetrieveVirtualItem

 

   1: void list_RetrieveVirtualItem(object sender, RetrieveVirtualItemEventArgs e)
   2: {
   3:     if (e.ItemIndex < _topIndex)
   4:     {
   5:         //should never get here
   6:         Debug.Assert(false);
   7:         _topIndex = e.ItemIndex;
   8:     }
   9:     
  10:     e.Item = GetListViewItem(e.ItemIndex, _topIndex);
  11: }

You’ll notice that if block shouldn’t ever be called, but during debugging I did have an instance where that was the case. It may have been a bug that has since been fixed, but I left it in for safety.

The actual work is done another function:

GetListViewItem

 

   1: private ListViewItem GetListViewItem(int itemIndex, int topIndex)
   2: {
   3:     if (itemIndex < 0 || topIndex < 0)
   4:     {
   5:         return null;
   6:     }
   7:     int cacheIndex = itemIndex - topIndex;
   8:     
   9:     Debug.Assert(cacheIndex < _itemCache.Count);
  10:     ListViewItem item = _itemCache[cacheIndex];
  11:     item.Text = _currentResults[itemIndex];
  12:     //could set tag if you have more complex objects
  13:     //item.Tag = _currentResults[itemIndex];
  14:     return item;
  15: }

Now that you understand how this is working, this function is really simple. Again, we do a simple sanity check. We calculate the index into our cache, which is simply the offset from the top index. After retrieving the pre-created item, we set its text (you could also set the item’s tag, color, or any other property here).

Now, some of you are wondering, “but wait, where do you set the VirtualListSize property?” Yes, that is rather important, isn’t it? How is the ListView control supposed to know how many items to ask for? For that matter, where is the actual filtering done?

The Filter

We have one more function, this one a public interface to the class’s consumer:

 

   1: public void RefreshList()
   2: {
   3:     if (string.IsNullOrEmpty(_filter))
   4:     {
   5:         _currentResults = _allLines;
   6:     }
   7:     else
   8:     {
   9:         _currentResults = _indexer.Lookup(_filter);
  10:     }
  11:     _listControl.VirtualListSize = _currentResults.Count;
  12: }

If the filter text is empty, we set our current results to the full set of lines. Otherwise, run the filter. Then we set our control’s virtual size to the number of results.

Conclusion

If you’ve struggled this far, congratulations. If you find anything in any of these articles useful, please leave a comment and let me know! If you find bugs or problems with my approach, I’d love to hear about better ways of accomplishing things.

The project download contains a GUI that hooks up these classes to an easy-to-use interface.

Also, if you’ve found this useful, please Buy me a Lego!

Download the project. (run the GuiSearch project–I changed TestSearch to a library file)

Download Les Misérables for testing.

 

Technorati Tags: , , , ,

Check out my latest book, the essential, in-depth guide to performance for all .NET developers:

Writing High-Performance.NET Code, 2nd Edition by Ben Watson. Available for pre-order:

Easily Getting Last Two Digits of Year

This is an easy one, but some people don’t know it: if you need the last two digits of a year, say 2007, it’s very easy to get, without converting to a string and getting the last two characters:

 

int lastTwo = year % 100;

Now that you know, it’s obvious, right?

And if you want the last three digits? number % 1000. In general:

int lastN = number % (pow(10, digitsRequired));

What about first N digits? similarly easy:

 

int firstTwo = number / 100;
int firstN = number / pow(10,digitsRequired);

 
Now  you know, and knowing if half the battle.

Check out my latest book, the essential, in-depth guide to performance for all .NET developers:

Writing High-Performance.NET Code, 2nd Edition by Ben Watson. Available for pre-order:

Instant Searching and Filtering in .Net – Part 3

This is part three of my series on fast searching and filtering of text using C#.

The previous article developed an indexing method using a hash table. This article develops a method using a trie structure. If you don’t know tries, I highly encourage to go read about them before continuing.

This filtering method is much more complex than previous versions, but we’ll take it one step at a time.

Trie Overview

Our data structure is more complex than a hash table, and definitely more involved than a simple list. We start with a data structure called a trie that contains a list of results, and links to the next nodes, indexed by letter. The key represented by a trie is determined by the path to that trie from the root. The root represents all keys and values (or none, depending on how you look at it). The trie is built up as we index items, beginning with an empty root. An illustration would be helpful about now:

TrieStructure

This diagram shows the top portion of an example trie structure. The root node has no results (since there are no values in the path to that node). The _next variable indexes the next links in the chain. Here, there is only one link–to ‘h’. The node with the value “h” has two further links–to ‘e’ and ‘a’.

So far, so good. However, in our implementation, we limit the depth of this tree to three so the results will potentially contain many entries. For example, if we followed the ‘l’ link in the “he” node, we would get to a node with the value “hel”. If we then indexed the word “hello”, the same node would then contain “hel” and “hello” because we’ve stopped the tree growth here. You can experiment with different values, but I found limited value beyond 3.

Adding Overview

To add a new item, we need to get substrings of the key, like usual, but unlike before, we don’t need to retrieve all substrings, just the longest substrings that aren’t the initial portions of a substring already found. Clear?

No? Here’s an example:

With the previous indexer (using the hash table), the substrings of “hello” would be:

h, he, hel, e, el, ell, l, ll, llo, lo, o

Now, think about our trie structure. Is there any reason to consider ‘h’ if we’re going to consider ‘he’ anyway? Why not just store our value under the results for “he”, and then if we filter on ‘h’ , just return the unified results of the ‘h’ node and every subnode under it. We’ve just cut our memory requirements substantially.

For a trie, we only need substrings of our tree height (or shorter if a longer one doesn’t exist–i.e., at the end of keys). If our maximum keylength/tree-height is 3, then the substrings required for “hello” are now just:

hel, ell, llo, lo, o

Nice.

Once we have these substrings, we can generate the trie, inserting our value (and full key) into the results of the bottom-most trie node we can reach, creating new trie nodes as necessary.

Looking up Overview

To find all values pertaining to a filter text, we start with the root node, and look up the filter text character by character, traversing the trie. If we run out of nodes, then there are no results. IF we run out of characters the results will include all the results in the current node as well as the results in subnodes.*

(*FYI: if we stored all the results in all applicable nodes, i.e., if we stored the value “Hello” in the nodes “hel”, “he” and “h”, we could speedup searches, BUT we’d use a lot more memory…and it would be exactly equivalent to the hash table implementation–we wouldn’t gain anything.)

Data Structures

With that overview out of the way, let’s cover the basic data structures we’ll need to pull this off.

The first is something we’re familiar with, the IndexStruct–which is actually a class like in the previous article:

 1: private class IndexStruct
 2: {
 3: public UInt32 sortOrder;
 4: public string key;
 5: public T val;
 6:  
 7: public IndexStruct(string key, T val, UInt32 sortOrder)
 8: {
 9: this.key = key;
 10: this.val = val;
 11: this.sortOrder = sortOrder;
 12: }
 13: };

This time, the sort order will be important as we’ll see below.

Trie

The next data structure we need is a trie. Technically, a trie is the whole tree, but since you can define a tree as a node with subtrees, it works to define a node too.

Here are the fields and properties we’ll need for the Trie class:

 1: private class Trie
 2: {
 3: private char _val;
 4: private Dictionary<char, Trie> _next = new Dictionary<char, Trie>();
 5: private IList<IndexStruct> _results;
 6:  
 7: #region Properties
 8: public char Value
 9: {
 10: get
 11: {
 12: return _val;
 13: }
 14: set
 15: {
 16: _val = value;
 17: }
 18: }
 19:  
 20: public IList<IndexStruct> Results
 21: {
 22: get
 23: {
 24: return _results;
 25: }
 26: }
 27:  
 28: public Dictionary<char, Trie> Next
 29: {
 30: get
 31: {
 32: return _next;
 33: }
 34: }
 35:  
 36: #endregion
 37:  
 38: //...methods to follow...
 39: }

The _val fields simply refers to the character this trie represents, though we don’t actually need it. I included it just for completeness. The _next dictionary associates the next character with another Trie object. The _results list stores all items that need to be stored in the current Trie object. The properties just wrap the fields.

The constructor is simple:

 1: public Trie(char val)
 2: {
 3: _val = val;
 4: }

We need a function to add a result to the trie’s list of results:

 1: public void AddValue(IndexStruct indexStruct)
 2: {
 3: if (_results == null)
 4: {
 5: _results = new List<IndexStruct>();
 6: }
 7: _results.Add(indexStruct);
 8: }

Notice, that the node only adds results to itself–it doesn’t try to figure out if a result belongs in itself or a sub-node in _next. To do that would require that each node have knowledge about keys as well as its place in the entire tree. This logic is better kept at a higher level.

We also need to associate a new trie node with a character:

 1: public void AddNextTrie(char c, Trie nextTrie)
 2: {
 3: this.Next[c] = nextTrie;
 4: }

OK, now we need to get into something more complex. We need a function to return all the results in the current trie, plus all the results in subnodes. Also, they should all be sorted. We could just return a huge list and sort them at the highest level with something like quicksort, but I have some additional knowledge about the data. I know that it’s being added in sorted order (since that’s how I define it. If it weren’t, we could presort each node anyway once indexing was done.) So let’s assume that each node’s results are already sorted. If that’s the case, we have an ideal setup for a merge sort!

Now, it’s entirely possible that just doing quicksort at the end would be faster, but in my tests at least, the merging seemed a little faster, especially since the results were mostly sorted, which is quicksort’s worst-case scenario.

So let’s define a helper function that takes two lists of IndexStruct and returns a single merged list:

 

 1: private IList<IndexStruct> Merge(IList<IndexStruct> a, IList<IndexStruct> b)
 2: {
 3: if (a == null || a.Count == 0)
 4: {
 5: return b;
 6: }
 7: if (b == null || b.Count == 0)
 8: {
 9: return a;
 10: }
 11: int iA = 0, iB = 0;
 12: List<IndexStruct> results = new List<IndexStruct>(a.Count + b.Count);
 13: while (iA < a.Count || iB < b.Count)
 14: {
 15: if (iA < a.Count && 
 16: (iB == b.Count || a[iA].sortOrder < b[iB].sortOrder))
 17: {
 18: results.Add(a[iA]);
 19: iA++;
 20: }
 21: else if (iA<a.Count && iB<b.Count && 
 22: a[iA].sortOrder == b[iB].sortOrder)
 23: {
 24: //if they're equal, make sure we skip the other 
 25: //one so we don't add it later
 26: results.Add(a[iA]);
 27: iA++;
 28: iB++;
 29: }
 30: else
 31: {
 32: results.Add(b[iB]);
 33: iB++;
 34: }
 35: }
 36: return results;
 37: }

This function interweaves the two arrays into a single array by always grabbing the smallest item from the next positions in the two lists. Look at lines 21-28. This is critical. We need to be on the lookout for identical sortorders, which indicate the same values. We don’t want to include those twice. This situation comes up when our value is “hi-ho” and we filter on “h”–results for both “hi” and “ho” will appear, and we don’t need both.

OK, so we we have a generic merge function. Now let’s make it work on the Trie:

 

 1: public IList<IndexStruct> GetCombinedResults()
 2: {
 3: Queue<IList<IndexStruct>> queue = new Queue<IList<IndexStruct>>();
 4: //first enqueue items in this node
 5: if (_results != null && _results.Count > 0)
 6: {
 7: queue.Enqueue(_results);
 8: }
 9: 
 10: //get items from sub-nodes
 11: foreach (Trie t in _next.Values)
 12: {
 13: IList<IndexStruct> r = t.GetCombinedResults();
 14: queue.Enqueue(r);
 15: }
 16:  
 17: //merge all items together
 18: while (queue.Count > 1)
 19: {
 20: IList<IndexStruct> a = queue.Dequeue();
 21: IList<IndexStruct> b = queue.Dequeue();
 22: queue.Enqueue(Merge(a, b));
 23: }
 24: return queue.Dequeue();
 25: }

 

We maintain the list of items to merge together with a queue. We first enqueue the items in this node, then we call GetCombinedResults() for all subnodes, and enqueue those results. Finally, we merge the queued lists together, enqueueing the result, until a single list is formed.

Phew! OK, now let’s look at the rest of the indexer.

Some Helper Methods

RemoveUnneededCharacters is the same as before:

 

 1: private string RemoveUnneededCharacters(string original)
 2: {
 3: char[] array = new char[original.Length];
 4: int destIndex = 0;
 5: for (int i = 0; i < original.Length; i++)
 6: {
 7: char c = original[i];
 8: if (char.IsLetterOrDigit(c))
 9: {
 10: array[destIndex] = c;
 11: destIndex++;
 12: }
 13: }
 14: return new string(array, 0, destIndex);
 15: }

We talked about our new version of GetSubStrings() above, and here it is:

 

 1: private List<string> GetSubStrings(string key)
 2: {
 3: List<string> results = new List<string>();
 4: //we only need to return substrings that
 5: //themselves don't begin other substrings
 6: //easy to do--return first _maxKeyLength characters
 7: //or whatever's left, if shorter
 8: for (int start = 0; start < key.Length; start++)
 9: {
 10: int len = Math.Min(_maxKeyLength, key.Length - start);
 11: string sub = key.Substring(start, len);
 12: //remove this if() to speed up index creation 
 13: //at the cost of slightly longer lookup time
 14: if (key.IndexOf(sub) == start)
 15: {
 16: results.Add(sub);
 17: }
 18: }
 19: return results;
 20: }

As you can see, it’s mostly comments. Before we add our substring we make sure it’s the first occurence in the string of that particular sequence of characters (“don’t begin other substrings”). This prevents us from indexing “he” twice in the word “hehe”.

TrieIndexer

Now, to the real stuff! Actually, most of our work is done for us. So let’s declare our indexer:

 1: class TrieIndexer<T> : IIndexer<T>
 2: {
 3: private Trie _rootTrie = new Trie(/*null char goes here--HTML doesn't like it/*);
 4: private int _maxKeyLength = 3;
 5:  
 6: public TrieIndexer(int maxKeyLength)
 7: {
 8: _maxKeyLength = maxKeyLength;
 9: }
 10: //..methods to follow...
 11: }

We have our root trie node with a null character, and the constructor takes the maximum key length (which corresponds directly to the height of trie).

AddItem

 

 1: public void AddItem(string key, T value, UInt32 sortOrder)
 2: {
 3: string toAdd = key.ToLower();
 4:  
 5: toAdd = RemoveUnneededCharacters(toAdd);
 6: 
 7: IndexStruct indexStruct = new IndexStruct(toAdd, value, sortOrder);
 8:  
 9: List<string> subStrings = GetSubStrings(toAdd);
 10: 
 11: foreach (string ss in subStrings)
 12: {
 13: Trie currentTrie = _rootTrie;
 14: for (int i = 0; i < ss.Length; i++)
 15: {
 16: char c = ss[i];
 17: Trie nextTrie = null;
 18: if (!currentTrie.Next.TryGetValue(c, out nextTrie))
 19: {
 20: nextTrie = new Trie(c);
 21: currentTrie.AddNextTrie(c, nextTrie);
 22: }
 23: currentTrie = nextTrie;
 24: }
 25: currentTrie.AddValue(indexStruct);
 26: }
 27: }

Lines 3-10 are the typical preprocessing of the key, and creating a structure for it and the value to live. The fun stuff happens in 12-30. With each substring, we begin at the root, and try to branch out node-by-node, character-by-character to find the bottom-most trie in which to place our indexed value. If the next one doesn’t exist, we create it. Once we get to the bottom of the tree for this subkey we add our value to the structure.

Lookup

Now let’s turn our attention to the filtering part. It’s almost like adding new values:

 

 1: public IList<T> Lookup(string filterText)
 2: {
 3: Trie currentTrie = _rootTrie;
 4: 
 5: int maxTrieLength = Math.Min(filterText.Length, _maxKeyLength);
 6: for (int i = 0; i < maxTrieLength; i++)
 7: {
 8: Trie nextTrie = null;
 9: if (currentTrie.Next.TryGetValue(filterText[i], out nextTrie))
 10: {
 11: currentTrie = nextTrie;
 12: }
 13: else
 14: {
 15: //no results
 16: return new List<T>();
 17: }
 18: }
 19: Debug.Assert(currentTrie != null);
 20: 
 21: return GetResults(filterText, currentTrie);
 22: }

We set our current node to the root. We then figure out the length of the path we need to traverse. If our filter text is longer than our maximum key length, we only want to go as far as the key length (and vice versa).

We do the same type of lookups as in AddItem, but this time if the next node isn’t present, we just return an empty list–there were no results for that filter text.

Once we find the target node, we call another function to actually compile the results for us:

 

 1: private IList<T> GetResults(string filterText, Trie trie)
 2: {
 3: List<T> results = new List<T>();
 4:  
 5: IList<IndexStruct> preResults = trie.GetCombinedResults();
 6: if (preResults.Count <= 0)
 7: {
 8: return results;
 9: }
 10: results.Capacity = preResults.Count;
 11: uint prevSortOrder = 0;
 12: foreach (IndexStruct item in preResults)
 13: {
 14: Debug.Assert(item.sortOrder > prevSortOrder);
 15: prevSortOrder = item.sortOrder;
 16:  
 17: if (filterText.Length <= _maxKeyLength || 
 18: item.key.IndexOf(filterText, 
 19: StringComparison.InvariantCultureIgnoreCase) >= 0)
 20: {
 21: results.Add(item.val);
 22: }
 23: }
 24: return results;
 25: }

In line 5, we call GetCombinedResults() and store it in the variable preResults. Why preResults? Why aren’t they final? For one, they are the entire IndexStruct object, and we still need to extract the values. Secondly, it’s possible the user entered a longer filter text than we indexed, so we still have to do a linear walk of the list and do string searches to make sure the preResult is really valid. Thankfully, we can avoid this if the filterText is short.

Summary

And now…we’re done! This one was a beast! There are still some optimizations to be found in here, but it’s pretty good. OK, so what about performance of this thing?

Let’s run it on lesmis.txt first:

testsearch_lesmis_trie

OK, for raw speed it’s actually slower than naive–overall. But look closer. Nearly ALL the time penalty is coming from the lookup of “h”. The next results are over ten times faster than naive. Now, let’s compare to the substring method. The trie method is overall slower still, and the search times are comparable (except for the initial search). But, look at memory usage: the trie method uses less than HALF what the substring method used. Also, the index creation time is 4 times faster. Mixed bag, but impressive none-the-less.

Now let’s run it on huge.txt:

testsearch_huge_trie

Ouch, over a second–but again, it’s all because of that initial search for “h”. All the other times are about the same. Comparing to the substring method, it uses almost a third of the memory, and takes a third of the time to create the index.

So what can we say about this method?

Pros:

  • finding 0 results can be VERY fast (3-4 lookups in hash-tables to determine if a short filter text isn’t present).
  • memory use is much better than enormous hash tables.
  • Index creation time is better than other method.
  • Searching for filter texts longer than a single character can be very fast.

Cons:

  • short filter text is bad–has to combine lots of trie nodes.

(Note: in the course of writing this article, I changed my implementation to no longer need the Finish() function, which was part of the IIndexer<T> interface. I know I promised we’d use it, but I don’t need it!)

Download project files

Next time…

So where do we go from here? I have some notes about implementing this paired with a ListView control. Stay tuned for part 4!

 


Check out my latest book, the essential, in-depth guide to performance for all .NET developers:

Writing High-Performance.NET Code, 2nd Edition by Ben Watson. Available for pre-order:

Instant Searching and Filtering in .Net – Part 2

This is part two of my series on fast searching/filtering of text using C#.

In the previous article, we developed the filtering interface, built up a testing framework and implemented a naive indexer. For many purposes, that indexer performs more than adequately. Still, there are other possible implementations that might work better (or not…let’s wait and see).

SubString Indexer

In this installment, let’s develop something based on hash tables. With O(1) look-up time, they could be the ticket to blazing fast lookups.

We reuse the same internal structure as the NaiveIndexer, except I’ve changed it to a class. It needs to be reused many, many times for the same value so it’s much more efficient to share the object around instead of make copies:

   1: private class IndexStruct
   2: {
   3:     public string key;
   4:     public T val;
   5:  
   6:     public IndexStruct(string key, T val)
   7:     {
   8:         this.key = key;
   9:         this.val = val;
  10:     }
  11: };

The fields are thus:

   1: private int _maxKeyLength = 999;
   2: private Dictionary<int, List<IndexStruct>> _hashes;

Notice again that we still don’t have to keep track of sort order in the IndexStruct. This is because down at the bottom of the data structure, all the data is still stored in List<> objects. I’m still breaking the law of abstraction by associating sort order to the order in which I add items, but…hey, it’s just an example.

Constructor

Let’s take a look at the constructor and figure out what the maximum key length is for, and a few other issues we have to handle:

   1: public SubStringIndexer(int numItems, int maxKeyLength)
   2: {
   3:     _maxKeyLength = maxKeyLength;
   4:     _hashes = new Dictionary<int, List<IndexStruct>>(numItems);
   5: }

As you can see, the constructor takes two arguments.

The number of items is used to initialize the Dictionary<> ( a hash table class) to the number of items we can expect. If you know you’re hash table theory,  you know that the size of a hash table should ideally be a prime number much larger than the number of items you want to insert. I have not bothered to do that here and experimentation did not show a significant benefit. Also, the hash table will automatically expand itself anyway for a certain load factor.

The maxKeyLength parameter is crucial. Since this indexer works by calculating substrings, it’s important to specify just how big those substrings can be. There’s a tradeoff here. The longer the maximum, the more substrings can be precalculated, and the faster the searches will be. However, you pay an enormous price in memory usage. We’ll see that price below when we run this example. I’ve chosen 3 as a fairly good balance between speed and space.

Helper Functions

Before we override our interface methods, let’s define some helper functions we’ll need. The first is something we’re familiar with:

   1: private string RemoveUnneededCharacters(string original)
   2: {
   3:     char[] array = new char[original.Length];
   4:     int destIndex = 0;
   5:     for (int i = 0; i < original.Length; i++)
   6:     {
   7:         char c = original[i];
   8:         if (char.IsLetterOrDigit(c))
   9:         {
  10:             array[destIndex] = c;
  11:             destIndex++;
  12:         }
  13:     }
  14:     return new string(array, 0, destIndex);
  15: }

Just as with the naive indexer (in fact, with all the indexers), we need to strip out unimportant characters.

One of the most important functions we’ll need is something to generate substrings given a key.

   1: private List<string> GetSubStrings(string key)
   2: {
   3:     List<string> results = new List<string>();
   4:  
   5:     for (int start = 0; start < key.Length; start++)
   6:     {
   7:         /*get maximum length of substring based on current
   8:          * character position (constrain it to within the string
   9:          * and less than or equal to the maximum key length specified
  10:          * */
  11:         int lastLength = Math.Min(key.Length - start, _maxKeyLength);
  12:  
  13:         /* Get each substring from length 1 to lastLength
  14:          */
  15:         for (int length = 1; length <= lastLength; length++)
  16:         {
  17:             string sub = key.Substring(start, length);
  18:             if (!results.Contains(sub))
  19:             {
  20:                 results.Add(sub);
  21:             }
  22:         }
  23:     }
  24:     return results;
  25: }

GetSubStrings returns a list of all substrings of length 1 to _maxKeyLength. If you think about it, you can see why limiting this number to a small number is a good idea. If you have thousands of different keys, each of a fairly sizable length, you will generate thousands and thousands of unique substrings, not to mention how long it will take (a very long time).

Now let’s look at how this indexer works.

Overview

Here’s the way it works. Each substring of a key is converted to a hash number, which is the index into the hash table. The hash of the substring is used instead of the substring itself to avoid storing the substrings in the hash table’s list of keys–just for memory reasons.

The value of each slot in the table is a list of items. Lookup works by first narrowing down the list by doing a hash lookup on the filter text, then doing a linear search through all the items returned from the hash table. We’ll see the details below.

AddItem

   1: public void AddItem(string key, T value, UInt32 sortOrder)
   2: {
   3:     string toAdd = key.ToLower();
   4:  
   5:     toAdd = RemoveUnneededCharacters(toAdd);
   6:  
   7:     List<string> subStrings = GetSubStrings(toAdd);
   8:     IndexStruct indexStruct = new IndexStruct(toAdd, value);
   9:     foreach (string str in subStrings)
  10:     {
  11:         List<IndexStruct> items = null;
  12:         int hash = str.GetHashCode();
  13:  
  14:         bool alreadyExists = _hashes.TryGetValue(hash, out items);
  15:         if (!alreadyExists)
  16:         {
  17:             items = new List<IndexStruct>();
  18:             _hashes[hash] = items;
  19:         }
  20:         items.Add(indexStruct);
  21:     }
  22: }

After normalizing the key (lower-case, alphanumeric), we get the valid substrings. For each of those, we calculate it’s hash and try to look it up in our hash table. If it doesn’t exist, we create a new list for that substring. Then we add the new entry to that list.

Lookup

Lookup is a little more complicated, but still straightforward enough.

   1: public IList<T> Lookup(string subKey)
   2: {
   3:     string toLookup = subKey.ToLower();
   4:     List<IndexStruct> items = null;
   5:     List<T> results = new List<T>();
   6:     int hash = 0;
   7:  
   8:     if (subKey.Length > _maxKeyLength)
   9:     {
  10:         /*
  11:          * If the substring is too long, get the longest substring 
  12:          * we've indexed and use that for the initial search
  13:          */
  14:         toLookup = toLookup.Substring(0, _maxKeyLength);
  15:         hash = toLookup.GetHashCode();
  16:     }
  17:     else
  18:     {
  19:         hash = toLookup.GetHashCode();
  20:     }
  21:  
  22:     bool found = _hashes.TryGetValue(hash, out items);
  23:     if (found)
  24:     {
  25:         results.Capacity = items.Count;
  26:         foreach (IndexStruct s in items)
  27:         {
  28:             /*
  29:              * Have to check each item in this bucket's list
  30:              * because the substring might be longer than the indexed
  31:              * keys
  32:              */
  33:             if (s.key.IndexOf(subKey, StringComparison.InvariantCultureIgnoreCase) >= 0)
  34:             {
  35:                 results.Add(s.val);
  36:             }
  37:         }
  38:     }
  39:     
  40:     return results;
  41: }

We first convert the subKey parameter (what we’re searching on) to lower case to normalize it. If that text is longer than the maximum subkey we’ve indexed, we trim it down to match the maximum size. Then we calculate the hash code and see if it’s in our list. If it isn’t found, there are no results and we return the empty list.

If a list was returned, we still have to go search through the entire list and do string searches to make sure our entire subkey is present in the key before adding it to the result set.

(If we needed to be concerned about sort order, we would do another post-processing step and sort the results list by sortorder.)

Testing

So let’s see how this works in practice by running it against the same lesmis.txt file.

testsearch_capture2

It’s nearly 3 times faster! But at what cost? It now takes about 8 seconds to create the index in the first place, and it uses 31 MB of memory. Ouch!

But I wonder….what if I run this against truly huge data sets…

I create a million-line file out of various books available at the Gutenberg project. First, let’s run the naive indexer:

testsearch_huge_naive

Now the naive way is taking over a second to do the search. What about our new substring indexer?

testsearch_huge_substring

It takes nearly a minute to create the index, but lookups are now almost 5 times faster than the naive version–it definitely scales better. Well, except for that 270 MB index size!

Summary

So, this is an improvement in some ways, but it has some big costs (index creation time, index size). Next time, I’ll show yet another way that has some advantages.

Download sample project.

 


Check out my latest book, the essential, in-depth guide to performance for all .NET developers:

Writing High-Performance.NET Code, 2nd Edition by Ben Watson. Available for pre-order:

Instant Searching and Filtering in .Net – Part 1

Even though I recently wrote about just using naive algorithms when they’re sufficient, it helps to know about other options and their characteristics. With that mind, I’m beginning a little series (4 parts planned–I’ll update this list as I go along) in C# documenting how I developed a few different approaches to doing fast (instant) filtering of large lists.

The article index:

  1. Getting Started – Indexing interface, test driver, and naive algorithm (this article)
  2. SubString indexer
  3. Trie indexer
  4. Efficient usage of ListView with filtering

Why Filtering?

I recently changed some internal apps at work do use filtering instead of column sorting of large listviews. It took a few days for the users to get used to it, but I’ve gotten comments back that it is amazingly better. The reason is that you are hiding all sorts of data you don’t need to look at. This is better in many cases than sorting, especially if you’re sorting 10,000 items. It’s also fairly intuitive for users to use.

The Requirements

Our indexer must have certain capabilities, so says me:

  1. Store any type of data
  2. Keys are strings
  3. Searches can be done on subkeys (i.e., if a key is “valjean”, doing a search for “lje” will work.
  4. Maintain sorted order, if desired

The Interface

The first thing we need to do is define a common interface that all our indexers will implement. This will make it very easy to swap them out when comparing different implementations.

Before showing the code, let’s discuss exactly what the indexer needs to do. Basically, the indexer has to accomplish two tasks:

  1. Index an item according to its key
  2. Lookup a key or subkey and return a list of items

We’ll assume that all keys will be strings because the point here is to do search, and search require strings. The actual values stored in the index, however, can be anything, so let’s make sure the interface supports generics.

Another potential problem is sort order. The internal index representation may not respect the order (for example, if a hash table were used, items are definitely not maintained in a sorted order). Our interface should provide a way to handle this.

It’s also possible that an indexer may want to know when indexing is finished so it can clean up any temporary data it may have created. Our first example will not use this, but we’ll put it in the interface now anyway.

So without further ado, here is our interface in all her glory:

   1: interface IIndexer<T>
   2: {
   3:     void AddItem(string key, T value, UInt32 sortOrder);
   4:  
   5:     IList<T> Lookup(string subKey);
   6:  
   7:     void FinishIndex();
   8: }

Just three tiny functions. That’s all we need to get started.

Naive Indexer

With our interface designed we can quickly build a simple indexer. The algorithm for this one is brain-dead simple:

  • Store key/value pairs in a list when added
  • When a search is done, loop through all keys and do a substring lookup on each key. If substring is in key, add that value to a list. Return the list when done.

I did say this was naive.

Let’s start our implementation by deriving a new class from IIndexer<T>:

   1: class NaiveIndexer<T> : IIndexer<T>
   2: {
   3:     
   4: }

No constructor is needed so let’s jump into the data structures required. For one, we need to store the key and value we’re adding, so let’s make a private structure to hold those together as well as a member variable list of structures to hold our data.

   1: private struct ItemStruct
   2: {
   3:     public string _key;
   4:     public T _value;
   5:     //public ushort _sortOrder;
   6: };
   7:  
   8: List<ItemStruct> _items = new List<ItemStruct>();

I include the sort order merely to show where it could go. I’ve left it commented it out in my implementation because I’m adding things in sorted order and the List<T> will keep things sorted for me. This is probably breaking the abstraction, but it’s easy enough for you to add it back in if you want. (Like I said, I wanted this to be as easy and light-weight as possible, so a leaky abstraction is acceptable in this case).

With these data structures, adding a new item is a piece of cake. We just create a new instance of the structure, set the fields, and add it to _items.

   1: public void AddItem(string key, T value, UInt32 sortOrder)
   2: {
   3:     string realKey = RemoveUnneededCharacters(key);
   4:     ItemStruct itemStruct = new ItemStruct();
   5:     itemStruct._key = realKey;
   6:     itemStruct._value = value;
   7:     //itemStruct._sortOrder = sortOrder;
   8:     //could insert into place of sortOrder, after a grow, if desired
   9:     _items.Add(itemStruct);
  10: }

Woah, hold on! RemoveUnneededCharacters? Well, in my application I only wanted to index alphanumeric characters. This function strips all others from a string and returns the “real” key to use. Of course, during lookups, you’ll have to be similarly careful to sanitize the input to prevent searching on stripped characters.

   1: private string RemoveUnneededCharacters(string original)
   2: {
   3:     char[] array = new char[original.Length];
   4:     int destIndex = 0;
   5:     for (int i = 0; i < original.Length; i++)
   6:     {
   7:         char c = original[i];
   8:         if (char.IsLetterOrDigit(c))
   9:         {
  10:             array[destIndex] = c;
  11:             destIndex++;
  12:         }
  13:     }
  14:     return new string(array, 0, destIndex);
  15: }

I think that’s a fairly efficient way of stripping characters that doesn’t rely on creating temporary strings, but if you have a better way, I’d love to see it.

Doing searches is similarly straightforward. We loop through each ItemResult in _items and see if our subkey is in the key. If it is, add it to a result list. After all are searched, return the result list.

   1: public IList<T> Lookup(string subKey)
   2: {
   3:     List<T> results = new List<T>();
   4:     foreach (ItemStruct itemStruct in _items)
   5:     {
   6:         if (itemStruct._key.IndexOf(subKey, 
   7:             StringComparison.InvariantCultureIgnoreCase)>=0)
   8:         {
   9:             results.Add(itemStruct._value);
  10:         }
  11:     }
  12:     //Sort on results[i]._sortOrder, if desired
  13:     return results;
  14: }

And that’s it! We now have a fully-functioning naive indexer. Now, let’s see how can test it and set the foundation for comparison of all the indexers we’re build.

Test Harness

I’m not going to include all of the code for the test harness here–you can find it in the sample code download. A simple description of it will suffice.

The test harness will take as input a filename, the number of items to index, what to search for (or filter on), and the indexing method. This will allow us to easily add other indexing methods as we develop them. Of course, being a simple test harness, there is no error-handling.

The items to be indexed will be lines from a text file. I’ve supplied the text of Les Misérables by Victor Hugo from Project Gutenburg, but you can use any text file you want (the larger the better). Les Misérables has about 70,000 lines. This is actually fairly small-medium.

Since, the reason for creating these indexers arose out of search-as-you-type functionality in one of my apps, the harness does progressive lookups on successive substrings of the key you type. I.e., if you do a search on “valjean”, it will first search “v”, then “va”, then “val”, etc.

Here is a screenshot of a sample run (click to enlarge):

testsearch_capture

 

Optimization

I mentioned above that the need for this arose from doing filter-as-you-type functionality in one of my applications. This realization can lead to a major optimization, which I have not implemented in this example. If you’re doing successive searches, first searching for “v”, then “va”, then “val”, etc. You can cache the search filter and results of the previous query, and then on the next search instead of looking through the entire _items list, you can just look through the cached results instead. First, you just check to make sure that the previous filter is a substring of the current filter.

Summary

Next time, I’ll develop another indexing method that has some additional advantages and disadvantages. Using the test hardness, we can compare the different algorithms under different conditions. Stay tuned.

(I’m out of town this weekend, so comments will be approved after I return home)

Sample Project Download

Les Misérables

 


Check out my latest book, the essential, in-depth guide to performance for all .NET developers:

Writing High-Performance.NET Code, 2nd Edition by Ben Watson. Available for pre-order: