Tag Archives: programming

Instant Searching and Filtering in .Net – Part 3

This is part three of my series on fast searching and filtering of text using C#.

The previous article developed an indexing method using a hash table. This article develops a method using a trie structure. If you don’t know tries, I highly encourage to go read about them before continuing.

This filtering method is much more complex than previous versions, but we’ll take it one step at a time.

Trie Overview

Our data structure is more complex than a hash table, and definitely more involved than a simple list. We start with a data structure called a trie that contains a list of results, and links to the next nodes, indexed by letter. The key represented by a trie is determined by the path to that trie from the root. The root represents all keys and values (or none, depending on how you look at it). The trie is built up as we index items, beginning with an empty root. An illustration would be helpful about now:

TrieStructure

This diagram shows the top portion of an example trie structure. The root node has no results (since there are no values in the path to that node). The _next variable indexes the next links in the chain. Here, there is only one link–to ‘h’. The node with the value “h” has two further links–to ‘e’ and ‘a’.

So far, so good. However, in our implementation, we limit the depth of this tree to three so the results will potentially contain many entries. For example, if we followed the ‘l’ link in the “he” node, we would get to a node with the value “hel”. If we then indexed the word “hello”, the same node would then contain “hel” and “hello” because we’ve stopped the tree growth here. You can experiment with different values, but I found limited value beyond 3.

Adding Overview

To add a new item, we need to get substrings of the key, like usual, but unlike before, we don’t need to retrieve all substrings, just the longest substrings that aren’t the initial portions of a substring already found. Clear?

No? Here’s an example:

With the previous indexer (using the hash table), the substrings of “hello” would be:

h, he, hel, e, el, ell, l, ll, llo, lo, o

Now, think about our trie structure. Is there any reason to consider ‘h’ if we’re going to consider ‘he’ anyway? Why not just store our value under the results for “he”, and then if we filter on ‘h’ , just return the unified results of the ‘h’ node and every subnode under it. We’ve just cut our memory requirements substantially.

For a trie, we only need substrings of our tree height (or shorter if a longer one doesn’t exist–i.e., at the end of keys). If our maximum keylength/tree-height is 3, then the substrings required for “hello” are now just:

hel, ell, llo, lo, o

Nice.

Once we have these substrings, we can generate the trie, inserting our value (and full key) into the results of the bottom-most trie node we can reach, creating new trie nodes as necessary.

Looking up Overview

To find all values pertaining to a filter text, we start with the root node, and look up the filter text character by character, traversing the trie. If we run out of nodes, then there are no results. IF we run out of characters the results will include all the results in the current node as well as the results in subnodes.*

(*FYI: if we stored all the results in all applicable nodes, i.e., if we stored the value “Hello” in the nodes “hel”, “he” and “h”, we could speedup searches, BUT we’d use a lot more memory…and it would be exactly equivalent to the hash table implementation–we wouldn’t gain anything.)

Data Structures

With that overview out of the way, let’s cover the basic data structures we’ll need to pull this off.

The first is something we’re familiar with, the IndexStruct–which is actually a class like in the previous article:

 1: private class IndexStruct
 2: {
 3: public UInt32 sortOrder;
 4: public string key;
 5: public T val;
 6:  
 7: public IndexStruct(string key, T val, UInt32 sortOrder)
 8: {
 9: this.key = key;
 10: this.val = val;
 11: this.sortOrder = sortOrder;
 12: }
 13: };

This time, the sort order will be important as we’ll see below.

Trie

The next data structure we need is a trie. Technically, a trie is the whole tree, but since you can define a tree as a node with subtrees, it works to define a node too.

Here are the fields and properties we’ll need for the Trie class:

 1: private class Trie
 2: {
 3: private char _val;
 4: private Dictionary<char, Trie> _next = new Dictionary<char, Trie>();
 5: private IList<IndexStruct> _results;
 6:  
 7: #region Properties
 8: public char Value
 9: {
 10: get
 11: {
 12: return _val;
 13: }
 14: set
 15: {
 16: _val = value;
 17: }
 18: }
 19:  
 20: public IList<IndexStruct> Results
 21: {
 22: get
 23: {
 24: return _results;
 25: }
 26: }
 27:  
 28: public Dictionary<char, Trie> Next
 29: {
 30: get
 31: {
 32: return _next;
 33: }
 34: }
 35:  
 36: #endregion
 37:  
 38: //...methods to follow...
 39: }

The _val fields simply refers to the character this trie represents, though we don’t actually need it. I included it just for completeness. The _next dictionary associates the next character with another Trie object. The _results list stores all items that need to be stored in the current Trie object. The properties just wrap the fields.

The constructor is simple:

 1: public Trie(char val)
 2: {
 3: _val = val;
 4: }

We need a function to add a result to the trie’s list of results:

 1: public void AddValue(IndexStruct indexStruct)
 2: {
 3: if (_results == null)
 4: {
 5: _results = new List<IndexStruct>();
 6: }
 7: _results.Add(indexStruct);
 8: }

Notice, that the node only adds results to itself–it doesn’t try to figure out if a result belongs in itself or a sub-node in _next. To do that would require that each node have knowledge about keys as well as its place in the entire tree. This logic is better kept at a higher level.

We also need to associate a new trie node with a character:

 1: public void AddNextTrie(char c, Trie nextTrie)
 2: {
 3: this.Next[c] = nextTrie;
 4: }

OK, now we need to get into something more complex. We need a function to return all the results in the current trie, plus all the results in subnodes. Also, they should all be sorted. We could just return a huge list and sort them at the highest level with something like quicksort, but I have some additional knowledge about the data. I know that it’s being added in sorted order (since that’s how I define it. If it weren’t, we could presort each node anyway once indexing was done.) So let’s assume that each node’s results are already sorted. If that’s the case, we have an ideal setup for a merge sort!

Now, it’s entirely possible that just doing quicksort at the end would be faster, but in my tests at least, the merging seemed a little faster, especially since the results were mostly sorted, which is quicksort’s worst-case scenario.

So let’s define a helper function that takes two lists of IndexStruct and returns a single merged list:

 

 1: private IList<IndexStruct> Merge(IList<IndexStruct> a, IList<IndexStruct> b)
 2: {
 3: if (a == null || a.Count == 0)
 4: {
 5: return b;
 6: }
 7: if (b == null || b.Count == 0)
 8: {
 9: return a;
 10: }
 11: int iA = 0, iB = 0;
 12: List<IndexStruct> results = new List<IndexStruct>(a.Count + b.Count);
 13: while (iA < a.Count || iB < b.Count)
 14: {
 15: if (iA < a.Count && 
 16: (iB == b.Count || a[iA].sortOrder < b[iB].sortOrder))
 17: {
 18: results.Add(a[iA]);
 19: iA++;
 20: }
 21: else if (iA<a.Count && iB<b.Count && 
 22: a[iA].sortOrder == b[iB].sortOrder)
 23: {
 24: //if they're equal, make sure we skip the other 
 25: //one so we don't add it later
 26: results.Add(a[iA]);
 27: iA++;
 28: iB++;
 29: }
 30: else
 31: {
 32: results.Add(b[iB]);
 33: iB++;
 34: }
 35: }
 36: return results;
 37: }

This function interweaves the two arrays into a single array by always grabbing the smallest item from the next positions in the two lists. Look at lines 21-28. This is critical. We need to be on the lookout for identical sortorders, which indicate the same values. We don’t want to include those twice. This situation comes up when our value is “hi-ho” and we filter on “h”–results for both “hi” and “ho” will appear, and we don’t need both.

OK, so we we have a generic merge function. Now let’s make it work on the Trie:

 

 1: public IList<IndexStruct> GetCombinedResults()
 2: {
 3: Queue<IList<IndexStruct>> queue = new Queue<IList<IndexStruct>>();
 4: //first enqueue items in this node
 5: if (_results != null && _results.Count > 0)
 6: {
 7: queue.Enqueue(_results);
 8: }
 9: 
 10: //get items from sub-nodes
 11: foreach (Trie t in _next.Values)
 12: {
 13: IList<IndexStruct> r = t.GetCombinedResults();
 14: queue.Enqueue(r);
 15: }
 16:  
 17: //merge all items together
 18: while (queue.Count > 1)
 19: {
 20: IList<IndexStruct> a = queue.Dequeue();
 21: IList<IndexStruct> b = queue.Dequeue();
 22: queue.Enqueue(Merge(a, b));
 23: }
 24: return queue.Dequeue();
 25: }

 

We maintain the list of items to merge together with a queue. We first enqueue the items in this node, then we call GetCombinedResults() for all subnodes, and enqueue those results. Finally, we merge the queued lists together, enqueueing the result, until a single list is formed.

Phew! OK, now let’s look at the rest of the indexer.

Some Helper Methods

RemoveUnneededCharacters is the same as before:

 

 1: private string RemoveUnneededCharacters(string original)
 2: {
 3: char[] array = new char[original.Length];
 4: int destIndex = 0;
 5: for (int i = 0; i < original.Length; i++)
 6: {
 7: char c = original[i];
 8: if (char.IsLetterOrDigit(c))
 9: {
 10: array[destIndex] = c;
 11: destIndex++;
 12: }
 13: }
 14: return new string(array, 0, destIndex);
 15: }

We talked about our new version of GetSubStrings() above, and here it is:

 

 1: private List<string> GetSubStrings(string key)
 2: {
 3: List<string> results = new List<string>();
 4: //we only need to return substrings that
 5: //themselves don't begin other substrings
 6: //easy to do--return first _maxKeyLength characters
 7: //or whatever's left, if shorter
 8: for (int start = 0; start < key.Length; start++)
 9: {
 10: int len = Math.Min(_maxKeyLength, key.Length - start);
 11: string sub = key.Substring(start, len);
 12: //remove this if() to speed up index creation 
 13: //at the cost of slightly longer lookup time
 14: if (key.IndexOf(sub) == start)
 15: {
 16: results.Add(sub);
 17: }
 18: }
 19: return results;
 20: }

As you can see, it’s mostly comments. Before we add our substring we make sure it’s the first occurence in the string of that particular sequence of characters (“don’t begin other substrings”). This prevents us from indexing “he” twice in the word “hehe”.

TrieIndexer

Now, to the real stuff! Actually, most of our work is done for us. So let’s declare our indexer:

 1: class TrieIndexer<T> : IIndexer<T>
 2: {
 3: private Trie _rootTrie = new Trie(/*null char goes here--HTML doesn't like it/*);
 4: private int _maxKeyLength = 3;
 5:  
 6: public TrieIndexer(int maxKeyLength)
 7: {
 8: _maxKeyLength = maxKeyLength;
 9: }
 10: //..methods to follow...
 11: }

We have our root trie node with a null character, and the constructor takes the maximum key length (which corresponds directly to the height of trie).

AddItem

 

 1: public void AddItem(string key, T value, UInt32 sortOrder)
 2: {
 3: string toAdd = key.ToLower();
 4:  
 5: toAdd = RemoveUnneededCharacters(toAdd);
 6: 
 7: IndexStruct indexStruct = new IndexStruct(toAdd, value, sortOrder);
 8:  
 9: List<string> subStrings = GetSubStrings(toAdd);
 10: 
 11: foreach (string ss in subStrings)
 12: {
 13: Trie currentTrie = _rootTrie;
 14: for (int i = 0; i < ss.Length; i++)
 15: {
 16: char c = ss[i];
 17: Trie nextTrie = null;
 18: if (!currentTrie.Next.TryGetValue(c, out nextTrie))
 19: {
 20: nextTrie = new Trie(c);
 21: currentTrie.AddNextTrie(c, nextTrie);
 22: }
 23: currentTrie = nextTrie;
 24: }
 25: currentTrie.AddValue(indexStruct);
 26: }
 27: }

Lines 3-10 are the typical preprocessing of the key, and creating a structure for it and the value to live. The fun stuff happens in 12-30. With each substring, we begin at the root, and try to branch out node-by-node, character-by-character to find the bottom-most trie in which to place our indexed value. If the next one doesn’t exist, we create it. Once we get to the bottom of the tree for this subkey we add our value to the structure.

Lookup

Now let’s turn our attention to the filtering part. It’s almost like adding new values:

 

 1: public IList<T> Lookup(string filterText)
 2: {
 3: Trie currentTrie = _rootTrie;
 4: 
 5: int maxTrieLength = Math.Min(filterText.Length, _maxKeyLength);
 6: for (int i = 0; i < maxTrieLength; i++)
 7: {
 8: Trie nextTrie = null;
 9: if (currentTrie.Next.TryGetValue(filterText[i], out nextTrie))
 10: {
 11: currentTrie = nextTrie;
 12: }
 13: else
 14: {
 15: //no results
 16: return new List<T>();
 17: }
 18: }
 19: Debug.Assert(currentTrie != null);
 20: 
 21: return GetResults(filterText, currentTrie);
 22: }

We set our current node to the root. We then figure out the length of the path we need to traverse. If our filter text is longer than our maximum key length, we only want to go as far as the key length (and vice versa).

We do the same type of lookups as in AddItem, but this time if the next node isn’t present, we just return an empty list–there were no results for that filter text.

Once we find the target node, we call another function to actually compile the results for us:

 

 1: private IList<T> GetResults(string filterText, Trie trie)
 2: {
 3: List<T> results = new List<T>();
 4:  
 5: IList<IndexStruct> preResults = trie.GetCombinedResults();
 6: if (preResults.Count <= 0)
 7: {
 8: return results;
 9: }
 10: results.Capacity = preResults.Count;
 11: uint prevSortOrder = 0;
 12: foreach (IndexStruct item in preResults)
 13: {
 14: Debug.Assert(item.sortOrder > prevSortOrder);
 15: prevSortOrder = item.sortOrder;
 16:  
 17: if (filterText.Length <= _maxKeyLength || 
 18: item.key.IndexOf(filterText, 
 19: StringComparison.InvariantCultureIgnoreCase) >= 0)
 20: {
 21: results.Add(item.val);
 22: }
 23: }
 24: return results;
 25: }

In line 5, we call GetCombinedResults() and store it in the variable preResults. Why preResults? Why aren’t they final? For one, they are the entire IndexStruct object, and we still need to extract the values. Secondly, it’s possible the user entered a longer filter text than we indexed, so we still have to do a linear walk of the list and do string searches to make sure the preResult is really valid. Thankfully, we can avoid this if the filterText is short.

Summary

And now…we’re done! This one was a beast! There are still some optimizations to be found in here, but it’s pretty good. OK, so what about performance of this thing?

Let’s run it on lesmis.txt first:

testsearch_lesmis_trie

OK, for raw speed it’s actually slower than naive–overall. But look closer. Nearly ALL the time penalty is coming from the lookup of “h”. The next results are over ten times faster than naive. Now, let’s compare to the substring method. The trie method is overall slower still, and the search times are comparable (except for the initial search). But, look at memory usage: the trie method uses less than HALF what the substring method used. Also, the index creation time is 4 times faster. Mixed bag, but impressive none-the-less.

Now let’s run it on huge.txt:

testsearch_huge_trie

Ouch, over a second–but again, it’s all because of that initial search for “h”. All the other times are about the same. Comparing to the substring method, it uses almost a third of the memory, and takes a third of the time to create the index.

So what can we say about this method?

Pros:

  • finding 0 results can be VERY fast (3-4 lookups in hash-tables to determine if a short filter text isn’t present).
  • memory use is much better than enormous hash tables.
  • Index creation time is better than other method.
  • Searching for filter texts longer than a single character can be very fast.

Cons:

  • short filter text is bad–has to combine lots of trie nodes.

(Note: in the course of writing this article, I changed my implementation to no longer need the Finish() function, which was part of the IIndexer<T> interface. I know I promised we’d use it, but I don’t need it!)

Download project files

Next time…

So where do we go from here? I have some notes about implementing this paired with a ListView control. Stay tuned for part 4!

 

Instant Searching and Filtering in .Net – Part 2

This is part two of my series on fast searching/filtering of text using C#.

In the previous article, we developed the filtering interface, built up a testing framework and implemented a naive indexer. For many purposes, that indexer performs more than adequately. Still, there are other possible implementations that might work better (or not…let’s wait and see).

SubString Indexer

In this installment, let’s develop something based on hash tables. With O(1) look-up time, they could be the ticket to blazing fast lookups.

We reuse the same internal structure as the NaiveIndexer, except I’ve changed it to a class. It needs to be reused many, many times for the same value so it’s much more efficient to share the object around instead of make copies:

   1: private class IndexStruct
   2: {
   3:     public string key;
   4:     public T val;
   5:  
   6:     public IndexStruct(string key, T val)
   7:     {
   8:         this.key = key;
   9:         this.val = val;
  10:     }
  11: };

The fields are thus:

   1: private int _maxKeyLength = 999;
   2: private Dictionary<int, List<IndexStruct>> _hashes;

Notice again that we still don’t have to keep track of sort order in the IndexStruct. This is because down at the bottom of the data structure, all the data is still stored in List<> objects. I’m still breaking the law of abstraction by associating sort order to the order in which I add items, but…hey, it’s just an example.

Constructor

Let’s take a look at the constructor and figure out what the maximum key length is for, and a few other issues we have to handle:

   1: public SubStringIndexer(int numItems, int maxKeyLength)
   2: {
   3:     _maxKeyLength = maxKeyLength;
   4:     _hashes = new Dictionary<int, List<IndexStruct>>(numItems);
   5: }

As you can see, the constructor takes two arguments.

The number of items is used to initialize the Dictionary<> ( a hash table class) to the number of items we can expect. If you know you’re hash table theory,  you know that the size of a hash table should ideally be a prime number much larger than the number of items you want to insert. I have not bothered to do that here and experimentation did not show a significant benefit. Also, the hash table will automatically expand itself anyway for a certain load factor.

The maxKeyLength parameter is crucial. Since this indexer works by calculating substrings, it’s important to specify just how big those substrings can be. There’s a tradeoff here. The longer the maximum, the more substrings can be precalculated, and the faster the searches will be. However, you pay an enormous price in memory usage. We’ll see that price below when we run this example. I’ve chosen 3 as a fairly good balance between speed and space.

Helper Functions

Before we override our interface methods, let’s define some helper functions we’ll need. The first is something we’re familiar with:

   1: private string RemoveUnneededCharacters(string original)
   2: {
   3:     char[] array = new char[original.Length];
   4:     int destIndex = 0;
   5:     for (int i = 0; i < original.Length; i++)
   6:     {
   7:         char c = original[i];
   8:         if (char.IsLetterOrDigit(c))
   9:         {
  10:             array[destIndex] = c;
  11:             destIndex++;
  12:         }
  13:     }
  14:     return new string(array, 0, destIndex);
  15: }

Just as with the naive indexer (in fact, with all the indexers), we need to strip out unimportant characters.

One of the most important functions we’ll need is something to generate substrings given a key.

   1: private List<string> GetSubStrings(string key)
   2: {
   3:     List<string> results = new List<string>();
   4:  
   5:     for (int start = 0; start < key.Length; start++)
   6:     {
   7:         /*get maximum length of substring based on current
   8:          * character position (constrain it to within the string
   9:          * and less than or equal to the maximum key length specified
  10:          * */
  11:         int lastLength = Math.Min(key.Length - start, _maxKeyLength);
  12:  
  13:         /* Get each substring from length 1 to lastLength
  14:          */
  15:         for (int length = 1; length <= lastLength; length++)
  16:         {
  17:             string sub = key.Substring(start, length);
  18:             if (!results.Contains(sub))
  19:             {
  20:                 results.Add(sub);
  21:             }
  22:         }
  23:     }
  24:     return results;
  25: }

GetSubStrings returns a list of all substrings of length 1 to _maxKeyLength. If you think about it, you can see why limiting this number to a small number is a good idea. If you have thousands of different keys, each of a fairly sizable length, you will generate thousands and thousands of unique substrings, not to mention how long it will take (a very long time).

Now let’s look at how this indexer works.

Overview

Here’s the way it works. Each substring of a key is converted to a hash number, which is the index into the hash table. The hash of the substring is used instead of the substring itself to avoid storing the substrings in the hash table’s list of keys–just for memory reasons.

The value of each slot in the table is a list of items. Lookup works by first narrowing down the list by doing a hash lookup on the filter text, then doing a linear search through all the items returned from the hash table. We’ll see the details below.

AddItem

   1: public void AddItem(string key, T value, UInt32 sortOrder)
   2: {
   3:     string toAdd = key.ToLower();
   4:  
   5:     toAdd = RemoveUnneededCharacters(toAdd);
   6:  
   7:     List<string> subStrings = GetSubStrings(toAdd);
   8:     IndexStruct indexStruct = new IndexStruct(toAdd, value);
   9:     foreach (string str in subStrings)
  10:     {
  11:         List<IndexStruct> items = null;
  12:         int hash = str.GetHashCode();
  13:  
  14:         bool alreadyExists = _hashes.TryGetValue(hash, out items);
  15:         if (!alreadyExists)
  16:         {
  17:             items = new List<IndexStruct>();
  18:             _hashes[hash] = items;
  19:         }
  20:         items.Add(indexStruct);
  21:     }
  22: }

After normalizing the key (lower-case, alphanumeric), we get the valid substrings. For each of those, we calculate it’s hash and try to look it up in our hash table. If it doesn’t exist, we create a new list for that substring. Then we add the new entry to that list.

Lookup

Lookup is a little more complicated, but still straightforward enough.

   1: public IList<T> Lookup(string subKey)
   2: {
   3:     string toLookup = subKey.ToLower();
   4:     List<IndexStruct> items = null;
   5:     List<T> results = new List<T>();
   6:     int hash = 0;
   7:  
   8:     if (subKey.Length > _maxKeyLength)
   9:     {
  10:         /*
  11:          * If the substring is too long, get the longest substring 
  12:          * we've indexed and use that for the initial search
  13:          */
  14:         toLookup = toLookup.Substring(0, _maxKeyLength);
  15:         hash = toLookup.GetHashCode();
  16:     }
  17:     else
  18:     {
  19:         hash = toLookup.GetHashCode();
  20:     }
  21:  
  22:     bool found = _hashes.TryGetValue(hash, out items);
  23:     if (found)
  24:     {
  25:         results.Capacity = items.Count;
  26:         foreach (IndexStruct s in items)
  27:         {
  28:             /*
  29:              * Have to check each item in this bucket's list
  30:              * because the substring might be longer than the indexed
  31:              * keys
  32:              */
  33:             if (s.key.IndexOf(subKey, StringComparison.InvariantCultureIgnoreCase) >= 0)
  34:             {
  35:                 results.Add(s.val);
  36:             }
  37:         }
  38:     }
  39:     
  40:     return results;
  41: }

We first convert the subKey parameter (what we’re searching on) to lower case to normalize it. If that text is longer than the maximum subkey we’ve indexed, we trim it down to match the maximum size. Then we calculate the hash code and see if it’s in our list. If it isn’t found, there are no results and we return the empty list.

If a list was returned, we still have to go search through the entire list and do string searches to make sure our entire subkey is present in the key before adding it to the result set.

(If we needed to be concerned about sort order, we would do another post-processing step and sort the results list by sortorder.)

Testing

So let’s see how this works in practice by running it against the same lesmis.txt file.

testsearch_capture2

It’s nearly 3 times faster! But at what cost? It now takes about 8 seconds to create the index in the first place, and it uses 31 MB of memory. Ouch!

But I wonder….what if I run this against truly huge data sets…

I create a million-line file out of various books available at the Gutenberg project. First, let’s run the naive indexer:

testsearch_huge_naive

Now the naive way is taking over a second to do the search. What about our new substring indexer?

testsearch_huge_substring

It takes nearly a minute to create the index, but lookups are now almost 5 times faster than the naive version–it definitely scales better. Well, except for that 270 MB index size!

Summary

So, this is an improvement in some ways, but it has some big costs (index creation time, index size). Next time, I’ll show yet another way that has some advantages.

Download sample project.

 

Instant Searching and Filtering in .Net – Part 1

Even though I recently wrote about just using naive algorithms when they’re sufficient, it helps to know about other options and their characteristics. With that mind, I’m beginning a little series (4 parts planned–I’ll update this list as I go along) in C# documenting how I developed a few different approaches to doing fast (instant) filtering of large lists.

The article index:

  1. Getting Started – Indexing interface, test driver, and naive algorithm (this article)
  2. SubString indexer
  3. Trie indexer
  4. Efficient usage of ListView with filtering

Why Filtering?

I recently changed some internal apps at work do use filtering instead of column sorting of large listviews. It took a few days for the users to get used to it, but I’ve gotten comments back that it is amazingly better. The reason is that you are hiding all sorts of data you don’t need to look at. This is better in many cases than sorting, especially if you’re sorting 10,000 items. It’s also fairly intuitive for users to use.

The Requirements

Our indexer must have certain capabilities, so says me:

  1. Store any type of data
  2. Keys are strings
  3. Searches can be done on subkeys (i.e., if a key is “valjean”, doing a search for “lje” will work.
  4. Maintain sorted order, if desired

The Interface

The first thing we need to do is define a common interface that all our indexers will implement. This will make it very easy to swap them out when comparing different implementations.

Before showing the code, let’s discuss exactly what the indexer needs to do. Basically, the indexer has to accomplish two tasks:

  1. Index an item according to its key
  2. Lookup a key or subkey and return a list of items

We’ll assume that all keys will be strings because the point here is to do search, and search require strings. The actual values stored in the index, however, can be anything, so let’s make sure the interface supports generics.

Another potential problem is sort order. The internal index representation may not respect the order (for example, if a hash table were used, items are definitely not maintained in a sorted order). Our interface should provide a way to handle this.

It’s also possible that an indexer may want to know when indexing is finished so it can clean up any temporary data it may have created. Our first example will not use this, but we’ll put it in the interface now anyway.

So without further ado, here is our interface in all her glory:

   1: interface IIndexer<T>
   2: {
   3:     void AddItem(string key, T value, UInt32 sortOrder);
   4:  
   5:     IList<T> Lookup(string subKey);
   6:  
   7:     void FinishIndex();
   8: }

Just three tiny functions. That’s all we need to get started.

Naive Indexer

With our interface designed we can quickly build a simple indexer. The algorithm for this one is brain-dead simple:

  • Store key/value pairs in a list when added
  • When a search is done, loop through all keys and do a substring lookup on each key. If substring is in key, add that value to a list. Return the list when done.

I did say this was naive.

Let’s start our implementation by deriving a new class from IIndexer<T>:

   1: class NaiveIndexer<T> : IIndexer<T>
   2: {
   3:     
   4: }

No constructor is needed so let’s jump into the data structures required. For one, we need to store the key and value we’re adding, so let’s make a private structure to hold those together as well as a member variable list of structures to hold our data.

   1: private struct ItemStruct
   2: {
   3:     public string _key;
   4:     public T _value;
   5:     //public ushort _sortOrder;
   6: };
   7:  
   8: List<ItemStruct> _items = new List<ItemStruct>();

I include the sort order merely to show where it could go. I’ve left it commented it out in my implementation because I’m adding things in sorted order and the List<T> will keep things sorted for me. This is probably breaking the abstraction, but it’s easy enough for you to add it back in if you want. (Like I said, I wanted this to be as easy and light-weight as possible, so a leaky abstraction is acceptable in this case).

With these data structures, adding a new item is a piece of cake. We just create a new instance of the structure, set the fields, and add it to _items.

   1: public void AddItem(string key, T value, UInt32 sortOrder)
   2: {
   3:     string realKey = RemoveUnneededCharacters(key);
   4:     ItemStruct itemStruct = new ItemStruct();
   5:     itemStruct._key = realKey;
   6:     itemStruct._value = value;
   7:     //itemStruct._sortOrder = sortOrder;
   8:     //could insert into place of sortOrder, after a grow, if desired
   9:     _items.Add(itemStruct);
  10: }

Woah, hold on! RemoveUnneededCharacters? Well, in my application I only wanted to index alphanumeric characters. This function strips all others from a string and returns the “real” key to use. Of course, during lookups, you’ll have to be similarly careful to sanitize the input to prevent searching on stripped characters.

   1: private string RemoveUnneededCharacters(string original)
   2: {
   3:     char[] array = new char[original.Length];
   4:     int destIndex = 0;
   5:     for (int i = 0; i < original.Length; i++)
   6:     {
   7:         char c = original[i];
   8:         if (char.IsLetterOrDigit(c))
   9:         {
  10:             array[destIndex] = c;
  11:             destIndex++;
  12:         }
  13:     }
  14:     return new string(array, 0, destIndex);
  15: }

I think that’s a fairly efficient way of stripping characters that doesn’t rely on creating temporary strings, but if you have a better way, I’d love to see it.

Doing searches is similarly straightforward. We loop through each ItemResult in _items and see if our subkey is in the key. If it is, add it to a result list. After all are searched, return the result list.

   1: public IList<T> Lookup(string subKey)
   2: {
   3:     List<T> results = new List<T>();
   4:     foreach (ItemStruct itemStruct in _items)
   5:     {
   6:         if (itemStruct._key.IndexOf(subKey, 
   7:             StringComparison.InvariantCultureIgnoreCase)>=0)
   8:         {
   9:             results.Add(itemStruct._value);
  10:         }
  11:     }
  12:     //Sort on results[i]._sortOrder, if desired
  13:     return results;
  14: }

And that’s it! We now have a fully-functioning naive indexer. Now, let’s see how can test it and set the foundation for comparison of all the indexers we’re build.

Test Harness

I’m not going to include all of the code for the test harness here–you can find it in the sample code download. A simple description of it will suffice.

The test harness will take as input a filename, the number of items to index, what to search for (or filter on), and the indexing method. This will allow us to easily add other indexing methods as we develop them. Of course, being a simple test harness, there is no error-handling.

The items to be indexed will be lines from a text file. I’ve supplied the text of Les Misérables by Victor Hugo from Project Gutenburg, but you can use any text file you want (the larger the better). Les Misérables has about 70,000 lines. This is actually fairly small-medium.

Since, the reason for creating these indexers arose out of search-as-you-type functionality in one of my apps, the harness does progressive lookups on successive substrings of the key you type. I.e., if you do a search on “valjean”, it will first search “v”, then “va”, then “val”, etc.

Here is a screenshot of a sample run (click to enlarge):

testsearch_capture

 

Optimization

I mentioned above that the need for this arose from doing filter-as-you-type functionality in one of my applications. This realization can lead to a major optimization, which I have not implemented in this example. If you’re doing successive searches, first searching for “v”, then “va”, then “val”, etc. You can cache the search filter and results of the previous query, and then on the next search instead of looking through the entire _items list, you can just look through the cached results instead. First, you just check to make sure that the previous filter is a substring of the current filter.

Summary

Next time, I’ll develop another indexing method that has some additional advantages and disadvantages. Using the test hardness, we can compare the different algorithms under different conditions. Stay tuned.

(I’m out of town this weekend, so comments will be approved after I return home)

Sample Project Download

Les Misérables

 

Serial Port Code Library

Need to interact with some hardware via serial port? I’m having to do some serial port communications in a project at work, and it is officially Not Fun. There was some original code that was doing only reads from the port, and I needed to some significant functionality that would not be possible with the existing code. So rather than writing a ton of code from scratch (well, other than what I will have to do anyway), I tried to find some open-source, commercial-friendly serial port libraries on the Internet.

The only one I could find that seemed to be perfectly suited to the task at hand was Serial Library for C++ by Ramon de Klein. It has a number of classes for various ways of accessing the serial port and handling events, interacting with Windows. It also supports MFC.

The license is LGPL, which makes it’s ok for me to use in the project. As Ramon’s article notes, doing serial port communications from scratch is hard. So reuse what is out there.

If you’re in .Net, you already have some nice classes in the form of System.IO.Ports.SerialPort, which was added in 2.0.

Technorati Tags: , , ,

A Visual Studio that’s easier on the eyes

After you’ve looked at Visual Studio all day for a few days in a row, the brightness of the white background can really start to bother you, especially as LCD monitors get brighter and brighter. That’s why I’ve become a big fan of Dave Reed’s Dark Side theme for Visual Studio 2005. It took me a few hours to get used to it, but now I’m hooked.

The only thing I changed was the font. I really like Consolas, size 13. Courier New is Courier-Old-and-No-Longer-Used.

At work, I still have to use VS2003 for a project, and keeping it in the older, white background really helps me distinguish which environment I’m in.

The users are in control

I really enjoyed and appreciated this essay from Raganwald about the user experience at work versus that of their home PC environments (among other topics).

I particularly liked the point:

And meanwhile, the very same users could walk across the street and buy themselves a much better PC for less money than we pay and take it home the same day.

Ain’t that the truth. I put together my Core 2 Duo system for the same price as my crappy Pentium 4 hyperthreaded number at work. The time frames were not that far apart. The Core 2 runs circles around this sick puppy.

A company’s philosophy should be to get users (especially developers like me!) whatever hardware/software they need immediately. Within minutes or hours, not days or weeks. Of course, then you have to trust your employees to make good requests. But if you don’t trust them to know what they need, why trust them to do their job at all?

The essay goes on to talk about writing applications that take advantage of modern PC horsepower. I think I’m doing an ok job of this at work now. For example, we have a database of assets that is continually growing. It used to be we could view all of the assets on a single page that took about 30 seconds to load off-site.

Now that list will take several minutes to bring up. Yeah, we’re growing. So we need tools to help manage all of that information. One thing I’m building right now (as soon as I’m done writing this, as a matter of fact) is a quick filtering functionality on a desktop app that talks to the database. The list of assets is filtered as you type, taking advantage of the fast PCs we have these days.

That’s just one example. I can think of others that are immediately useful in business apps:

  • better visualization – it takes time and thought to develop good data visualization, but the results are usually worth it
  • drag & drop support – make sense to drag assets from a customer to another? I don’t know, maybe.
  • dynamic views – use all that processing power to show something more interesting than fields on a scrolling form. Graphics views that change in response to context
  • track history, undo/redo – might make sense in some contexts
  • attach more meaningful information – pictures, videos, documents, whatever. – with stuff like WPF, it’s easier than ever to display varied content

Technorati Tags: , , ,

How to measure memory use in .Net programs

In developing an important component (which I will discuss soon) for my current personal project, there were a number of different algorithms which I could use to attack the problem. I wasn’t sure which one would be better so I decided to implement each of them and measure their time/memory usage. I have a series of articles planned on those, but first I need to mention how I did the measurement.

 

For the timing, I simply wrapped the API functions for QueryPerformanceCounter and QueryPerformanceFrequency into a C# class. There are plenty of examples out there on the net to do this.

The memory usage is even simpler. The function you need is GC.GetTotalMemory(bool forceFullCollection). This function returns the amount of memory allocated. The little program below demonstrates how to use it.

using System;

namespace MemUsageTest
{
    class Program
    {
        static void Main(string[] args)
        {
            long memStart = GC.GetTotalMemory(true);
            Int32[] myInt = new Int32[4];
            long memEnd = GC.GetTotalMemory(true);

            long diff = memEnd - memStart;

            Console.WriteLine("{0:N0} bytes used", diff);
        }
    }
}

The output on my 32-bit system is “40 bytes used”–16 bytes for integers and 24 bytes of array overhead.

Passing true to GetTotalMemory is important–it gives the garbage collector an opportunity to reclaim all unused memory. There is a caveat, though. If you read the documentation, you’ll notice it says it attempts to discover the amount of bytes in use, and that setting forceFullCollection only gives it an opportunity and waiting a bit before returning. It does NOT guarantee that all unused memory is reclaimed.

As far as my work goes, I’ve noticed that it does a pretty reliable and consistent job.

Technorati Tags: , ,

Thoughts on Process: Automation (and examples)

If a process, or part of a process, can be automated it should be. For example, in a project at work, part of our process is to make sure that every dialog present in the English resources of an MFC application is also present in all the other dialogs. We do this manually by loading each language into the app and triggering every dialog. This is tedious, time-consuming, and error-prone. Did I mention we have 12 languages?

A better way would be to have a utility that quickly parses a resource file and extracts the dialog names and compares that list to all the other languages. It could even by expanded to do a control-by-control comparison to make sure all of those were present.

At some point, of course, the dialogs have to be visually inspected for spacing issues in other languages, but for an automated check, it can find the most common errors pretty quickly.

I am a huge fan of automating processes like this. There’s no reason to waste brainpower on highly-repetitive tasks. Even if it takes me a whole day to write a tool, it’s usually worth it.

Examples of automated processes:

  • Build process – if you’re building and packaging your product manually, you’re doing it wrong. All modern environments include command-line, scriptable components. When we build our flagship product, I type “make” and ten minutes later there’s a setup.exe in a distribution folder for our team.
  • Unit testing – Unit testing suites are usually automated to some degree (you hit a button, all tests run), but it can be taken further by running them during builds or as part of a continuous integration server.
  • Documentation/Change-logs – whenever we produce a new build, we send out an e-mail to internal staff about the changes. These are usually culled from the check-in comments in subversion. It’s a manual process. It might be nice to automatically dump them to a file, which can then be edited instead of written from scratch.
  • Code checks – this can encompass almost anything, but having static analysis tools is invaluable. Like the utility that I mentioned earlier to compare resource files across all the languages we support, it can save tons of manual, “stupid” labor.
  • Loading source onto a new machine – How long does it take to get up and running on a new machine? Admittedly, you shouldn’t be switching machines all THAT often, but when you do how easy is it to grab the source and start debugging? Are all the required libraries and tools in source control and automatically configured by build scripts?
  • E-mail – If you’re like me, you get scores if not hundreds of e-mails a day. How are you organizing, sorting, responding, ignoring, deleting them? Setup filters to put them into different folders, highlight or tag them when certain keywords appear. Also, get Google Desktop Search or Windows Desktop Search. I like both of them, but I’m currently using Google’s version. I may switch back in a while.
  • Bug reporting – While not strictly about automation, I think it’s close enough. Reporting bugs and code changes in a text file is good for a while–if you’re the only one working, and you have a small number of them to deal with. Once you start involving more programmers, and perhaps a manager who wants to see some basic reports, the text file doesn’t cut it. Get a simple bug reporting tool. I use BugTracker.Net because it’s easy, simple and does exactly what we need with minimum fuss. How do I know what to work on? I open up a web page and it tells me. I’ve automated not only some manual labor, but also some needless thought processes.
  • Calendaring – Do you need to write a weekly report for your manager? Keep track of employee’s vacation schedules? Use Outlook’s (or whatever PIM you choose) task list and calendar for anything you need to remember about a specific date. Set reminders for when you need to think about them, and then forget about them.
  • Data production – if you’re in a production environment generating data that needs to be analyzed, create tools to do as much of it as possible. Of course, the tools need to be checked for correctness, but once you’re confident, do it and don’t look back.

There are many, many ways you can optimize, reduce, and automated the work you’re doing. Remember, the whole point is to get rid of the “dumb” work and let yourself concentrate on the important, creative things.

Technorati Tags: , , , , , , ,

More on Google interview

This a follow-up to my previous post about my interview process with Google. Once a post gets as long as that one did, I’m sure to forget to say some things. Rather than updating that post, I thought I had enough new to say to warrant a new post.

First is the picture I got of their development process. There are plenty of other places on the Internet about their development process, so I won’t go into detail about what they told me–it pretty much matches up with the available information. It really sounds like they try to match the amount of process required to the specific project at hand. Projects with a huge public impact have lots of process (Google’s front page, indexing, etc.), while those that are newer and much lower impact (stuff in the Labs section, and even graduates of the Labs) have a much more flexible, agile process, designed to get improvements out the door very quickly. I like that–no mandatory bureaucracy where it doesn’t make sense.

Aside from process, however, it seems that they are very intent on giving developers an environment designed to help them succeed. From what I understood, the company actively tries to remove stupid barriers to productivity (needless paperwork, poor IT, bad workstations) and give you whatever you need to do your job how you think best. Obviously, there are rules and standards, but it just sounded more flexible. It really sounded like an ideal development environment: Obstacles removed, needs granted. Now, how much of that is the official “show” they put on for all interviews, who knows, but Google is obviously doing something right.

Bottom-line is that Google is a company of engineers for engineers. They’re the ones in charge of what the company does. That is a very nice place to be if you love coding.

Also, I should mention that the Google Boston office is MUCH smaller than their Mountain View headquarters. The way things are done, while it will still be “Googly”, will most likely have a different feel and pace than at headquarters. I had read many reports on the web about how people worked late hours, on weekends, and basically sacrificed their lives for the company. I did NOT get that impression in Boston. They were definitely smart and very hard working, but it sounded more like the company was flexible and if you got your work done, who cares? (That’s the way things ought to be done for sufficiently self-motivated employees). I did ask about inordinate over-time (mistake on my part?) and work-life balance and I came away with a satisfactory impression. Whether this means Boston is special, or the accounts I read on the Internet were not representative, I don’t know. Probably a lot of the latter, for sure.

I also wanted to address my final link in my last post. I know it can be a little disappointing to read that kind of post and realize it’s not talking about you, because you’re interviewing for jobs. I wouldn’t take it too literally. Maybe my link text is a little black and white. I think the principle is definitely valid, though. The better you are, the more freedom you have to choose where you work and what you work on and the less chance your going to fall into a company’s hiring process. It’s really more about statistics from a company’s point of view of finding the best, not necessarily for individuals.

Hopefully, that’s all I have to say on the subject, but if you have questions, just leave them in the comments and I’ll try to answer them!

My interview experience with Google

(See also part 2 of this article).

A few months ago I received an e-mail from a recruiter at Google asking for an opportunity to talk to me about available development positions. Needless to say, I was pretty excited. I’m fairly happy in my current job, but–it’s GOOGLE. You don’t say no to an interview opportunity at Google.

Like this article? Check out my latest book, Writing High-Performance .NET Code.

I’m writing this account in order to contribute to the meager resources available on the Internet about the Google interview experience. I signed an NDA, so I’m not going to say what the specific questions were, but I think I can give a pretty good idea of my experience. I apologize right now for the length.

I traded a few e-mails with a recruiter in Mountain View. I had a phone conversation with him, wherein he asked me general questions about my skills, desired work locations (giving me a choice of Santa Monica, Mountain View, and Boston). I have no desire to live in California, so I chose Boston. I was then passed to another recruiter, who setup a phone interview with an engineer in Mountain View. There was a false start, when they couldn’t do the interview at the original time, so we postponed.

The phone interview went very quickly. He was very nice and asked about my specific talents, things I enjoy doing, and projects I’d worked on–especially those I listed on my resume. He asked about the ray tracer I wrote in college, since he had an interest in that. He also asked some general questions about the stuff I do for work. Then he got into the technical question. It was an interesting problem, and I asked follow-up questions, talked out loud, wrote things down in front of me (and told him what I was writing and why). I immediately thought of the naive solution–always a good place to start. He was interested in the asymptotic complexity. I knew there were better ways of doing it, so I started thinking of optimizations to the algorithm, trying to come up with ways of caching information, reusing previously-computed values, etc. He gave me some gentle prodding, and I think I understood immediately where he was going. I answered the question fairly well, I though.

And that was it–just a single question. I was surprised. The entire thing lasted less than 30 minutes. I was almost disappointed, and thought–“well, that’s that–I won’t hear back.” I really wasn’t expecting any follow-up.

The next week, I got an e-mail from my recruiter who said I had impressed and was going to get the opportunity for an in-person interview in Boston! They hooked me up to a travel coordinator, as well as the recruiter in Boston.

Very exciting. I had a convenient time to go, so I set that up, took time off from work and went up to Boston, staying in the Cambridge Marriott. Very nice hotel. 40″ flat screen TV in the room ( which I never turned on). All expenses paid for, of course. 🙂 I did have to pay for hotel and food up front, and save the receipts. (And yes, I promptly received a reimbursement check from them a few weeks after I sent them in.)

I arrived on Monday afternoon, figured out Logan International (a very confusing airport, I thought), and got myself to Cambridge, in the heart of MIT, an hour or so later. I checked in, then went walking. I found the building Google is in on the very next block from the hotel. They have a floor in a building that MIT leases to startups, tech incubators, and the like. There are plenty of news articles about the Google Boston office–just…you know, Google for them.

I walked past the ultimate geek bookstore–

Quantum Books. Discount tech books. COOL. I would definitely have to stop there later. Then I got some cheap, awful Chinese food at the food court

right under the hotel. Why? When I could go out on Google’s dime? I think I was just tired and wanted to get back to the hotel soon and start studying.

I ate dinner in the room, took pictures of the wonderful view of the Boston skyline.

Boston Skyline (Day)

Boston Skyline (night)

Studying

What did I study? I brought two books with me: Robert Sedgewick’s Algorithms in C++, and a C++ reference manual. I went over advanced C++ topics, STL, simple sorting and searching algorithms, properties of graphs, big-O, and anything else that popped into my head.

By far the most valuable thing I did was write out algorithms before-hand. I picked something and wrote it out by hand in a notebook. It was hard going at first, but I knew it was the best thing I could do to prepare. I did selection and insertion sort in both array and list form. I did string reversal, bit reversal, bit counting, and binary search. All by hand, without looking in a book at all. As well you might know those simple algorithms, it’s harder than it sounds.

I went to bed relatively early–9:30, and woke up the next morning at about 6. I went to breakfast in the hotel restaurant, got a rather large meal, and then headed to my room to study more. I wrote more algorithms and redid some I had done the previous night.

Oh, I also wrote down in my notebook (beginning on the plane ride up) questions for Google, as well as answers to questions they could ask me (standard interview fare–projects, favorite project, languages, strengths, passions, getting along with people).

My interview was scheduled for 10 am–I checked out at 9:30 and left with my bag (I had everything in a single bag I could carry–it was very heavy) and sat in a little square for a few minutes. At about 9:50, I went in, took the elevator, and was greeted with:

.

. (ready for it?)

.

The Google

Dr. Seuss land! Yes, that was my first thought. I think the door was green, the reception area was very colorful. The receptionist was very nice and asked me to sign in on a computer, which printed a name badge for me. They had some research papers by Google employees on a wall, so I grabbed a couple (their hard drive failure study, and map/reduce). After a few minutes, my Boston recruiter came out and greeted me, offered me a drink from their free fridge, and took me to a small conference room, furnished, it appears, from Ikea. It was simple, clean, and very nice. There was a white board. I would get to know that whiteboard very well.

My first interviewer came in and we got started. I talked about my projects for a bit, they answered my questions, and then we got to the problem. Each interviewer asked me to solve a single problem (could be some sub-problems involved), and I had to do it on paper or on the board. I wrote C/C++ code. They take note of what you write and ask you further questions, especially if your first solution isn’t optimal.

I tried to take extra care with my code and not let stupid mistakes creep through. I used good variable/function names, made sure my braces matched, and I ran through the algorithm on the board once I had written it.

The first interview was one of the toughest. I was more nervous. I think I made more mistakes–I didn’t see things as quickly as I did later.

I had three interviews before lunch. They then handed me off to someone else who would not be evaluating me, but would just be an escort for lunch. The Google cafeterias in Mountain View are legendary, but the Boston office is far too small to warrant such lavishness. Instead, they have a catered lunch every day. It was wonderful. They also have all the free drinks and candy you could want, available all the time. I spent much of the time asking my escort questions about Google, what he was working on (couldn’t tell me), the area, the office, the commute. We were also joined by the head of the office, who didn’t realize I was an interviewee, and we had a nice conversation as well.

Lunch was an hour, and then I was back in the conference room. I had two more interviews. Then the recruiter came back in at about 3:15 or so and debriefed me–asked me my impressions, how I felt. I reiterated what I had told him on the phone previously, and that morning before we started: that I was trying to take this as easy and nonchalantly as possible, to have fun, and learn, and let it be a good experience. I had a job that I enjoyed, and didn’t NEED this one, but I think I would do well there and enjoy it very much. They came to me, after all.

I think by the end of the day, I was really pulling that off well. Once I got over my nervousness in the first interview, I really did have fun and enjoy it.

General Notes

They didn’t ask me any stupid questions. None of this “what’s your biggest weakness?” garbage. Not even the recruiter asked me anything like that. Nothing silly at all. They also didn’t ask me easy technical questions. They got right into the problems and the code. I had to describe an algorithm for something and write code for it. It was serious, they were all nice–I met people with serious reputations online. I felt like they were respecting me as a fellow programmer–almost. I wasn’t one of them, but they really wanted to see if I could become one of them.

I did receive prompts to get through certain problems, but not overly so. I answered every question they asked. Some I answered better than others, but the ones I didn’t get right away, I had alternate solutions, and I understood where they were going as soon as they started talking about it.

Why I didn’t get the job

Well, companies these days won’t tell you why. I think they fear it opens them up to lawsuits. I think that’s silly. It prevents those of who really do want to learn and improve from knowing what we’re deficient in. Oh well. They told me that they thought I would do well at Google, but that it wasn’t a good fit at the time, and I should apply again in the future. (Of course, I didn’t apply in the first place.)

My suspicions, however, are that I lean too much towards Microsoft technologies. I do a lot of work in .Net. That’s where more and more of my experience is. I do consider myself very good in C++, but I’m doing more and more C# work. I’ve always been a Microsoft Windows developer.

I also am not really interested in web-centric technologies, and I told them so. I’m more interested in client apps on the desktop, and server apps.

Of course, it’s very possible I just didn’t answer the questions to their satisfaction–that I needed more prompting than I should have. Oh well.

It could also be that my GPA wasn’t what they wanted. I goofed off my freshman year of undergraduate work. I really hurt my grades. I came back, though, and got straight A’s for my last few years where I took the hard CS classes. I also got straight A’s in my Master’s program while I was working full time. I don’t think this was the issue, but it’s possible.

Lessons

  1. Having your own web-site is a no-brainer. Do it. Update and maintain it.
  2. Do personal projects. You must be passionate, you must love programming. You must be good at it. Not average. You must aspire to excellence and be working towards that.
  3. Know what you’re talking about it. Don’t show off–just display your knowledge as it applies to what they asked you.
  4. Use an interview with them as a learning opportunity and ensure you have a good experience regardless of the outcome.
  5. Don’t take the interview too seriously. Make sure that not everything rides on the outcome. You must be comfortable, confident. You must try for success in every possible way, but yet be completely prepared to fail–and have that be ok. This is a hard balance to achieve, but I think it can really make you have a healthy outlook and have a good time while still showing your best self.
  6. If you don’t get an offer, realize that even being selected for an on-site interview by Google puts you above the ranks of the average-good programmers. Google gets 1,500 resumes a day. You’re already in the < 1% percentile.
  7. Practice writing code by hand in a notebook. This helped more than I can express. Do stuff that’s hard that you don’t know how to do immediately. Time yourself. Make the problem more challenging. If you can’t stretch yourself, they will and you’ll probably break. They did not ask me to do any of the specific questions I had practiced–but the experience writing out and the THINKING is what helped.
  8. Be able to explain your background (90% technical / 10% personal) in a few words. At some point you’ll be asked.
  9. Have a lot of questions for people. You’re interviewing them too. Make sure they’re good questions. Asking about salary is not a good question before you’ve been made an offer.
  10. I let the interview build my own self-confidence. I have no doubt that I could walk into an interview anywhere else and it would be laughably easy.
  11. Don’t ignore obvious, simple solutions. Sometimes a table lookup is better than an O(n) algorithm.
  12. Bring a good, FUN book for the plane ride back. On the way, I focused on the interview, but on the way home I wanted to do anything but, so I had my current novel (Dickens’ Bleak House–very good book, by the way).

If you do all of those steps, it actually doesn’t really matter if you apply to Google or any other great/famous company–you’ll probably get the job you want for the pay you want anyway. Somebody, sooner or later, will come across you and give you the opportunity.

Great programmers will rarely, if ever, need to look for jobs.

I hope this long, rambling essay is helpful to some. I make no claim that my experience is typical or that I’m being completely objective. In other words, YMMV.

Read part 2 of this article