Search Part VIII: Online Compressed Prefix Tries

In the previous blog post, I discussed autocomplete with wildcards.  That implementation was based on simple prefix tries, where each character in each word maintained a separate node.  Insertion, removal, and traversal of the simple prefix trie are easy: we simply traverse along the path, character by character, for the word being looked up.  If we fall off the path, we know the word isn’t in the dictionary; if we don’t, we can determine whether the word is in the dictionary by checking whether terminal = true, and traverse further to generate the set of all suffixes with that prefix.
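That lookup can be recapped in a few lines. Here is a minimal sketch, assuming the node shape from that post (a `children` map per node plus a `terminal` flag):

```javascript
// Simple (uncompressed) trie lookup: one node per character.
function lookup(root, word) {
    var node = root;
    for (var i = 0; i < word.length; i++) {
        node = node.children[word[i]];
        if (!node) { return false; }   // fell off the path: not in the dictionary
    }
    return node.terminal === true;     // stayed on the path: check the terminal flag
}

// e.g. a tiny trie containing only "cat":
var root = { terminal: false, children: {
    c: { terminal: false, children: {
        a: { terminal: false, children: {
            t: { terminal: true, children: {} } } } } } } };
// lookup(root, 'cat') is true; lookup(root, 'ca') and lookup(root, 'cab') are false
```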

In this blog post, I take that implementation one step further and implement insert, remove, and traverse for a compressed prefix trie.  What are the advantages of a compressed prefix trie?  Essentially, compared with a simple prefix trie, the implementation is superior in every way except implementation complexity.

See the implementation here (and feel free to play with it): Search Part VIII: Compressed/Optimized Prefix Tries

Advantages of Compression

As stated above, the advantages of compression are numerous.  Here are a few:

  • Significantly reduced memory footprint.  In my personal tests, I ended up using around 66% less memory.
  • Faster traversal and results generation.  Since a compressed trie can scan along the value kept in each node instead of traversing down the trie node by node, lookup is significantly faster as well (up to 66% faster).
  • Faster construction of the trie.  The same advantages apply.

A compressed trie

What is a Compressed Trie?

A compressed trie builds upon a normal trie by eliminating redundant edges.  In my implementation, I compress down each branch (representing each suffix).  I could go one step further and use pointers to existing substrings in the trie; I’ve heard of implementations using that approach that need only 16% of the memory of a full-blown trie.

The best way to explain how a compressed trie works is to compare the non-compressed and compressed versions (* represents terminal nodes):

2.1.7 :007 > a.prefix_trie.print
(trie printout elided)

Total size: 39 nodes

Compare this with a compressed trie:

(trie printout elided)
 => nil

Total size: 9 nodes

Thus, right off the bat, you can see that the size has shrunk by 30 nodes, or roughly 75%.  Of course, the real memory savings are a little harder to quantify, since each flattened node is now larger, holding a longer value.

What changed from the uncompressed to the compressed trie?  Essentially, each branch of the trie whose nodes had only one child could be compressed into a single node.  For example, pineapple is the only word that starts with p in the dictionary, so that entire branch could be compressed down into one node.

In practice, a flattening algorithm would look like this:

        var node = this;
        while (node.num_children() === 1 && !node.terminal) {
            // absorb the only child into this node
            var child = node.children[Object.keys(node.children)[0]];
            var grandchildren = child.children;
            node.children = grandchildren;
            node.c += child.c;
            node.terminal = child.terminal;
            // re-point the grandchildren at their new parent
            _.each(grandchildren, function(gc) {
                gc.parent = node;
            });
            if (child.terminal) { break; }
        }
This algorithm simply walks along the chain, squashing any redundant edges along the way. Next, if we apply this across the whole trie, we can optimize every branch.

    this.optimize = function() {
        var self = this;
        var queue = [];
        Object.keys(self.children).forEach(function(k) {
            queue.push(self.children[k]);
        });
        while (queue.length > 0) {
            var node = queue[0];
            node.flatten(); // squash this branch, as above
            Object.keys(node.children).forEach(function(k) {
                queue.push(node.children[k]);
            });
            queue = queue.splice(1); // drop the processed head
        }
    };

The next step is to rewrite the traversal algorithm, because we can no longer naively use each character in the lookup word to traverse one node at a time.  We need a way to tell whether the current node is a flattened node.  If so, we traverse as far into the node’s value as possible and return where we fell off, or, if we’re done traversing, return the node.  In practice, it’s a little more complex than that, and there are lots of tricky edge cases in the implementation.  However, I’ve done the hard work for you:

/* traverses tree based on text */
    this.traverse = function(textIn) {
        var node = this;
        var path = '';
        var j = 1;
        var dist = j;
        var chars = textIn.split('').map(function(c) { return c.toLowerCase(); });
        /* traversal of compressed trie */
        for (var i = 0; i < chars.length; i++) {
            var c = chars[i];
            if (node.flattened()) { // inside a flattened node's value
                if (node.c[j] === c) {
                    path += c;
                    dist = j;
                    j++;
                    continue;
                } else if (j < node.c.length) { // terminated early within the node
                    return [path, undefined, node, j];
                }
                j = 1; // consumed the whole value; drop down to the children
            }
            if (node.children[c]) {
                path += c;
                node = node.children[c];
            } else {
                return [path, undefined, node, dist];
            }
        }
        var prefix = path;
        if (node.flattened()) {
            prefix = path.substr(0, path.length - j);
        }
        return [prefix, node, node, dist];
    };

Essentially, this traversal algorithm navigates through the trie, character by character, for the word being looked up. We go as far as possible until we either fall off the trie or exhaust the characters in the word. When we finish, we return a 4-tuple (an array here):

  1. the path taken thus far
  2. the node traversed into
  3. if no node was traversed into, the last node seen
  4. the distance into the last node seen

With this information, we have everything we need to either remove the node, insert new nodes from this point, determine whether the word exists in the trie, or continue traversing to complete the suffix set.
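For instance, a caller might branch on the 4-tuple like this. This is a hypothetical sketch: only the tuple layout comes from the traverse method above; the helper and its return values are illustrative.

```javascript
// Hypothetical helper: decide what to do with the 4-tuple from traverse().
function interpret(tuple, word) {
    var path = tuple[0], node = tuple[1], last = tuple[2], dist = tuple[3];
    if (node && node.terminal && path === word.toLowerCase()) {
        return 'found';              // exact word is in the dictionary
    }
    if (node) {
        return 'prefix';             // word is a prefix; collect suffixes below node
    }
    return 'insert at ' + dist;      // fell off; split/insert at `last`, offset dist
}

// e.g. a tuple from a successful lookup of "cat":
interpret(['cat', { terminal: true }, { terminal: true }, 3], 'cat'); // 'found'
```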

Online vs Offline Construction

By following this approach, we’ve derived the compressed trie data structure.  However, we arrived at an offline algorithm for construction; that is, we need the entire corpus of words in the dictionary before we can construct the trie.  Once we have constructed the entire trie, we must traverse it again and flatten every branch, essentially doubling the work of construction.  On very large datasets, this can be costly.  Contrast this with an online algorithm, which can produce the desired output incrementally, as the input arrives.  The online algorithm for compressed prefix trie construction is very tricky: there are lots of edge cases, state to maintain so we know how far into the trie we’ve gone, and slices of prefixes to take so we can keep splitting nodes and swapping children.  I’ll just describe it here; you can see the implementation for the gory details.

Online Prefix Trie Construction Algorithm

  1. traverse as deep as we can go, using the above traverse method
  2. if we fall off the trie (i.e. traverse to a dead end)
    1. if we’re at a flattened node
      1. traverse as far into the node as possible, then create a new child node with the remainder of the word
      2. create a new child node with the remainder of the value that was in the node and point all existing children to that node
    2. if not
      1. insert a child node with the remainder of the word
  3. if we don’t fall off the trie
    1. if we reach a leaf node and have consumed all characters, simply set the node to terminal
    2. otherwise, determine how far we’ve gone in the current node
      1. create a child node with the remainder of the value of the node and point all existing children to that node
      2. create a child node with the remainder of the word
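The node-splitting step (3.2 above) is the heart of the online insert. Here is a hedged sketch, not the post’s actual code: the node shape ({ c, terminal, children }) mirrors the earlier snippets, while makeNode and splitAt are hypothetical helpers.

```javascript
// Hypothetical helper: build a bare node holding value `c`.
function makeNode(c, terminal) {
    return { c: c, terminal: terminal, children: {} };
}

// Split a flattened node at offset j; `remainder` is the unmatched tail of the
// word being inserted (empty if the word ends exactly at the split point).
function splitAt(node, j, remainder) {
    var tail = makeNode(node.c.substr(j), node.terminal);
    tail.children = node.children;           // existing children move to the tail
    node.c = node.c.substr(0, j);            // node keeps only the shared prefix
    node.children = {};
    node.children[tail.c[0]] = tail;
    if (remainder) {                         // 3.2.2: rest of the inserted word
        var leaf = makeNode(remainder, true);
        node.children[leaf.c[0]] = leaf;
        node.terminal = false;
    } else {                                 // word ends exactly at the split
        node.terminal = true;
    }
    return node;
}
```

For example, inserting "pinecone" into a node holding "pineapple" splits at j = 4: the node keeps "pine" and gains two children, "apple" and "cone".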


OK, so now for some real-world results (based on a Ruby implementation, hopefully to be open-sourced soon):

# Uncompressed
Dictionary size: 86036.  Running through creation 5 times.
            user     system      total        real
creation: 10.880000   0.290000  11.170000 ( 11.173094)
>avg:   2.176000   0.058000   2.234000 (  2.234619)
Dictionary size: 86036.  Running through prefix traversal 1000 times.
       user     system      total        real
prefix:  3.010000   0.010000   3.020000 (  3.009824)
   0.003010   0.000010   0.003020 (  0.003010)
Dictionary size: 86036.  Running through suffix traversal 10000 times.
       user     system      total        real
suffix: 12.020000   0.030000  12.050000 ( 12.050301)
   0.001202   0.000003   0.001205 (  0.001205)

Dictionary size: 86036.  Running through brute force traversal 1000 times.
       user     system      total        real
brute force: 17.540000   0.230000  17.770000 ( 17.774003)
   0.017540   0.000230   0.017770 (  0.017774)

# Compressed
Dictionary size: 86036.  Running through creation 5 times.
            user     system      total        real
creation:  4.760000   0.080000   4.840000 (  4.879213)
>avg:   0.952000   0.016000   0.968000 (  0.975843)
Dictionary size: 86036.  Running through prefix traversal 1000 times.
       user     system      total        real
prefix:  2.130000   0.010000   2.140000 (  2.149070)
   0.002130   0.000010   0.002140 (  0.002149)
Dictionary size: 86036.  Running through suffix traversal 10000 times.
       user     system      total        real
suffix:  7.660000   0.020000   7.680000 (  7.706711)
   0.000766   0.000002   0.000768 (  0.000771)

You can see that compressed trie construction is over 50% faster, prefix lookups are about 29% faster, and suffix lookups are about 36% faster.  This is without spending much time optimizing either implementation.

Finally, look at the memory differences:

# using the dictionary.json files

# Uncompressed
2.1.7 :005 > a.prefix_trie.size + a.suffix_trie.size
 => 575978 (# nodes)

# Compressed
2.1.7 :007 > a.prefix_trie.size + a.suffix_trie.size 
 => 229498 (# nodes)

Note that these totals already include both the prefix trie and the suffix trie (maintained for suffix matches).

Thus, we can see that our compressed trie is about 60% smaller than the uncompressed trie. Awesome.

The trie in all its splendor:

I also had some fun and implemented a simple print function that lets you see what the finished trie looks like.  I’ve included a sampling here:

* - terminal node
            do-chinese languages*
... and so on

Search Series Part VII – Autocomplete with Wild Cards

In the last post, I discussed using prefix tries for fast autocomplete of words within a dictionary.  In this post, I take it one step further and share some insights that we can glean from prefix tries on how to expand the autocomplete functionality by allowing wildcard search.  You can see the sample implementation here:

Search Part VII – Autocomplete with wild cards

If you recall, prefix trie construction required linear time: O(n), where n is the total number of characters across the words inserted into the trie.  We read in each word from the dictionary and inserted each character starting from the root.  From the prefix trie, we were able to traverse to the node for any prefix, then with DFS or BFS collect all completions stemming from that node.  In other words, we were able to quickly find all strings sharing a given prefix.

What if we flipped our intuition around and wanted to find all words that contained a particular suffix?  E.g. “ing”, “tion”, “lion”?  We can use the same approach for building the prefix trie to build a suffix trie.

Suffix Trie

A suffix trie follows the same structure as a prefix trie, with each node containing a one-character value and a set of child nodes, each pointing to a subtrie or to null.  Taking the insight from the prefix trie, we can intuitively see that a word’s suffix is essentially a prefix of the reversed word.  Hence:

“abcdefg” -> “g”, “fg”, “efg”, “defg”, etc. are all suffixes.  If we reverse the word to

“gfedcba”, we can construct the prefix trie out of the map of reversed words.  So in this case, “g” hangs off the root, followed by “f”, “e”, “d”, and so on.

Construction of the suffix trie is trivial, once we have the code in place for constructing the prefix trie.  From our dictionary, all that is needed is to generate a dictionary using each word in reverse, then pass it into our trie constructor.

/* Example: $scope.dictionary starts as
 * $scope.dictionary = {
 *     'aardvark': '...',
 *     'actor':  '...',
 *     'adulation': '...'
 * }
 */
        $scope.inverseDictionary = Object.keys($scope.dictionary).reduce(function(memo, key) {
            memo[key.split('').reverse().join('')] = $scope.dictionary[key];
            return memo;
        }, {});
/* $scope.inverseDictionary becomes:
 * {
 *    'kravdraa': '...',
 *    'rotca': '...',
 *    'noitaluda': '...'
 * }
 */

        $scope.root = toTrie(Object.keys($scope.dictionary));
        $scope.reverseRoot = toTrie(Object.keys($scope.inverseDictionary));

Suffix/Prefix Trie Insights

Note that a suffix trie is essentially a subset of a “suffix tree”, which contains every suffix of every word inserted into the data structure.  Compare this to the simpler approach I’ve taken here, which just contains each whole word in the dictionary.

Now, when the user types a query into the searchbox, we can scan our prefix trie to get the set of all words in the dictionary where the query is a prefix, and simultaneously retrieve the set of all words where the query is a suffix (that set of results will be reversed, so we need to reverse each match again).  This extends our base functionality by allowing users to specify a wildcard (e.g. ‘*’) in the query string, which tells us how to scan both tries.  I won’t talk about the trie traversal here; you can see an implementation in my past post: Search Series Part VI – Autocomplete & Prefix Tries


Prefix trie traversal - "act*", would yield ["actor", "acting", etc.]
Suffix trie traversal - "*act", would yield ["exact", "enact", etc.]

Finally, by maintaining both a prefix trie and a suffix trie, we can implement basic intermediate wildcard functionality (up to 1 wildcard).  All we need to do is find the set of prefix results and suffix results and intersect the arrays.

As an aside: why doesn’t this approach support multiple wildcards (e.g. int*me*tion, which would yield [“intermediation”, “intermention”, etc.])?  The answer lies in the implementation of the suffix trie, which only contains the entire word (as opposed to all suffixes).  If we tokenize int*me*tion into [“int”, “me”, “tion”], we see that “me” will not return any of the prefix/suffix results we expect, because with our given prefix and suffix tries we can only identify words that either begin with “me” or end with “me”, not those with “me” somewhere in the middle.  To add this functionality, the implementation would need to be expanded to full-blown suffix trees.  (Alternatively, we could use something more powerful, like ternary search trees.)


Prefix & Suffix Trie intersection: "inter*tion", yields ["intersection", "interaction", etc]

(i.e. all matching words where "inter" is a prefix and "tion" is a suffix)

Here’s a sample implementation:

var wildcard = function(textIn) {
        if (!textIn) { return; }
        var substrs = textIn.split('*');
        var p = textIn[0] === '*' ? 1 : 0;
        var s = textIn[textIn.length - 1] === '*' ? 1 : 0;
        var prefixes = substrs.length > 1 ? substrs.slice(p, substrs.length - 1 + s) : substrs;
        var suffixes = substrs.length > 1 ? substrs.slice(1 - p, substrs.length - s) : substrs;
        $scope.results = _(prefixes).map(function(prefix) {
            return $scope.autocomplete(prefix, $scope.root);
        });
        $scope.reverseResults = reverse(_(suffixes).map(function(suffix) {
            // reverse the suffix so it can be looked up in the reversed-word trie
            return $scope.autocomplete(suffix.split('').reverse().join(''), $scope.reverseRoot);
        }));
        $scope.intersectedResults = _.intersection($scope.results, $scope.reverseResults);
    };
Example wildcard search (inter*tion)
Example wildcard search (am*ing)

Search Series Part V – Search Optimizations (multi-word fuzzy match and highlighting)

In the previous search engine series post, I demonstrated how to implement a naive fuzzy match using Levenshtein distance.  I call it naive because the distance metric is based on matching the user’s query against individual words in the dictionary.  In today’s blog post, I show how to extend that functionality across multiple words, taking into account spatial locality (an overloaded term), regardless of where they occur in the original document.  I also show how to add contextual highlighting of the user’s query in the search results.

Foreword: I’ve also done some refactoring to clean up the code a bit.  I had begun writing the text search example using custom methods, but decided, for readability’s sake, to extend Array.prototype with some useful functionality.  Another library that does pretty much the same array transformations I need is lodash, but I opted not to include it (no shortcuts!).  However, where certain functions are implemented in standard JavaScript (ES5), I’ve used those (e.g. Array.prototype.filter).  I think the code reads more cleanly now.

Multi-word fuzzy matches

There’s no real secret to multi-word matches.  For those keeping up with this series, if you look at the code, you’ll notice that it all builds on the last iteration.  Multi-word search is no different.  Previously, I keyed the ngram index using a list of ngrams formulated from the user’s query.  Our search method followed this pattern:

user query > array of user query ngrams > array of fuzzy query ngrams > array of potential matches > sorted array of word matches

See this image below (which I spent way too long creating):

search in a nutshell

You can see from this diagram that when the user entered multiple words, we would never get good matches unless the words were adjacent.  The reason is simple: two different words will never have intersecting potential matches.  While they could have intersecting fuzzy potential matches, that overlap is erroneous signal.

The easy and obvious fix is to split the user’s query by word, then conduct a fuzzy search through the index using all ngrams associated with each word.  Much as we did for fuzzy searches with fuzzy ngrams, we move up one level of abstraction, to N searches.  This returns an array of arrays of Words, so we’ll need to flatten it and then arrange the results somehow.

Here is the sample implementation for this:
Note: as described above, I rewrote some of the methods to make this easier to read (fewer nested parentheses)

function getRankedWords(word) {
            var ngrams = Object.keys(ngramize({}, word, 0));
            // look up each ngram's precomputed fuzzy neighbors in the fuzzMap
            var fuzzyNGrams = {
                                    return $scope.fuzzMap[ngram];
                                })
                                .flatten()
                                .filter(function(ngram) {
                                    return ngrams.indexOf(ngram) === -1;
                                }); // only take unique fuzzy ngrams
            // pull candidate Words out of the ngram index (name assumed) for both sets
            var exact = { return $scope.index[ngram]; });
            var fuzzy = { return $scope.index[ngram]; });
            return rank(exact, fuzzy, word) || [];
        }
        if (!textIn || textIn.length < 3) return [];
        return textIn.split(' ').map(getRankedWords).flatten();

Ranking result sets

Since each search returns a set of exact and fuzzy matches, this implementation necessitates (as it did for fuzzy matches) another ranking scheme to arrange the results in order of relevance to the user.  In my implementation, I opted for a quick and dirty solution of ranking via spatial locality (I know this is an overloaded term related to caching schemes).  Essentially, based on the number of words of context, if any two results fall within the range, they form the same “result.”  Note that since we are no longer matching 1 word at a time, I created a new class called a Result, which contains the string matching the search result’s context, gave it a new score (just an inverse distance calculation) and also added highlighting to the context.

// Result: has a context, a score, and highlighted html
var Result = function(context, score, highlighted) {
    this.context = context;
    this.score = score;
    this.highlighted = highlighted;
};

With this, we define the interface for a search result, and are abstracted away from words.  You can see how using abstractions like this can be very powerful.  Instead of words, we may create indices that point to documents.  Then for each document, we can reference the correct index.  In this manner, we can have distributed indices across a cluster and get faster lookups than storing all indices in memory on one machine.

Here’s a sample implementation:

    // Rank sets of results
    // Algorithm:
    // Traverse through the flattened results; if the delta
    // between adjacent positions is < $scope.context, combine the results
    // and aggregate their scores (i.e. more relevant)
    var rankSets = function(sets) {
        function score(distance) { return distance === 0 ? 1 : 1 / (distance + 1); }
        var sorted = sets.sort(Word.prototype.sortByPosition);
        if (!sorted || sorted.length === 0) { return []; }
        var prev = sorted[0];
        var results = [];
        for (var i = 1; i < sorted.length; i++) {
            var curr = sorted[i];
            var _score = score(prev.distance);
            var lowPosition = prev.position;
            var highPosition = lowPosition;
            while (curr && curr.position - prev.position <= $scope.context) { // spatially close
                highPosition = curr.position;
                _score += score(curr.distance);
                prev = curr;
                curr = sorted[++i]; // consume the neighbor and advance
            }
            var _context = $scope.dict.slice(lowPosition - $scope.context,
                                             highPosition + $scope.context).join(' ');
            results.push(new Result(_context, _score, highlight($scope.textIn, _context)));
            prev = curr;
        }
        return results.sort(Result.prototype.sortByScore);
    };

You can see that the ranking algorithm is simple.  I start by sorting the array of Words by position in the document.  Recall that position in a Word is a pointer to its location in the document.  Since we already computed a distance metric per Word compared with the corresponding word in the user’s query, we can reuse that computation to form a “score.”

The score method itself is also very simple.  It’s just an inverse distance metric, such that “exact matches” return a score of 1, whereas imprecise matches return a score of 1/(distance +1).

Finally, as I traverse through the array of Words, when I find two Words whose positions (in words) are within the context range, I extend the high position used to compute the context within the document, and I aggregate the scores of the adjacent Words.  If a Word has no neighbors within that range, the most recent positions are used to compute the Result; for an isolated Word, this just returns the same context as before.  Where a Word had neighbors within range, this returns a context with a max range of (context * 2 + (highPosition - lowPosition)) words.  We could truncate this to make all contexts the same size (but I opted not to).

Finally, since we’ve been iterating through our results and combining Words to make new contexts, we have to sort the set by score.

Highlight all the things

Highlighting is also a pretty easy add.  When I instantiate a Result and push it into our results array, I have all the information I need to compute whether the result should contain highlighting.  Simply by breaking apart the user’s query and using a string replace, we can return an HTML fragment ready to be rendered by the browser.

Sample implementation:

    var highlight = function(textIn, context) {
        return context.replace(
            new RegExp("(" + textIn.split(' ').join('|') + ")", "gi"),
            '<span class="highlight">$1</span>'); // wrapping markup is illustrative
    };
That’s pretty much all it requires with this implementation.

multi-word fuzzy matches with highlighting

Further optimizations:

  • As this series is only meant to demonstrate how to implement some basic textual search techniques, it still doesn’t optimize for performance.  We are still using extra space with our ngram index (storing each word) and are using O(n^2) to intersect the arrays of matches.  It would be better to keep our indices sorted so we can have faster intersection time.
  • Distributed indices would be a good idea.  I briefly mentioned this above, but we have all the tools necessary such that we can also create indices of ngrams to documents instead of words.  We can then use these ngram indices to filter out all documents that don’t match a user’s query, then use a pointer to another index to do the textual search within the document.
  • Better ranking metrics.  My ranking algorithm is very basic, based purely on distance and spatial locality.  We could do something more useful, like scoring a Result differently depending on how many words apart each Word was.  Also, because we reuse the context variable, our maximum lookup range is small and likely wouldn’t even cover a sentence.

Here’s the full implementation so you can play around with it: Search Engine Series V – Multi-word matches and highlighting

Search Engine Series IV – Fuzzy Matching

In the last part of the series, I introduced the concept of adding context around search results using pointers to the reference in the original document.  While this works well for exact pattern matching, it fails for scenarios where the user may have mistyped, misspelled or doesn’t type out the entire query.  In this entry, I’ll explain how to add some basic fuzzy matching to your full text search.

Why Fuzzy Search?

Essentially what we’ve implemented in parts I – III of the search engine series is a less powerful version of grep. Cool in some ways, but we’re trying to achieve a different goal here.  Note that grep allows regular expression matching which we could implement as well.  The way this would work is rather than breaking apart the user’s query into ngrams then keying off our hash to merge the matches, we could iterate through the trigrams we found to see which matched the user’s regular expression, then merge the matches.

Instead, we are implementing a basic version of search through documents, so we want some fuzziness feature.  Fuzziness naturally introduces imprecision into textual search; think of it as a dial we can turn up or down.  In order to implement it, we also need a weighting function that compares similarities between strings, to determine whether we are under our fuzziness threshold.

I’m going to use a common distance metric called Levenshtein distance (but you could substitute your own).  Levenshtein distance essentially measures the number of transformations (add a character, delete a character, or change a character) to transform string A into string B. Example:

String A: abcdeeefg

String B: accdefg

String B can be converted into String A by substituting the first ‘c’ for ‘b’ and adding ‘ee’ after the first ‘e’; in other words, String A and String B have a Levenshtein distance of 3.  Random shoutout: you can use my other website to diff two strings and see this type of comparison!  Coolness!!  (Note: the implementation is completely different.)

Back to Levenshtein distance.  The slow implementation for computing this distance metric is fairly trivial, but conveys the point:

/* What this is doing:
 * Recursively computing levenshtein distance on substrings of decreasing size.
 * If the characters in the first position are equivalent, distance is 0, else
 * distance is 1.  Note the prime opportunity for optimizing w/ dynamic programming.
 */
function levenshtein(a, b) {
    if (a.length === 0) return b.length;
    if (b.length === 0) return a.length;
    return Math.min(
        levenshtein(a.substring(1), b) + 1,
        levenshtein(b.substring(1), a) + 1,
        levenshtein(a.substring(1), b.substring(1)) + (a[0] === b[0] ? 0 : 1));
}

How to map distance to fuzziness

This is all well and good, but how do we use Levenshtein distance for matching?  Here’s where the science gets a little imprecise (at least in my mind).  There are really two places where we can use Levenshtein distance to compare the search query with our result set: (1) as we break apart the user’s query into ngrams, we can find other ngrams with a distance < fuzziness; and (2) when we return a set of results that may be close but no cigar, we can determine whether the matching word and the query have a distance < fuzziness.  In my implementation, I do both.

Fuzzy Ngrams

We have to be careful with (1): depending on the size of our ngram (trigrams in our case), if the fuzziness is set too high, we may end up being overinclusive in the fuzzy ngram matches against which we find our result set.  Proof: any ngram has a maximum Levenshtein distance of n from another ngram of equal size n.  So in my naive implementation, I only select ngrams where the distance <= 1 (note that since our trigrams form a set, the distance will never == 0).

It behooves us to precompute the hash of trigrams to their fuzzy matches, because the alternative would be iterating through all keys in the trigram hash at query time, something far too expensive to do in real time.

    var computeFuzz = function(hash, fuzziness) {
        var ngrams = Object.keys(hash);
        $scope.fuzzMap = ngrams.reduce(function(fuzzMap, key) {
            fuzzMap[key] = select(ngrams, function(k) {
                return getEditDistance(k, key) <= fuzziness;
            });
            return fuzzMap;
        }, {});
    };

Fuzzy Matches

Then, when merging the results, we need to determine how to sort our results.  It turns out we can also use Levenshtein distance here to rearrange the results in an order that makes sense, but this ranking can be whatever you want.  My implementation simply stores and sorts by Levenshtein distance to the user’s query.  If the distance is greater than some fuzziness factor, we can just throw out the result.  Note: In order to speed up the sorting, since we know exact matches are going to be the most precise (yet may have a large distance because the matches are substrings), we always push those to the front of the results and sort the remainder of the results returned from the fuzzy search.

Additionally, because fuzzy matches may overlap with exact pattern matches, we have to transform the search results into a set (i.e. call unique() on the result set).

var rank = function(exact, fuzzy, textIn) {
        var sorted = unique(select(flatten(fuzzy).map(function(word) {
            return new Word(word.value, word.position, getEditDistance(word.value, textIn));
        }), function(word) {
            return word.distance <= $scope.fuzziness;
        }), Word.prototype.uniqueWord).sort(Word.prototype.sortByDistance);
        return unique(merge(exact).concat(sorted), Word.prototype.uniqueWord);
    };


In my implementation, I added a few details for precomputation/speed.

  • In order to figure out what fuzzy trigram matches there are, after I build the initial trigram index, I iterate through the trigrams creating a second map keyed on the trigram to its fuzzy matches.
    • Example: “abc” -> [“abd”, “bbc”, …]
  • After splitting the user’s query into trigrams, I get the set of trigrams pertaining to a fuzzy match, then filter out the non-fuzzy trigrams.
  • I use a faster implementation of Levenshtein distance that takes advantage of dynamic programming.
  • When reordering the results, I map a new distance from each potential fuzzy match to the original search query.  I only do this for fuzzy matches, to reduce computation time.
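The dynamic-programming version mentioned in the last bullet can be sketched like this, using two rolling rows instead of the full distance matrix:

```javascript
// O(a.length * b.length) Levenshtein distance with two rolling rows.
function levenshteinDP(a, b) {
    var cols = b.length + 1;
    var prev = [], curr = [];
    for (var j = 0; j < cols; j++) { prev[j] = j; }  // distance from empty prefix of a
    for (var i = 1; i <= a.length; i++) {
        curr[0] = i;                                 // distance from empty prefix of b
        for (var k = 1; k < cols; k++) {
            curr[k] = Math.min(
                prev[k] + 1,                         // delete from a
                curr[k - 1] + 1,                     // insert into a
                prev[k - 1] + (a[i - 1] === b[k - 1] ? 0 : 1)); // substitute (or match)
        }
        prev = curr.slice();
    }
    return prev[cols - 1];
}

levenshteinDP('abcdeeefg', 'accdefg'); // 3, matching the example above
```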

See the full implementation here: Search Part IV: Fuzzy Search

Disclaimer: As usual, things could be sped up and optimized, but I tried to keep things simple (e.g., using a simple distance calculation and simple ranking algorithm) for readability.

Fuzzy Matching example
Fuzzy matching sample. Try playing around with the fuzziness tolerance. Also: I’m getting fancy with CSS. Watch out!


  • There is a large variety of distance formulae out there.  You might get better results with one or another.
  • Trigrams are pretty standard for full-text search, but if fuzziness is a real concern, it might be worth trying quadgrams with a distance of two. However, I suspect there would just be too many quadgrams and the search would slow down.

Search Engine Series Part III – Contextual Search

In part II of this search series, I covered exact pattern matching.  Through a clever use of array intersection, we were able to eliminate false positives/irrelevant results when matching a user’s query.  However, that ability in itself is not very useful if context cannot be provided around the search.  In part I, I covered generating your ngram index.

In this post, I discuss how to provide context (i.e. words before and after) with each “hit”.

How to do it

Again, as with adding exact pattern matching, adding context to your search isn’t difficult.  Before, our ngram data structure keyed a trigram to a set of words matching the ngram.  However, we lost the place within the original document where the word could be found.  If we follow that approach, for every hit returned from our search, we would have to traverse the entire document for matches, then return those results.  Not good.

Instead, it’s better to keep pointers to the relevant word in the document.  Since words can be repeated per document, we can simply pass the word’s index within the split document into a new Word object.  Here’s the class:

// Word: has a string value and an int position
var Word = function(value, position) {
    var self = this;
    this.value = value;
    this.position = position;
    this.equals = function(word2) {
        return self.value == word2.value && self.position == word2.position;
    };
};

Then we need to modify the method ngramize to adhere to our new data structure.  Instead of pushing onto our array references to the word, we need to instantiate a new Word() object for each ngram match.

// NGramize a word
function ngramize(hash, word, index) {
    for (var i = 3; i < word.length + 1; i++) {
        var s = word.substring(i - 3, i);
        if (!hash.hasOwnProperty(s)) {
            hash[s] = [new Word(word, index)];
        } else {
            // todo: still eliminate dupes
            hash[s].push(new Word(word, index));
        }
    }
    return hash;
}

Note that this method has remained nearly the same as in the previous implementation.

Finally, we now need to rewrite merge to intersect all arrays of matches based on equality of Word. To do this, I wrote a simple method called equals that lets one Word be compared to another. A Word is equivalent to another Word if the value and position are identical (i.e. they are the same reference within the document).
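The `intersect` helper itself is not shown above; a minimal sketch, assuming the `Word.equals` method just described, could look like this:

```javascript
// Sketch of intersect: keep only the Words in `a` that have an equal Word
// (same value and same position) somewhere in `b`. Quadratic, as discussed
// in the improvements section below.
var intersect = function(a, b) {
    return a.filter(function(wordA) {
        return b.some(function(wordB) {
            return wordA.equals(wordB);
        });
    });
};
```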

Finally, our merge method is rewritten to take advantage of this intersect method.

    var merge = function(arrays) {
        if (!arrays || !arrays.length) return [];
        // Fold intersect across all match arrays, starting from the last one
        var base = arrays.pop();
        return arrays.reduce(intersect, base);
    };

See how the merge method has been simplified? Awesome!

End result

The end result is instead of getting an array of strings (i.e. word matches) based on the user’s query, we now get an array of Words. Since each Word has a pointer to the word in question in the original document, all we need to do is take a slice of the original document based on how much context the user wants and join them together again (using a space).
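That slicing step can be sketched as below. The helper name `withContext` and the `contextSize` parameter are hypothetical; the point is that the Word's position indexes directly into the split document:

```javascript
// Hypothetical sketch: given the split document and a matched Word, return
// `contextSize` words of look-behind/ahead around the match, rejoined with spaces.
function withContext(documentWords, word, contextSize) {
    var start = Math.max(0, word.position - contextSize);
    var end = Math.min(documentWords.length, word.position + contextSize + 1);
    return documentWords.slice(start, end).join(' ');
}
```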

Contextual matching with # word look-behind/ahead

For a sample implementation: see here: Search Part III: Contextual Search

Disclaimer: Again, this is just a toy implementation! To productionize properly, you may want to implement a few of the other features mentioned below.

Improving Search Further

If you’re still with me, you’ll find a lot of opportunities for optimization. I haven’t added them in my toy implementation, but they wouldn’t be hard to do. As the original document grows, these enhancements become more and more necessary, but I’ve kept things simple for readability.

A few potential areas for improvement:

  1. Space efficiency.  Notice how we’re storing the value of the word (i.e. the string) in each Word.  That’s really wasteful!  Words can be long (e.g. “supercalifragilisticexpialidocious”), so we could instead store pointers to words, just as we store pointers to positions within the original document.
  2. Smarter precomputation of the ngram index.  We can keep the arrays sorted for faster intersection!  Insertion into an array during the initial computation of the ngram hash is O(1) right now (we just push to the end), so this trades slower initial computation for faster recombination later on.  Read: keeping sorted ngram arrays requires binary insertion of each new Word, which is done in O(log n) time.
  3. Faster array intersection.  Example: if we store pointers to words instead of the actual words themselves, sorted arrays can be intersected in O(n) time instead of O(n^2).  That lends itself to an awesome speedup!
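The third point can be sketched as follows, under the assumption that each ngram's matches are stored as sorted integer positions rather than Word objects:

```javascript
// Sketch: linear-time intersection of two sorted arrays of word positions.
// Assumes both arrays are sorted ascending, which the sorted-insertion
// idea in point 2 would guarantee.
function intersectSorted(a, b) {
    var out = [], i = 0, j = 0;
    while (i < a.length && j < b.length) {
        if (a[i] === b[j]) { out.push(a[i]); i++; j++; }
        else if (a[i] < b[j]) i++;
        else j++;
    }
    return out;
}
```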

Note: I’ll implement these things at some point in the future, so you have something to look at.

And of course, on top of all these things, we could add run-of-the-mill features like caching, keeping your index in memory, distributing your indices, etc.

Next Up:

Finally, what else can we do to improve search?  Well, right now our search is case-sensitive, which isn’t good.  We’ll want to fix that.  Furthermore, our search requires exact matching and doesn’t allow for “fuzziness” in the query.  I don’t know about you guys, but I have a hard time spelling things correctly.  My fingers move too fast for the keyboard to keep up; it’s a problem stemming from the AIM era, so forgive me 🙂

We can also add contextual highlighting within our results.  This is a big visual help that lets a user know where within each search result the query was found.

See you next time!

Search Engine Series Part II – Exact pattern matching

In Part I of the series, I described how to construct ngram indices for a body of text.  We saw, however, that ngram matching, while quick, is imprecise because it matches too greedily.  By parsing the query into ngrams as well, we ended up getting matches the user would not have been looking for.

In this part II, I discuss how to achieve exact pattern matching.  Relevance is more than a function of matching a query to a list of results; there is a psychological component.  The gist of surfacing relevant data is thus returning results that are in the realm of expectations for a user.  In other words, while a result set can be filtered in a technical sense, a technical match is no good if it is out of the realm of expectation for the user.  For example, a simple weighting function like Levenshtein distance may yield a technically high-scoring result, but if the minimal string transformations yield an unexpected result, the result is no good (e.g. “and” and “end”).  We can minimize this effect by naively enforcing exact pattern matching.


In part I, parsing the user’s query and piping it through the ngram map yielded an array of words matching any ngram. We stripped the array of duplicates and returned the entire set to the user. However, we saw many false positives returned along with the good results. How do we eliminate those false positives?  It turns out that achieving exact pattern matching is pretty simple and elegant. Each ngram array returned (e.g. for the query “governmen”) included an array of results that could be potential matches. In order to eliminate extraneous non-matches (like “fundamentally” or “settlement”) which match on “men”, we simply need to take the intersection of all result arrays.  This yields a precise match against the string.

var merge = function(arrays) {
        if (!arrays || !arrays.length) return [];
        var matches = {};
        // Count how many ngram arrays each word appears in
        arrays.forEach(function(array) {
            array.reduce(function(matches, word) {
                if (!matches.hasOwnProperty(word)) {
                    matches[word] = 1;
                } else {
                    matches[word] += 1;
                }
                return matches;
            }, matches);
        });
        // Keep only the words that intersect all arrays
        return select(Object.keys(matches), function(word) {
            return matches[word] === arrays.length;
        });
};


Why does this work?  Essentially, with our prior greedy approach, as we parsed the user’s query and filtered our hash, keying on each trigram that was created, we were left with an array of arrays of all potential matches.  While this represented the complete set of matches, it also created situations where, to qualify as a “result”, a word only needed to contain any trigram in the query.  Thus, each word within our filtered results represented a partial match of the user’s query.  Ergo, to match a user’s query exactly, each word must match every ngram within the user’s query.  How do we find the list of words matching all trigrams within the user’s query?  Simple: just intersect the arrays.

See the implementation here: Part II: Exact Pattern Matching

Usual disclaimer applies: Keeping it simple for the sake of explanation, could be refactored, yadda yadda yadda.

Much better matching

Now we’re getting somewhere.  However, our search is still incomplete.  It’s not useful to simply return the word that matches the user’s query.

  • We want to return some context (i.e. the words before and after) the exact pattern match so the user can determine whether the result is useful or not.
  • We can also optimize our data structure for computeNGram.  Right now we store only unique matches per ngram, which was OK when determining if there is an exact match.  However, in order to store context, we can no longer quash duplicates.
  • We can optimize performance as well in order to still provide fast matches in the above bullet.

Stay tuned for part III.

Search Engine Series Part I – Basic ngram indexing

At Twitter, I’ve been spending a lot of time thinking about search engines and how to effectively index and surface data.  When it comes down to it, indexing methods matter a lot, but so do some other “brute force” factors like:

  1. the quality of the data indexed
  2. the amount of data indexed
  3. transforming search queries through an effective DSL
  4. surfacing and ranking search results

Search has no silver bullet

I’ve been learning more and more that there’s no silver bullet for search.  There are several methods for indexing volumes of data, e.g. string tokenization and ngram tokenization; there are levels of strictness of search, e.g. exact matches, fuzzy matches, and partial matches; and then there’s transformation of the search query itself, e.g. which indices to search, which parts of a query must be included in search results, fuzzy matching, and misspellings.

In this next series of posts, I’ll be describing some search basics that may be useful to the reader.

Computing NGram indices

Ngram indices are a good starting point for creating a searchable index of words.  I’ve included an example implementation here: Part 1: Constructing a Trigram index

Disclaimer: In this example, I use a toy implementation and trigrams for the sake of simplicity.  For the dictionary “corpus”, I use the Declaration of Independence.  The code examples are meant to only illustrate the basic implementation details and aren’t meant for production use (however, the code will be refined over time).

Constructing ngram indices is very simple – essentially one constructs a map of n-character length substrings from a dictionary of words, where each ngram maps onto the array of matching full words.  Different techniques can then be applied on top of this, including

  1. Scoring functions – for example, score each matching word in a document by a basic distance b, where b is the number of characters that match
  2. Fuzzy matching – replace one character for each character of the trigram to generate a list of other potential matches
  3. Context – keep pointers to place in document to determine the contextual sentence

    // Compute NGrams
    var computeNGrams = function(dict) {
        return dict.split(' ').reduce(function(hash, word) {
            return ngramize(hash, word);
        }, {});
    };

    // NGramize a word
    function ngramize(hash, word) {
        for (var i = 3; i < word.length + 1; i++) {
            var s = word.substring(i - 3, i);
            if (!hash.hasOwnProperty(s)) {
                hash[s] = [word];
            } else if (hash[s].indexOf(word) == -1) {
                hash[s].push(word);
            }
        }
        return hash;
    }


You can see that creation is simple:  Essentially, you just pass in a word and traverse through it, creating all substrings of length N, where N is the “N” in “NGram”.  In this case, I’ve chosen to use trigrams, or NGrams of length 3.

As the user types in his/her query, we break that query apart into ngrams as well, and match against our ngram map (i.e. the corpus of data).  This gives us an array of arrays of matching words per query.  Then we simply flatten the array, remove duplicates and present the results to the user.
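That query path can be sketched as below, assuming a `trigramIndex` map shaped like the one `computeNGrams` builds (the `search` function name is my own):

```javascript
// Sketch: break the query into trigrams, look each one up in the index,
// flatten the array-of-arrays, and dedupe the matching words.
function search(trigramIndex, query) {
    var results = [];
    for (var i = 3; i < query.length + 1; i++) {
        var gram = query.substring(i - 3, i);
        if (trigramIndex.hasOwnProperty(gram)) {
            results = results.concat(trigramIndex[gram]);
        }
    }
    // Remove duplicates before presenting results
    return results.filter(function(word, idx) {
        return results.indexOf(word) === idx;
    });
}
```

Note this is deliberately the greedy behavior: any word sharing any trigram with the query comes back, which is exactly the problem the next post fixes.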


Too many matches

However, the astute reader will notice that matching ngrams to arrays of occurrences does not provide a useful search in itself whenever the query extends beyond the ngram length.  How do we solve for this?  The solution, it turns out, is fairly simple.  Stay tuned for the next post.