Search Engine Series Part I – Basic ngram indexing

At Twitter, I’ve been spending a lot of time thinking about search engines and how to effectively index and surface data.  When it comes down to it, indexing methods matter a lot, but so do some other “brute force” factors like:

  1. the quality of the data indexed
  2. the amount of data indexed
  3. transforming search queries through an effective DSL
  4. surfacing and ranking search results

Search has no silver bullet

I’ve been learning more and more that there’s no magic bullet to search.  There are several methods for indexing volumes of data, e.g. string tokenization and ngram tokenization; there are levels of strictness of search, e.g. exact matches, fuzzy matches, and partial matches; then there’s transformation of the search query itself, e.g. which indices to search, which parts of a query must appear in the results, and how to handle fuzziness and misspellings.

In this next series of posts, I’ll be describing some search basics that may be useful to the reader.

Computing NGram indices

Ngram indices are a good starting point for creating a searchable index of words.  I’ve included an example implementation here: Part 1: Constructing a Trigram index

Disclaimer: In this example, I use a toy implementation and trigrams for the sake of simplicity.  For the dictionary “corpus”, I use the Declaration of Independence.  The code examples are meant to only illustrate the basic implementation details and aren’t meant for production use (however, the code will be refined over time).

Constructing ngram indices is very simple – essentially one constructs a map of n-character substrings from a dictionary of words, where each ngram maps onto the array of matching full words.  Different techniques can then be applied on top of this, including:

  1. Scoring functions – for example, rank each candidate word by a basic distance b, where b is the number of characters that match the query
  2. Fuzzy matching – substitute each character of a trigram in turn to generate a list of other potential matches (a rough sketch appears after the code below)
  3. Context – keep pointers back to each word’s place in the document so the surrounding sentence can be shown with the result

    // Compute the ngram index: map each trigram onto the list of words
    // in the dictionary that contain it
    var computeNGrams = function(dict) {
        return dict.split(' ').reduce(function(hash, word) {
            return ngramize(hash, word);
        }, {});
    };

    // NGramize a word: slide a window of length 3 across the word and
    // record the word under each trigram it produces
    function ngramize(hash, word) {
        for (var i = 3; i < word.length + 1; i++) {
            var s = word.substring(i - 3, i);
            if (!hash.hasOwnProperty(s)) {
                hash[s] = [word];
            } else if (hash[s].indexOf(word) === -1) {
                hash[s].push(word);
            }
        }
        return hash;
    }
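
To make item 2 above a bit more concrete, here is a rough sketch of how fuzzy candidates could be generated for a single trigram.  The generateFuzzyNGrams helper and the lowercase alphabet are assumptions I’m making purely for illustration; they aren’t part of the index-building code above.

    // Sketch only: substitute each position of a trigram with every letter
    // of a (hypothetical) lowercase alphabet to produce "one character off"
    // variants, which can then be looked up in the same index.
    var ALPHABET = 'abcdefghijklmnopqrstuvwxyz'.split('');

    function generateFuzzyNGrams(ngram) {
        var variants = [];
        for (var i = 0; i < ngram.length; i++) {
            ALPHABET.forEach(function(c) {
                var candidate = ngram.substring(0, i) + c + ngram.substring(i + 1);
                if (candidate !== ngram && variants.indexOf(candidate) === -1) {
                    variants.push(candidate);
                }
            });
        }
        return variants;
    }

    // e.g. generateFuzzyNGrams('the') -> ['ahe', 'bhe', ..., 'tha', ..., 'thz']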

Algorithm

You can see that creation is simple:  for each word, you traverse it and generate every substring of length N, where N is the “N” in “NGram”.  In this case, I’ve chosen trigrams, or NGrams of length 3.
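
As a quick usage example (a hypothetical input, not the actual corpus), running the builder over a short string produces one entry per trigram:

    var index = computeNGrams('we the people');
    // 'we' is shorter than 3 characters, so it contributes no trigrams;
    // index -> { the: ['the'], peo: ['people'], eop: ['people'],
    //            opl: ['people'], ple: ['people'] }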

As the user types in his/her query, we break that query apart into ngrams as well, and match against our ngram map (i.e. the corpus of data).  This gives us an array of arrays of matching words per query.  Then we simply flatten the array, remove duplicates and present the results to the user.
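
A minimal sketch of that query path might look like the following; the search function name and its de-duplication step are my own assumptions for illustration, not code from the original post:

    // Sketch only: split the query into trigrams, look each one up in the
    // index, then flatten and de-duplicate the matching words.
    function search(index, query) {
        var matches = query.split(' ').reduce(function(acc, word) {
            for (var i = 3; i < word.length + 1; i++) {
                var s = word.substring(i - 3, i);
                if (index.hasOwnProperty(s)) {
                    acc = acc.concat(index[s]);
                }
            }
            return acc;
        }, []);
        // Keep only the first occurrence of each matching word
        return matches.filter(function(word, pos) {
            return matches.indexOf(word) === pos;
        });
    }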

Example:

[Screenshot: Too many matches]

However, the astute reader will notice that matching ngrams to arrays of occurrences does not provide a useful search in itself whenever the query extends beyond the ngram length.  How do we solve this?  The solution, it turns out, is fairly simple.  Stay tuned for the next post.
