Search Series Part VI – Autocomplete & Prefix Tries

I’ve spent a lot of the last posts demonstrating how to perform full-text fuzzy searches, highlighting, and index generation, and talked about some of the pros and cons of ngram indexing.  This post departs from search and focuses on solving a different problem: autocomplete.  As I mentioned in the first post, there is no silver bullet to search relevance. At Twitter, this means trying to tackle relevance from many angles – index generation, manual ranking of documents, query classification, etc.  Another means of getting better relevance is by trying to guide the user’s query to known queries with good sets of results.  In this post, I’ll demonstrate how to generate build out an autocomplete functionality that can be used to get the user to the right query.

See the sample implementation here: http://Search Series Part VI – Autocomplete

Why autocomplete?

This should be pretty obvious to most readers.  You’ve most likely encountered autocomplete in many places.  For example, when you type into Google’s search bar, you’ll get a list of suggested queries.  When you begin typing an address to Google maps, you’re presented with a list of possible matches.  The google-esque type of functionality probably requires a different solution altogether than the solution I’m proposing.  This post, on the other hand, describes one possible implementation of a way to provide some (but not all) autocomplete functionality.  A slightly non-obvious benefit to autocomplete is that adding autocomplete functionality not only serves as a convenience to the end user, but also is a forcing function that helps to drive search relevance.  By guiding the user to a set of searches with a set of high quality search results, one can help guide the user to those results through query modification.

Trie trie again

A trie (pronounced “try”, as in retrieval) is simply a tree-like data structure where each node in the tree can have 0 -> N leaves.  The familiar binary tree, for example, is a trie which imposes the restriction that each node has at most 2 leaves (a left and right leaf). Prefix tries on the other hand are constructed such that each path to the root is a prefix of the path to the node, where each node contains a character and a set of children.  In other words, if we created a prefix trie as such:


c > o > [d > e, k > e]

In this trie, “code” is one potential path, and each path to the root, “cod”, “co” and “c” are all prefixes of “code”.  However, not all of those prefixes are valid words, so the prefix trie must be able to distinguish between good and bad entries.  In my sample application, I use this use this data structure to efficiently store an entire dictionary, with the guarantee that there is only one path per entry (autocomplete result).  To get to the list of results that can match, I traverse the prefix trie as far as I can go with the user’s query, then use various traversal techniques to find all suffixes for a particular prefix.

Here’s a sample implementation for a prefix trie:

var Node = function(c, parent) {
    var self = this;
    this.c = c; // the value of the node
    this.children = {}; // a map representing the children
    this.parent = parent; // a pointer to the parent
    this.terminal = false; // whether there is a word at this node
    
    /* insert a word at this node */
    this.insert = function(word) {
        if (!word || word.length === 0) { return; }
        ch = word[0];
        if (!self.children.hasOwnProperty(ch)) {
            self.children[ch] = new Node(ch, self);
        }
        var n = self.children[ch];
        if (word.length === 1) { self.children[ch].terminal = true; }
        n.insert(word.substring(1));
    };
};
    

A few words about this implementation:

1) Node – Each node in my implementation has one character as the value and a set of children.  Prefix tries can be constructed with a variety of data structures to represent children for each node (e.g. arrays, linked lists, maps).  In my implementation, I opt to use hash maps for space optimality and constant speed lookup.  However, other popular implementations use arrays or sorted linked lists, which enables a fast sorting of a set of words through a preorder depth first search of the prefix trie (described below).
2) terminal: Boolean – For each word inserted, a path is created from the root node to the leaf node. Thus, it can be assumed that any path from the root node to a leaf node without children is a valid entry. However, there are valid entries that are prefixes of other entries, so there must be a way to distinguish between a valid and invalid path. Hence the terminal flag.
3) duplicates are not stored in the prefix trie (ergo, each path represents a unique result).  We could easily store the number of occurrences of each prefix in the trie.

And here are some interesting properties of prefix tries:
1) Prefix tries can be faster than hash maps for word look up – The longest path to look up a word in the trie takes O(m) time, where m is the length of the word. For a hash map, with a bad hashing function, you could take in the worst case scenario O(n) time, where n is the number of entries in the hash map.
2) Suffix detection – any path to a parent node represents a prefix of all paths through the parent node. Thus, it’s possible (and necessary) to construct an array of all suffixes stemming from the prefix and autocompleting a given result.
3) Sorting – as described above, prefix tries can be used to sort a set of words so long as the children keys are sorted. In this implementation I use hash maps, so the keys are not sorted by default (but we could opt for a different data structure).  Once the prefix trie is constructed, one only needs to traverse via DFS.

4) Prefix tries can be space efficient – Since only all unique prefixes need to be stored in the trie, we don’t need to store needless duplicate keys.

Prefix Trie Traversal

As with typical trees, one can opt for different methods of traversal to construct substrings from a parent node. I’ve provided a few different implementations below.

    /* BFS */
    this.traverseBFS = function(array, prefix, limit) {
        var self = this;
        var queue = [];
        queue.push([self, prefix])
        while (queue.length > 0) {
			var entry = queue[0];
            var n = entry[0];
            var p = entry[1];
			if (n.terminal) {
                array.push(p + n.c);
                if (limit > 0 && array.length >= limit) {
                    break;
                }
            } 
            for (var k in n.children) {
                queue.push([n.children[k], p + n.c])
            }
            queue = queue.splice(1);
        }
    };
    
    /* DFS */
    this.traverseDFS = function(array, prefix, limit) {
        var self = this;
        var keys = Object.keys(self.children);
		for (var i = 0; i < keys.length; i++) { if (limit > 0 && array.length >= limit) {
                break;
            }
            self.children[keys[i]].traverseDFS(array, prefix + keys[i], limit);
        }
        if (!self.terminal || (limit > 0 && array.length >= limit)) {
            return;
        }
        array.push(prefix);
    };

BFS (breadth first search) – With breadth-first search, as we traverse to each child node, we invoke the traverse function call for each child nodes’ children. This is equivalent to traversing by visiting all nodes at each depth before traversing each next level node. This returns matching autocomplete results in order of length. One gotcha here: since BFS uses a queue to control the order of the nodes to visit, we need a way to maintain the prefix up to each node. While each node contains a pointer to the parent, so we could conceivably traverse up to the root node, this would require up to M moves (where m is the length of the word) repeated N times (for each result), resulting in O(MN) time to construct the autocomplete list.

Alternatively, I keep track by simply enqueueing a tuple of the [Node, prefix: String] each time, such that we can construct the autocomplete list by traversing to each node once in O(N) time.

Screen Shot 2015-10-17 at 11.57.04 PM

DFS (depth first search) – With depth-first search, we traverse to the deepest node before starting to recurse. As described above, this method is useful for radix sort if the keys are kept in sorted order (they aren’t in my example). As with normal DFS, we can traverse the tree preorder, inorder or postorder. You can see from my example that I traverse as far left as possible until the left node is a leaf node, then I visit that node, then visit each sibling node called with the prefix up to that point.

depth first search

Some thoughts:

  • Without any intelligence (algorithmic or otherwise), autocomplete only has limited utility in cases where the result set is large.  You can see in this instance, where there are about 109k words in the dictionary, that limiting the user to 10 results probably wouldn’t get the user to the desired query.
  • This current implementation doesn’t account for mispellings in the word or any leniency (e.g. leaving characters out, adding extra characters).  However, we can combine elements of the ngram text search discussed in the first 5 parts of this series. Essentially, the first ngram of a word is also a prefix of the word, so we can use the prefix trie to do some quick filtering, then return a set of fuzzy matches.
Advertisements

One thought on “Search Series Part VI – Autocomplete & Prefix Tries

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s