Search Series Part VI – Autocomplete & Prefix Tries

I’ve spent a lot of the last posts demonstrating how to perform full-text fuzzy searches, highlighting, and index generation, and talked about some of the pros and cons of ngram indexing.  This post departs from search and focuses on solving a different problem: autocomplete.  As I mentioned in the first post, there is no silver bullet to search relevance. At Twitter, this means trying to tackle relevance from many angles – index generation, manual ranking of documents, query classification, etc.  Another means of getting better relevance is by trying to guide the user’s query to known queries with good sets of results.  In this post, I’ll demonstrate how to generate build out an autocomplete functionality that can be used to get the user to the right query.

See the sample implementation here: http://Search Series Part VI – Autocomplete

Why autocomplete?

This should be pretty obvious to most readers.  You’ve most likely encountered autocomplete in many places.  For example, when you type into Google’s search bar, you’ll get a list of suggested queries.  When you begin typing an address to Google maps, you’re presented with a list of possible matches.  The google-esque type of functionality probably requires a different solution altogether than the solution I’m proposing.  This post, on the other hand, describes one possible implementation of a way to provide some (but not all) autocomplete functionality.  A slightly non-obvious benefit to autocomplete is that adding autocomplete functionality not only serves as a convenience to the end user, but also is a forcing function that helps to drive search relevance.  By guiding the user to a set of searches with a set of high quality search results, one can help guide the user to those results through query modification.

Trie trie again

A trie (pronounced “try”, as in retrieval) is simply a tree-like data structure where each node in the tree can have 0 -> N leaves.  The familiar binary tree, for example, is a trie which imposes the restriction that each node has at most 2 leaves (a left and right leaf). Prefix tries on the other hand are constructed such that each path to the root is a prefix of the path to the node, where each node contains a character and a set of children.  In other words, if we created a prefix trie as such:


c > o > [d > e, k > e]

In this trie, “code” is one potential path, and each path to the root, “cod”, “co” and “c” are all prefixes of “code”.  However, not all of those prefixes are valid words, so the prefix trie must be able to distinguish between good and bad entries.  In my sample application, I use this use this data structure to efficiently store an entire dictionary, with the guarantee that there is only one path per entry (autocomplete result).  To get to the list of results that can match, I traverse the prefix trie as far as I can go with the user’s query, then use various traversal techniques to find all suffixes for a particular prefix.

Here’s a sample implementation for a prefix trie:

var Node = function(c, parent) {
    var self = this;
    this.c = c; // the value of the node
    this.children = {}; // a map representing the children
    this.parent = parent; // a pointer to the parent
    this.terminal = false; // whether there is a word at this node
    
    /* insert a word at this node */
    this.insert = function(word) {
        if (!word || word.length === 0) { return; }
        ch = word[0];
        if (!self.children.hasOwnProperty(ch)) {
            self.children[ch] = new Node(ch, self);
        }
        var n = self.children[ch];
        if (word.length === 1) { self.children[ch].terminal = true; }
        n.insert(word.substring(1));
    };
};
    

A few words about this implementation:

1) Node – Each node in my implementation has one character as the value and a set of children.  Prefix tries can be constructed with a variety of data structures to represent children for each node (e.g. arrays, linked lists, maps).  In my implementation, I opt to use hash maps for space optimality and constant speed lookup.  However, other popular implementations use arrays or sorted linked lists, which enables a fast sorting of a set of words through a preorder depth first search of the prefix trie (described below).
2) terminal: Boolean – For each word inserted, a path is created from the root node to the leaf node. Thus, it can be assumed that any path from the root node to a leaf node without children is a valid entry. However, there are valid entries that are prefixes of other entries, so there must be a way to distinguish between a valid and invalid path. Hence the terminal flag.
3) duplicates are not stored in the prefix trie (ergo, each path represents a unique result).  We could easily store the number of occurrences of each prefix in the trie.

And here are some interesting properties of prefix tries:
1) Prefix tries can be faster than hash maps for word look up – The longest path to look up a word in the trie takes O(m) time, where m is the length of the word. For a hash map, with a bad hashing function, you could take in the worst case scenario O(n) time, where n is the number of entries in the hash map.
2) Suffix detection – any path to a parent node represents a prefix of all paths through the parent node. Thus, it’s possible (and necessary) to construct an array of all suffixes stemming from the prefix and autocompleting a given result.
3) Sorting – as described above, prefix tries can be used to sort a set of words so long as the children keys are sorted. In this implementation I use hash maps, so the keys are not sorted by default (but we could opt for a different data structure).  Once the prefix trie is constructed, one only needs to traverse via DFS.

4) Prefix tries can be space efficient – Since only all unique prefixes need to be stored in the trie, we don’t need to store needless duplicate keys.

Prefix Trie Traversal

As with typical trees, one can opt for different methods of traversal to construct substrings from a parent node. I’ve provided a few different implementations below.

    /* BFS */
    this.traverseBFS = function(array, prefix, limit) {
        var self = this;
        var queue = [];
        queue.push([self, prefix])
        while (queue.length > 0) {
			var entry = queue[0];
            var n = entry[0];
            var p = entry[1];
			if (n.terminal) {
                array.push(p + n.c);
                if (limit > 0 && array.length >= limit) {
                    break;
                }
            } 
            for (var k in n.children) {
                queue.push([n.children[k], p + n.c])
            }
            queue = queue.splice(1);
        }
    };
    
    /* DFS */
    this.traverseDFS = function(array, prefix, limit) {
        var self = this;
        var keys = Object.keys(self.children);
		for (var i = 0; i < keys.length; i++) { if (limit > 0 && array.length >= limit) {
                break;
            }
            self.children[keys[i]].traverseDFS(array, prefix + keys[i], limit);
        }
        if (!self.terminal || (limit > 0 && array.length >= limit)) {
            return;
        }
        array.push(prefix);
    };

BFS (breadth first search) – With breadth-first search, as we traverse to each child node, we invoke the traverse function call for each child nodes’ children. This is equivalent to traversing by visiting all nodes at each depth before traversing each next level node. This returns matching autocomplete results in order of length. One gotcha here: since BFS uses a queue to control the order of the nodes to visit, we need a way to maintain the prefix up to each node. While each node contains a pointer to the parent, so we could conceivably traverse up to the root node, this would require up to M moves (where m is the length of the word) repeated N times (for each result), resulting in O(MN) time to construct the autocomplete list.

Alternatively, I keep track by simply enqueueing a tuple of the [Node, prefix: String] each time, such that we can construct the autocomplete list by traversing to each node once in O(N) time.

Screen Shot 2015-10-17 at 11.57.04 PM

DFS (depth first search) – With depth-first search, we traverse to the deepest node before starting to recurse. As described above, this method is useful for radix sort if the keys are kept in sorted order (they aren’t in my example). As with normal DFS, we can traverse the tree preorder, inorder or postorder. You can see from my example that I traverse as far left as possible until the left node is a leaf node, then I visit that node, then visit each sibling node called with the prefix up to that point.

depth first search

Some thoughts:

  • Without any intelligence (algorithmic or otherwise), autocomplete only has limited utility in cases where the result set is large.  You can see in this instance, where there are about 109k words in the dictionary, that limiting the user to 10 results probably wouldn’t get the user to the desired query.
  • This current implementation doesn’t account for mispellings in the word or any leniency (e.g. leaving characters out, adding extra characters).  However, we can combine elements of the ngram text search discussed in the first 5 parts of this series. Essentially, the first ngram of a word is also a prefix of the word, so we can use the prefix trie to do some quick filtering, then return a set of fuzzy matches.
Advertisements

A simple n-queens solution in javascript

N Queens (8)
N Queens (8)

I was recently just playing around with the n queens problem and decided to try to solve it. I’ve never done it before, but figured that I could use a depth first search kind of solution. I ended up building a solution in node.js which does identify all potential solutions, but decided to put it in angular.js to run on the front end so users can play around and visualize the solutions.

Essentially, the algorithm works like this:

  • If there’s only 1 queen left to place, and multiple possible positions, then it’s a valid solution. If it’s a unique solution, let’s store it.
  • If there is more than one queen left to place, and the number of valid positions left is less than the number of queens, then this is the wrong path, let’s backtrack.
  • If neither applies, lets iterate through all the next potential moves in the adjacent column, then recompute the possible positions from that grid state.
  • Finally, after we place the queen, we recursively call the algorithm for one less queen, a new grid state, recomputed valid positions and the next column.

I haven’t had much time to optimize, but a few ideas I could possibly do:

  1. Since a chess board is symmetrical, I probably don’t need to check all four corners (since they’ll yield the same result if the board is rotated 90 degrees)
  2. Use a stack instead for managing the valid positions instead of a simple array. Right now I actually make a copy of all new possible positions after a queen is placed.
  3. Use a better way to come up with unique solutions than concating a string and comparing each new solution that comes in

A couple optimizations that already helped:

  1. pruning next potential positions to only the starting column + 1 (enormous performance enhancement)
  2. pruning routes when number of queens left > number of open positions
  3. cutting down the length of the unique string when comparing objects

In any case, it was a fun little experiment. Here’s the source code, and here’s where you can see it working in jsfiddle. Warning!!! Angular tends to run pretty slow and blocks when you compute, so don’t try to run with too many queens (like > 12 or 13) otherwise you may cause your browser to freeze.

N Queens – http://jsfiddle.net/rocketegg0/wu6cpp5v/

Some average run times:

Run Times
Num Queens Solutions # Comparisons Run Time (ms)
4 2 12 <1
5 10 41 1
6 4 138 1
7 40 473 1
8 92 1758 7
9 352 7077 16
10 724 30654 45
11 2680 142755 360
12 14200 734048 4936
function Grid(width, height) {
        this.width = width;
        this.height = height;
        var grid = [];
        var validPositions = [];
    
        for (var i = 0; i < width; i++) {
            grid.push([]);
            for (var j = 0; j < height; j++) {
                grid[i][j] = '_';
                validPositions.push(new Pair(i,j))
            }
        }
    
        this.getValidPositions = function() {
            return validPositions;
        }
    
        this.getGrid = function() {
            return grid;
        }
        
        //Position validation
        function isQueen(pair, x, y) {
            return pair.x == x && pair.y == y;
        }
        
        function isInRow(pair, y) {
            return pair.y == y;
        }
        
        function isInCol(pair, x) {
            return pair.x == x;
        }
        
        //Ex: 
        //3, 2 > 4, 1 = -1, 1
        //3, 2 > 4, 3 = -1, -1
        //3, 2 > 2, 1 = 1, 1
        //3, 2 > 2, 3 = 1, -1
        function isInDiagonal(pair, x, y) {
            if (Math.abs(pair.x - x) == Math.abs(pair.y - y)) {
                return true;
            }
            return false;
        }
        
        function testPosition(pair, x, y) {
            if (isQueen(pair, x, y) ||
                isInRow(pair, y) ||
                isInCol(pair, x) ||
                isInDiagonal(pair, x, y)) {
                return true;
            }
            return false;
        }
    
        function recomputeValid(validPositions, queenX, queenY) {
            var newValid = [];
            for (var i = 0; i < validPositions.length; i++) {
                if (!testPosition(validPositions[i], queenX, queenY)) {
                    newValid.push(validPositions[i]);
                }
            }
            return newValid;
        }
        
        function gridToString(grid) {
            var gridstr = '';
            for (var i = 0; i < grid.length; i++) {
                gridstr += '['
                for (var j = 0; j < grid[0].length; j++) {
                    gridstr += grid[i][j];
                    if (j < grid.length-1) {
                        gridstr += '|'
                    }
                }
                gridstr += ']'
                gridstr += '\n';
            }
            return gridstr;
        }
    
        var printGrid = function (grid) {
            console.log(gridToString(grid));
        }
    
        var solutions = [];
        $scope.solution_grids = [];
    
        function convertToSolution(grid) {
            var str = '';
            for (var i = 0; i < grid.length; i++) {
                for (var j = 0; j < grid.length; j++) {
                    if (grid[i][j] == 'Q') {
                        str += i + '#' + j;
                    }
                }
            }
            return str;
        }
        
        this.numcomputations = 0;
    
        this.solve = function (numQueens, grid, validPositions, startcol) {
            if (numQueens <= 1 && validPositions.length > 0) {
                grid[validPositions[0].x][validPositions[0].y] = 'Q';
                var solution = convertToSolution(grid);
                if (solutions.indexOf(solution) == -1) {
                    solutions.push(solution);
                    $scope.solution_grids.push(gridToString(grid));
                }
                grid[validPositions[0].x][validPositions[0].y] = '_'; //reset
                return true;
            } else if (numQueens > validPositions.length) {
                return false;
            } else {
                var x, y;
                var nextcol = validPositions.filter(function(point) {
                    return point.x == startcol + 1 ? true : false;
                });    //prune routes to only next col
                for (var i = 0; i < nextcol.length; i++) {
                    x = nextcol[i].x, y = nextcol[i].y;
                    grid[x][y] = 'Q';
                    this.solve(numQueens - 1, grid,
                               recomputeValid(validPositions, x, y), startcol+1);
                    grid[x][y] = '_';	//reset
                    this.numcomputations++;
                }
            }
        }
    }
    
    $scope.solve = function() {
        var nqueens = $scope.sides;
        var grid = new Grid(nqueens, nqueens);
        var startTime = new Date().getTime();
        console.log('Start Time: ' + startTime);
        grid.solve(nqueens, grid.getGrid(), grid.getValidPositions(), 0);
        $scope.numcomputations = grid.numcomputations;
        $scope.endTime = new Date().getTime() - startTime;
        console.log('End Time: ' + ($scope.endTime) + 'ms');
    };