Search Engine Series Part I – Basic ngram indexing

At Twitter, I’ve been spending a lot of time thinking about search engines and how to effectively index and surface data.  When it comes down to it, indexing methods mean a lot, but so do other some of “brute force” factors like:

  1. the quality of the data indexed
  2. the amount of data indexed
  3. transforming search queries through an effective DSL
  4. surfacing and ranking search results

Search has no silver bullet

Perhaps I’ve been learning more and more that there’s no magic bullet to search.  There’s several methods to indexing volumes of data, e.g. string tokenization, ngram tokenization; there’s levels of strictness of search, e.g. exact matches, fuzzy matches, partial matches; then there’s transformation of the search query itself, e.g. what indices to search, what parts of a query must be included in search results, fuzzyness matching and mispelling.

In this next series of posts, I’ll be describing some search basics that may be useful to the reader.

Computing NGram indices

Ngram indices are a good starting point for creating a searchable index of words.  I’ve included an example implementation here: Part 1: Constructing a Trigram index

Disclaimer: In this example, I use a toy implementation and trigrams for the sake of simplicity.  For the dictionary “corpus”, I use the Declaration of Independence.  The code examples are meant to only illustrate the basic implementation details and aren’t meant for production use (however, the code will be refined over time).

Constructing ngram indices is very simple – essentially one constructs a map of n-character length substrings from a dictionary of words, where each ngram maps onto the array of matching full words.  Different techniques can then be applied on top of this, including

  1. Scoring functions – for example, for the number of matching words in a document, take a basic distance b (where b is the # characters that match)
  2. Fuzzy matching – replace one character for each character of the trigram to generate a list of other potential matches
  3. Context – keep pointers to place in document to determine the contextual sentence

    // Compute NGrams
    var computeNGrams = function(dict) {     
        return dict.split(' ').reduce(function(hash, word) {
            return ngramize(hash, word);
        }, {});
    };
    
    // NGramize a word
    function ngramize(hash, word) {
        for (var i = 3; i < word.length+1; i++) {
            var s = word.substring(i - 3, i);
            if (!hash.hasOwnProperty(s)) {
                hash[s] = [word]
            } else {
                if (hash[s].indexOf(word) == -1) {
                    hash[s].push(word)
                }
            }
        }
        return hash;
    }

Algorithm

You can see that creation is simple:  Essentially, you just pass in a word of length i, traverse through the word creating all substrings of length N, where N is the “N” in “NGram”.  In this case, I’ve chosen to use trigrams, or NGrams of length 3.

As the user types in his/her query, we break that query apart into ngrams as well, and match against our ngram map (i.e. the corpus of data).  This gives us an array of arrays of matching words per query.  Then we simply flatten the array, remove duplicates and present the results to the user.

Example:

Too many matches
Too many matches

However, the astute reader will notice that matching ngrams to arrays of occurrences does not provide a useful search in itself whenever the query extends beyond the ngram length.  How do we solve for this?  The solution, it turns out is fairly simple.  Stay tuned for the next post.

Advertisements

A simple n-queens solution in javascript

N Queens (8)
N Queens (8)

I was recently just playing around with the n queens problem and decided to try to solve it. I’ve never done it before, but figured that I could use a depth first search kind of solution. I ended up building a solution in node.js which does identify all potential solutions, but decided to put it in angular.js to run on the front end so users can play around and visualize the solutions.

Essentially, the algorithm works like this:

  • If there’s only 1 queen left to place, and multiple possible positions, then it’s a valid solution. If it’s a unique solution, let’s store it.
  • If there is more than one queen left to place, and the number of valid positions left is less than the number of queens, then this is the wrong path, let’s backtrack.
  • If neither applies, lets iterate through all the next potential moves in the adjacent column, then recompute the possible positions from that grid state.
  • Finally, after we place the queen, we recursively call the algorithm for one less queen, a new grid state, recomputed valid positions and the next column.

I haven’t had much time to optimize, but a few ideas I could possibly do:

  1. Since a chess board is symmetrical, I probably don’t need to check all four corners (since they’ll yield the same result if the board is rotated 90 degrees)
  2. Use a stack instead for managing the valid positions instead of a simple array. Right now I actually make a copy of all new possible positions after a queen is placed.
  3. Use a better way to come up with unique solutions than concating a string and comparing each new solution that comes in

A couple optimizations that already helped:

  1. pruning next potential positions to only the starting column + 1 (enormous performance enhancement)
  2. pruning routes when number of queens left > number of open positions
  3. cutting down the length of the unique string when comparing objects

In any case, it was a fun little experiment. Here’s the source code, and here’s where you can see it working in jsfiddle. Warning!!! Angular tends to run pretty slow and blocks when you compute, so don’t try to run with too many queens (like > 12 or 13) otherwise you may cause your browser to freeze.

N Queens – http://jsfiddle.net/rocketegg0/wu6cpp5v/

Some average run times:

Run Times
Num Queens Solutions # Comparisons Run Time (ms)
4 2 12 <1
5 10 41 1
6 4 138 1
7 40 473 1
8 92 1758 7
9 352 7077 16
10 724 30654 45
11 2680 142755 360
12 14200 734048 4936
function Grid(width, height) {
        this.width = width;
        this.height = height;
        var grid = [];
        var validPositions = [];
    
        for (var i = 0; i < width; i++) {
            grid.push([]);
            for (var j = 0; j < height; j++) {
                grid[i][j] = '_';
                validPositions.push(new Pair(i,j))
            }
        }
    
        this.getValidPositions = function() {
            return validPositions;
        }
    
        this.getGrid = function() {
            return grid;
        }
        
        //Position validation
        function isQueen(pair, x, y) {
            return pair.x == x && pair.y == y;
        }
        
        function isInRow(pair, y) {
            return pair.y == y;
        }
        
        function isInCol(pair, x) {
            return pair.x == x;
        }
        
        //Ex: 
        //3, 2 > 4, 1 = -1, 1
        //3, 2 > 4, 3 = -1, -1
        //3, 2 > 2, 1 = 1, 1
        //3, 2 > 2, 3 = 1, -1
        function isInDiagonal(pair, x, y) {
            if (Math.abs(pair.x - x) == Math.abs(pair.y - y)) {
                return true;
            }
            return false;
        }
        
        function testPosition(pair, x, y) {
            if (isQueen(pair, x, y) ||
                isInRow(pair, y) ||
                isInCol(pair, x) ||
                isInDiagonal(pair, x, y)) {
                return true;
            }
            return false;
        }
    
        function recomputeValid(validPositions, queenX, queenY) {
            var newValid = [];
            for (var i = 0; i < validPositions.length; i++) {
                if (!testPosition(validPositions[i], queenX, queenY)) {
                    newValid.push(validPositions[i]);
                }
            }
            return newValid;
        }
        
        function gridToString(grid) {
            var gridstr = '';
            for (var i = 0; i < grid.length; i++) {
                gridstr += '['
                for (var j = 0; j < grid[0].length; j++) {
                    gridstr += grid[i][j];
                    if (j < grid.length-1) {
                        gridstr += '|'
                    }
                }
                gridstr += ']'
                gridstr += '\n';
            }
            return gridstr;
        }
    
        var printGrid = function (grid) {
            console.log(gridToString(grid));
        }
    
        var solutions = [];
        $scope.solution_grids = [];
    
        function convertToSolution(grid) {
            var str = '';
            for (var i = 0; i < grid.length; i++) {
                for (var j = 0; j < grid.length; j++) {
                    if (grid[i][j] == 'Q') {
                        str += i + '#' + j;
                    }
                }
            }
            return str;
        }
        
        this.numcomputations = 0;
    
        this.solve = function (numQueens, grid, validPositions, startcol) {
            if (numQueens <= 1 && validPositions.length > 0) {
                grid[validPositions[0].x][validPositions[0].y] = 'Q';
                var solution = convertToSolution(grid);
                if (solutions.indexOf(solution) == -1) {
                    solutions.push(solution);
                    $scope.solution_grids.push(gridToString(grid));
                }
                grid[validPositions[0].x][validPositions[0].y] = '_'; //reset
                return true;
            } else if (numQueens > validPositions.length) {
                return false;
            } else {
                var x, y;
                var nextcol = validPositions.filter(function(point) {
                    return point.x == startcol + 1 ? true : false;
                });    //prune routes to only next col
                for (var i = 0; i < nextcol.length; i++) {
                    x = nextcol[i].x, y = nextcol[i].y;
                    grid[x][y] = 'Q';
                    this.solve(numQueens - 1, grid,
                               recomputeValid(validPositions, x, y), startcol+1);
                    grid[x][y] = '_';	//reset
                    this.numcomputations++;
                }
            }
        }
    }
    
    $scope.solve = function() {
        var nqueens = $scope.sides;
        var grid = new Grid(nqueens, nqueens);
        var startTime = new Date().getTime();
        console.log('Start Time: ' + startTime);
        grid.solve(nqueens, grid.getGrid(), grid.getValidPositions(), 0);
        $scope.numcomputations = grid.numcomputations;
        $scope.endTime = new Date().getTime() - startTime;
        console.log('End Time: ' + ($scope.endTime) + 'ms');
    };