Search Engine Series Part III – Contextual Search

In part I of this search series, I covered generating the ngram index. In part II, I covered exact pattern matching: through a clever use of array intersection, we were able to eliminate false positives and irrelevant results when matching a user's query. However, that ability by itself is not very useful if we can't provide context around each match.

In this post, I discuss how to provide context (i.e. words before and after) with each “hit”.

How to do it

Again, as with adding exact pattern matching, adding context to your search isn't difficult. Previously, our ngram data structure keyed each trigram to a set of words matching that ngram, but it lost track of where in the original document each word could be found. If we kept that approach, then for every hit returned from our search we would have to traverse the entire document to locate the matches before returning results. Not good.
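To make the problem concrete, here's roughly the shape of the old index (the exact contents depend on your Part I/II implementation):

// Old index shape from Parts I/II: each trigram maps to the word strings
// containing it. After a lookup we know *which* words matched, but not
// *where* they sit in the original document, so providing context would
// mean rescanning the whole text.
var oldIndex = {
    "sea": ["search", "seas"],
    "ear": ["search", "hear", "early"]
};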

Instead, it's better to keep pointers to the relevant word in the document. Since words can repeat within a document, we simply pass the word's index in the split document into a new Word object along with the word itself. Here's the class:

// Word: has a string value and an int position within the document
var Word = function(value, position) {
    var self = this;
    this.value = value;
    this.position = position;
    // Two Words are equal if they have the same value at the same position
    this.equals = function(word2) {
        return self.value == word2.value && self.position == word2.position;
    };
};

Then we need to modify the ngramize method to fit our new data structure. Instead of pushing the word string onto the array, we instantiate a new Word() object for each ngram match.

// NGramize a word: record a Word (value + position) under every trigram in `word`
function ngramize(hash, word, index) {
    for (var i = 3; i < word.length + 1; i++) {
        var s = word.substring(i - 3, i);
        if (!hash.hasOwnProperty(s)) {
            hash[s] = [new Word(word, index)];
        } else {
            // todo: still eliminate dupes
            hash[s].push(new Word(word, index));
        }
    }
    return hash;
}

Note that this method has remained nearly the same as in the previous implementation.

Next, we need to rewrite merge to intersect all arrays of matches based on Word equality. To do this, I wrote a simple equals method that lets one Word be compared to another: a Word is equivalent to another Word if both the value and the position are identical (i.e. they refer to the same word at the same place in the document).
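The linked sample implementation has the real intersect; as a minimal sketch (not necessarily the exact version used there), it could look something like this, keeping every Word from the first array that has an equal Word in the second:

// Minimal intersect sketch: keep each Word in arrayA that also
// appears (by value and position) somewhere in arrayB.
var intersect = function(arrayA, arrayB) {
    return arrayA.filter(function(wordA) {
        return arrayB.some(function(wordB) {
            return wordA.equals(wordB);
        });
    });
};

Note that this pairwise check is quadratic per pair of arrays; improvement 3 further down addresses that.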

Finally, our merge method is rewritten to take advantage of this intersect method.

var merge = function(arrays) {
    if (!arrays || arrays.length === 0) return [];
    // Use one array as the base, then intersect the rest against it
    var base = arrays.pop();
    return arrays.reduce(intersect, base);
};

See how the merge method has been simplified? Awesome!

End result

The end result is that instead of getting back an array of strings (i.e. word matches) for the user's query, we now get an array of Words. Since each Word holds a pointer to its place in the original document, all we need to do is take a slice of the split original document around that position, sized by how much context the user wants, and join the words back together with spaces.

Contextual matching with # word look-behind/ahead
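As a rough sketch of that slicing step (assuming the original document was split on spaces into the same token array whose indices we stored in each Word, and that contextSize is the number of words to show on either side; the names here are hypothetical):

// Hypothetical helper: return `contextSize` words of look-behind and
// look-ahead around a matched Word, rejoined with spaces.
var getContext = function(tokens, word, contextSize) {
    var start = Math.max(0, word.position - contextSize);
    var end = Math.min(tokens.length, word.position + contextSize + 1);
    return tokens.slice(start, end).join(" ");
};

// e.g. getContext(originalText.split(" "), hit, 3)
// => "three words before match three words after"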

For a sample implementation, see here: Search Part III: Contextual Search

Disclaimer: Again, this is just a toy implementation! To productionize properly, you may want to implement a few of the other features mentioned below.

Improving Search Further

If you're still with me, you'll find a lot of opportunities for optimization. I haven't added them to my toy implementation, but they wouldn't be hard to do. As the original document grows in size, these enhancements become more and more necessary, but I've kept things simple for readability.

A few potential areas for improvement:

  1. Space efficiency.  Notice how we're storing the value of the word (i.e. the string) in each Word.  That's really wasteful!  Words can be long (e.g. "supercalifragilisticexpialidocious"), so it's much more efficient to store only a pointer to the word, i.e. its position within the original document, rather than the string itself.
  2. Smarter precomputation of the ngram index.  We can sort the arrays for faster intersection!  Insertion during the initial computation of the ngram hash is O(1) right now (we just push to the end of the array), but we can trade slower initial computation for faster intersection later on.  Read: keeping the ngram arrays sorted requires binary insertion of each new Word, where finding the insertion point takes O(log n) time (plus the cost of shifting elements in the array).
  3. Faster array intersection.  If we store pointers to words (their positions) instead of the strings themselves, and keep the arrays sorted as in #2, we can intersect a pair of arrays in O(n) time with a linear merge instead of O(n^2) pairwise comparisons.  An awesome speedup!  See the sketch after this list.
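As a rough sketch of improvement 3 (assuming improvement 2 has already kept each array sorted by position), a two-pointer walk intersects a pair of Word arrays in linear time. Since each position in the document holds exactly one word, comparing positions alone is enough here:

// Two-pointer intersection of two Word arrays sorted by position.
// Runs in O(n + m), versus the O(n * m) pairwise version above.
var intersectSorted = function(arrayA, arrayB) {
    var result = [], i = 0, j = 0;
    while (i < arrayA.length && j < arrayB.length) {
        if (arrayA[i].position === arrayB[j].position) {
            result.push(arrayA[i]);
            i++;
            j++;
        } else if (arrayA[i].position < arrayB[j].position) {
            i++;
        } else {
            j++;
        }
    }
    return result;
};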

Note: I’ll implement these things at some point in the future, so you have something to look at.

And of course, on top of all this, we could add run-of-the-mill features like caching, keeping the index in memory, distributing indices, etc.

Next Up:

Finally, what else can we do to improve search?  Well, right now our search is case-sensitive, which isn't good.  We'll want to fix that.  Furthermore, our search requires exact matching and doesn't allow for "fuzziness" in the query.  I don't know about you guys, but I have a hard time spelling things correctly.  My fingers move too fast for the keyboard to keep up; it's a problem stemming from the AIM era, so forgive me 🙂

We can also add contextual highlighting within our results. This is a big visual help that lets the user see where within each result the query was found.

See you next time!
