Search Series Part VII – Autocomplete with Wild Cards

In the last post, I discussed using prefix tries for fast autocomplete of words within a dictionary.  In this post, I take it one step further and share some insights that we can glean from prefix tries on how to expand the autocomplete functionality by allowing wildcard search.  You can see the sample implementation here:

Search Part VII – Autocomplete with wild cards

If you recall, prefix trie construction took linear time, O(n), where n is the total number of characters across the words inserted into the trie.  We read in each word from the dictionary and inserted it character by character starting from the root.  From the prefix trie, we were able to traverse to the node for any prefix, then with DFS or BFS collect all words stemming from that node.  In other words, we were able to quickly find all words beginning with a given prefix of length m.

What if we flipped our intuition around and wanted to find all words that contained a particular suffix?  E.g. “ing”, “tion”, “lion”?  We can use the same approach for building the prefix trie to build a suffix trie.

Suffix Trie

A suffix trie follows the same structure as a prefix trie: each node contains a single-character value and a set of child nodes, each pointing to a subtrie or to null.  Taking the insight from the prefix trie, we can intuitively see that a word’s suffix is essentially a prefix of the reversed word:

“abcdefg” -> “g”, “fg”, “efg”, “defg” etc are all suffixes.  If we reverse the word,

“gfedcba”, we can construct a prefix trie out of the map of reversed words.  In this case, “g” becomes the first node under the root, followed by “f”, “e”, “d”, and so on.
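This reversal insight is easy to sanity-check in a few lines (the reverse() helper below is hypothetical, not part of the sample app):

```javascript
// A word's suffixes become prefixes once the word is reversed.
// Strings have no reverse() method, so split into characters first.
function reverse(s) {
  return s.split('').reverse().join('');
}

var word = 'abcdefg';
var reversed = reverse(word);        // 'gfedcba'

// The suffix 'efg' of word is the prefix 'gfe' of reversed:
var suffix = 'efg';
reversed.indexOf(reverse(suffix));   // 0, i.e. a prefix
```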

Construction of the suffix trie is trivial, once we have the code in place for constructing the prefix trie.  From our dictionary, all that is needed is to generate a dictionary using each word in reverse, then pass it into our trie constructor.

/* Example: $scope.dictionary starts as
 * $scope.dictionary = {
 *     'aardvark': '...',
 *     'actor': '...',
 *     'adulation': '...'
 * }
 */
$scope.inverseDictionary = Object.keys(response.data).reduce(function(memo, key) {
  // strings have no reverse() method, so split into characters first
  memo[key.split('').reverse().join('')] = response.data[key];
  return memo;
}, {});
/* $scope.inverseDictionary becomes:
 * {
 *    'kravdraa': '...',
 *    'rotca': '...',
 *    'noitaluda': '...'
 * }
 */

$scope.root = toTrie(Object.keys($scope.dictionary));
$scope.reverseRoot = toTrie(Object.keys($scope.inverseDictionary));

Suffix/Prefix Trie Insights

Note that a suffix trie is essentially a simplified subset of a “suffix tree,” which contains a path for every suffix of every word inserted into the data structure.  Compare this to the simpler approach I’ve taken here, which just contains each word of the dictionary, reversed.

Now when the user types a query into the search box, we can scan our prefix trie to get the set of all words in the dictionary where the query is a prefix, and simultaneously retrieve the set of all words where the query is a suffix (that set of results comes back reversed, so we need to reverse each entry again).  This extends our base functionality by allowing users to specify a wildcard (e.g. ‘*’) in the query string, which we use to decide how to scan both tries.  I won’t cover the trie traversal here; you can see an implementation in my past post: Search Series Part VI – Autocomplete & Prefix Tries

Example:

Prefix trie traversal – "act*" would yield ["actor", "acting", etc.]
Suffix trie traversal – "*act" would yield ["exact", "enact", etc.]

Finally, by maintaining both a prefix trie and a suffix trie, we can implement basic intermediate wildcard functionality (up to 1 wildcard).  All we need to do is find the set of prefix results and suffix results and intersect the arrays.

As an aside: why doesn’t this approach support multiple wildcards (e.g. int*me*tion, which would yield [“intermediation”, “intermention”, etc.])?  The answer lies in the implementation of the suffix trie, which only contains each entire word (as opposed to all of its suffixes).  If we tokenize int*me*tion into [“int”, “me”, “tion”], we see that “me” will not return the prefix/suffix results we expect: with our given prefix and suffix tries, we can only identify words that either begin with “me” or end with “me,” not words with “me” somewhere in the middle.  To add this functionality, the implementation would need to be expanded to full-blown suffix trees.  (Alternatively, we could use something more powerful like ternary search trees.)
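As an illustration of that expansion, here is a sketch of enumerating all suffixes of a word; inserting each of these into the trie (rather than only the reversed whole word) is what would let a token like “me” match in the middle of a word, at the cost of extra space (allSuffixes() is a hypothetical helper, not in the sample app):

```javascript
// All suffixes of a word. Inserting each into a prefix trie means any
// infix of the word is a prefix of one of these suffixes.
function allSuffixes(word) {
  var out = [];
  for (var i = 0; i < word.length; i++) {
    out.push(word.substring(i));
  }
  return out;
}

allSuffixes('code'); // ['code', 'ode', 'de', 'e']
```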

Example:

Prefix & suffix trie intersection – "inter*tion" yields ["intersection", "interaction", etc.]

(i.e. all matching words where “inter” is a prefix and “tion” is a suffix)

Here’s a sample implementation:

var wildcard = function(textIn) {
  if (!textIn) { return; }
  var substrs = textIn.split('*');
  var p = textIn === '*' ? 1 : 0;
  var s = textIn[textIn.length - 1] === '*' ? 1 : 0;
  var prefixes = substrs.length > 1 ? substrs.slice(p, substrs.length - 1 + s) : substrs;
  var suffixes = substrs.length > 1 ? substrs.slice(1 - p, substrs.length - s) : substrs;

  $scope.results = _(prefixes).map(function(prefix) {
    return $scope.autocomplete(prefix, $scope.root);
  }).flatten().value();
  $scope.reverseResults = reverse(_(suffixes).map(function(suffix) {
    // reverse the suffix token before walking the reversed-word trie
    return $scope.autocomplete(suffix.split('').reverse().join(''), $scope.reverseRoot);
  }).flatten().value());
  $scope.intersectedResults = _.intersection($scope.results, $scope.reverseResults);
};

Search Series Part VI – Autocomplete & Prefix Tries

I’ve spent a lot of the last posts demonstrating how to perform full-text fuzzy searches, highlighting, and index generation, and talked about some of the pros and cons of ngram indexing.  This post departs from search and focuses on solving a different problem: autocomplete.  As I mentioned in the first post, there is no silver bullet to search relevance. At Twitter, this means trying to tackle relevance from many angles – index generation, manual ranking of documents, query classification, etc.  Another means of getting better relevance is by trying to guide the user’s query to known queries with good sets of results.  In this post, I’ll demonstrate how to build out autocomplete functionality that can be used to get the user to the right query.

See the sample implementation here: Search Series Part VI – Autocomplete

Why autocomplete?

This should be pretty obvious to most readers.  You’ve most likely encountered autocomplete in many places: when you type into Google’s search bar, you get a list of suggested queries; when you begin typing an address into Google Maps, you’re presented with a list of possible matches.  That Google-scale functionality probably requires a different solution altogether than the one I’m proposing; this post describes one possible implementation that provides some (but not all) of that functionality.  A slightly non-obvious benefit is that autocomplete not only serves as a convenience to the end user, but also acts as a forcing function that helps drive search relevance: by guiding the user toward queries with known high-quality results, we can steer the user to those results through query modification.

Trie trie again

A trie (pronounced “try”, as in retrieval) is simply a tree-like data structure where each node can have 0 to N children.  The familiar binary tree, for example, imposes the restriction that each node has at most two children (a left and a right).  Prefix tries, on the other hand, are constructed such that each path from the root spells out a prefix of every path continuing through it, where each node contains a character and a set of children.  In other words, if we created a prefix trie as such:

c > o > [d > e, k > e]

In this trie, “code” is one potential path, and each partial path from the root – “c”, “co”, and “cod” – is a prefix of “code”.  However, not all of those prefixes are valid words, so the prefix trie must be able to distinguish between good and bad entries.  In my sample application, I use this data structure to efficiently store an entire dictionary, with the guarantee that there is only one path per entry (autocomplete result).  To get to the list of results that can match, I traverse the prefix trie as far as I can go with the user’s query, then use various traversal techniques to find all suffixes for that prefix.

Here’s a sample implementation for a prefix trie:

var Node = function(c, parent) {
  var self = this;
  this.c = c; // the value of the node
  this.children = {}; // a map representing the children
  this.parent = parent; // a pointer to the parent
  this.terminal = false; // whether a word ends at this node

  /* insert a word at this node */
  this.insert = function(word) {
    if (!word || word.length === 0) { return; }
    var ch = word[0]; // the first character keys into children
    if (!self.children.hasOwnProperty(ch)) {
      self.children[ch] = new Node(ch, self);
    }
    var n = self.children[ch];
    if (word.length === 1) { n.terminal = true; }
    n.insert(word.substring(1));
  };
};

1) Node – Each node in my implementation has one character as the value and a set of children.  Prefix tries can be constructed with a variety of data structures to represent children for each node (e.g. arrays, linked lists, maps).  In my implementation, I opt to use hash maps for space optimality and constant speed lookup.  However, other popular implementations use arrays or sorted linked lists, which enables a fast sorting of a set of words through a preorder depth first search of the prefix trie (described below).
2) terminal: Boolean – For each word inserted, a path is created from the root node to the leaf node. Thus, it can be assumed that any path from the root node to a leaf node without children is a valid entry. However, there are valid entries that are prefixes of other entries, so there must be a way to distinguish between a valid and invalid path. Hence the terminal flag.
3) Duplicates – duplicates are not stored in the prefix trie (ergo, each path represents a unique result).  We could easily store the number of occurrences of each prefix in the trie, however.
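The “traverse the prefix trie as far as I can go” step mentioned earlier isn’t shown in the sample code; here is a self-contained sketch using the same Node shape, where find() is my own addition rather than the app’s actual method:

```javascript
// Minimal Node (matching the post's shape) plus a hypothetical find()
// that walks the trie along a prefix and returns the node reached.
var Node = function(c, parent) {
  var self = this;
  this.c = c;
  this.children = {};
  this.parent = parent;
  this.terminal = false;

  this.insert = function(word) {
    if (!word || word.length === 0) { return; }
    var ch = word[0];
    if (!self.children.hasOwnProperty(ch)) {
      self.children[ch] = new Node(ch, self);
    }
    if (word.length === 1) { self.children[ch].terminal = true; }
    self.children[ch].insert(word.substring(1));
  };

  // Walk one character at a time; null means the prefix isn't present.
  this.find = function(prefix) {
    var n = self;
    for (var i = 0; i < prefix.length; i++) {
      n = n.children[prefix[i]];
      if (!n) { return null; }
    }
    return n;
  };
};

var root = new Node('', null);
['code', 'coke', 'cod'].forEach(function(w) { root.insert(w); });
root.find('cod').terminal; // true — 'cod' is a word
root.find('xyz');          // null — no such prefix
```

From the node that find() returns, a DFS or BFS (shown below) collects the completions.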

And here are some interesting properties of prefix tries:
1) Prefix tries can be faster than hash maps for word lookup – The longest path to look up a word in the trie takes O(m) time, where m is the length of the word. For a hash map with a bad hashing function, lookup could take O(n) time in the worst case, where n is the number of entries in the hash map.
2) Suffix detection – any path to a parent node represents a prefix of all paths through that node. Thus, it’s possible (and necessary) to collect all suffixes stemming from a prefix node in order to autocomplete a given query.
3) Sorting – as described above, prefix tries can be used to sort a set of words so long as the children keys are sorted. In this implementation I use hash maps, so the keys are not sorted by default (but we could opt for a different data structure).  Once the prefix trie is constructed, one only needs to traverse via DFS.

4) Prefix tries can be space efficient – Since only all unique prefixes need to be stored in the trie, we don’t need to store needless duplicate keys.
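To illustrate point 3, here is a minimal sketch of sorting via a trie.  It uses a bare nested-object trie and a '$' terminal marker (both my own conventions, not the post’s Node class): sorting the hash-map keys at each level during the DFS is enough to emit words in lexicographic order.

```javascript
// Insert a word into a bare nested-object trie; '$' marks a terminal.
function insert(trie, word) {
  var n = trie;
  for (var i = 0; i < word.length; i++) {
    n = n[word[i]] = n[word[i]] || {};
  }
  n.$ = true;
}

// Preorder DFS with keys sorted at each level => lexicographic output.
function sortedWords(trie, prefix, out) {
  if (trie.$) { out.push(prefix); }
  Object.keys(trie).filter(function(k) { return k !== '$'; })
    .sort()
    .forEach(function(k) { sortedWords(trie[k], prefix + k, out); });
  return out;
}

var t = {};
['cod', 'actor', 'code', 'act'].forEach(function(w) { insert(t, w); });
sortedWords(t, '', []); // ['act', 'actor', 'cod', 'code']
```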

Prefix Trie Traversal

As with typical trees, one can opt for different methods of traversal to construct substrings from a parent node. I’ve provided a few different implementations below.

/* BFS */
this.traverseBFS = function(array, prefix, limit) {
  var self = this;
  var queue = [];
  queue.push([self, prefix]);
  while (queue.length > 0) {
    var entry = queue[0]; // next [node, prefix] tuple
    var n = entry[0];
    var p = entry[1];
    if (n.terminal) {
      array.push(p + n.c);
      if (limit > 0 && array.length >= limit) {
        break;
      }
    }
    for (var k in n.children) {
      queue.push([n.children[k], p + n.c]);
    }
    queue = queue.splice(1); // dequeue the entry we just processed
  }
};

/* DFS */
this.traverseDFS = function(array, prefix, limit) {
  var self = this;
  var keys = Object.keys(self.children);
  for (var i = 0; i < keys.length; i++) {
    if (limit > 0 && array.length >= limit) {
      break;
    }
    self.children[keys[i]].traverseDFS(array, prefix + keys[i], limit);
  }
  if (!self.terminal || (limit > 0 && array.length >= limit)) {
    return;
  }
  array.push(prefix);
};

BFS (breadth first search) – With breadth-first search, as we traverse to each child node, we invoke the traverse function for each child node’s children. This is equivalent to visiting all nodes at each depth before moving to the next level, and it returns matching autocomplete results in order of length. One gotcha here: since BFS uses a queue to control the order of the nodes to visit, we need a way to maintain the prefix up to each node. While each node contains a pointer to its parent, so we could conceivably walk back up to the root node, this would require up to M moves (where M is the length of the word) repeated N times (once per result), resulting in O(MN) time to construct the autocomplete list.

Alternatively, I keep track by simply enqueueing a [Node, prefix: String] tuple each time, so that we can construct the autocomplete list by visiting each node once, in O(N) time.

DFS (depth first search) – With depth-first search, we traverse to the deepest node before starting to recurse. As described above, this method is useful for radix sort if the keys are kept in sorted order (they aren’t in my example). As with normal DFS, we can traverse the tree preorder, inorder, or postorder. You can see from my example that I traverse as far left as possible until I reach a leaf node, visit that node, then visit each sibling node called with the prefix up to that point.

Some thoughts:

• Without any intelligence (algorithmic or otherwise), autocomplete only has limited utility in cases where the result set is large.  You can see in this instance, where there are about 109k words in the dictionary, that limiting the user to 10 results probably wouldn’t get the user to the desired query.
• This current implementation doesn’t account for misspellings in the word or any leniency (e.g. leaving characters out, adding extra characters).  However, we can combine elements of the ngram text search discussed in the first 5 parts of this series. Essentially, the first ngram of a word is also a prefix of the word, so we can use the prefix trie to do some quick filtering, then return a set of fuzzy matches.

Search Series Part V – Search Optimizations (multi-word fuzzy match and highlighting)

In the previous search engine series, I demonstrated how to implement a naive fuzzy match using Levenshtein distance.  I call it naive because the distance metric is based on matching the user’s query against individual words in the dictionary.  In today’s blog post, I show how to extend that functionality across multiple words, taking into account spatial locality (an overloaded term) regardless of where they occur in the original document.  I also show you how to add contextual highlighting of the user’s query in the search results.

Foreword: I’ve also done some refactoring to clean up the code a bit.  I had begun writing the text search example using custom methods, but decided for readability’s sake to extend Array.prototype with some useful functionality.  Another library out there that does pretty much the same array transformations I need is lodash, but I opted not to include it (no shortcuts!).  However, where certain functions are implemented in standard JavaScript (ES5), I’ve used those (e.g. Array.prototype.filter).  I think the code reads more cleanly now.

Multi-word fuzzy matches

There’s no real secret to multi-word matches.  For those keeping up with this series, if you look at the code, you’ll notice that it all builds on the last iteration.  Multi-word search is no different.  Previously, I keyed the ngram index using a list of ngrams formulated from the user’s query.  Our search method followed this pattern:

user query > array of user query ngrams > array of fuzzy query ngrams > array of potential matches > sorted array of word matches

See the image below (which I spent way too long creating).  You can see from this diagram that when the user entered multiple words, we would never get good matches unless the words were adjacent.  The reason is simple: two different words will never have intersecting potential matches.  While they could have intersecting fuzzy matches, any such result is erroneous signal.

The easy and obvious fix for this is to split the user’s query by word, then conduct a fuzzy search through the index using all ngrams associated with those words.  Much like we did for fuzzy searches with fuzzy ngrams, we move up one level of abstraction to N searches.  This returns an array of arrays of Words, so we’ll need to flatten the array and then arrange the results somehow.

Here is the sample implementation for this:
Note: as described above, I rewrote some of the methods to make this easier to read (less nested parentheses)

function getRankedWords(word) {
  var ngrams = Object.keys(ngramize({}, word, 0));
  var fuzzyNGrams = ngrams.map(keyOnFuzzy)
    .flatten()
    .unique()
    .filter(function(ngram) {
      return ngrams.indexOf(ngram) === -1;
    }); // only take unique fuzzy ngrams
  var exact = ngrams.map(keyOnHash);
  var fuzzy = fuzzyNGrams.map(keyOnHash);
  return rank(exact, fuzzy, word) || [];
}

// in the search entry point:
if (!textIn || textIn.length < 3) return [];
return textIn.split(' ').map(getRankedWords).flatten();

Ranking result sets

Since each search returns a set of exact and fuzzy matches, this implementation necessitates (as it did for fuzzy matches) another ranking scheme to arrange the results in order of relevance to the user.  In my implementation, I opted for a quick and dirty solution of ranking via spatial locality (I know this is an overloaded term related to caching schemes).  Essentially, based on the number of words of context, if any two results fall within the range, they form the same “result.”  Note that since we are no longer matching 1 word at a time, I created a new class called a Result, which contains the string matching the search result’s context, gave it a new score (just an inverse distance calculation) and also added highlighting to the context.

// Result: has a context and score
var Result = function(context, score, highlighted) {
  this.context = context;
  this.score = score;
  this.highlighted = highlighted;
};

With this, we define the interface for a search result, and are abstracted away from words.  You can see how using abstractions like this can be very powerful.  Instead of words, we may create indices that point to documents.  Then for each document, we can reference the correct index.  In this manner, we can have distributed indices across a cluster and get faster lookups than storing all indices in memory on one machine.

Here’s a sample implementation:

// Rank sets of results
// Algorithm:
// Traverse through the flattened results; if the delta between
// adjacent positions is within $scope.context, combine the results
// and aggregate their scores (i.e. more relevant)
var rankSets = function(sets) {
  function score(distance) { return distance === 0 ? 1 : 1 / (distance + 1); }
  var sorted = sets.sort(Word.prototype.sortByPosition);
  if (!sorted || sorted.length === 0) { return []; }
  var prev = sorted[0];
  var results = [];
  for (var i = 1; i < sorted.length; i++) {
    var curr = sorted[i];
    var _score = score(prev.distance);
    var lowPosition = prev.position;
    var highPosition = lowPosition;

    while (curr && curr.position - prev.position <= $scope.context) { // spatially close
      i++;
      highPosition = curr.position;
      _score += score(curr.distance);
      prev = curr;
      curr = sorted[i];
    }
    var _context = $scope.dict.slice(lowPosition - $scope.context,
      highPosition + $scope.context).join(' ');
    results.push(new Result(_context, _score, highlight($scope.textIn, _context)));
    prev = curr;
  }
  return results.sort(Result.prototype.sortByScore);
};

You can see that the ranking algorithm is simple.  I start by sorting the array of Words by position in the document.  Recall that position in a Word is a pointer to its location in the document.  Since we already computed a distance metric per Word compared with the corresponding word in the user’s query, we can reuse that computation to form a “score.”

The score method itself is also very simple.  It’s just an inverse distance metric, such that exact matches return a score of 1, whereas imprecise matches return a score of 1/(distance + 1).

Finally, as I traverse through the array of Words, when I find two Words whose positions are within the context distance of each other, I extend the high and low positions used to compute the context within the document, and aggregate the scores of the adjacent Words.  If a Word has no neighbors within that range, the most recent positions are used to compute the Result.  In the case where the Word is isolated, this just returns the same context as before.  Where a Word had neighbors within range, this returns a context with a max range of (context * 2 + (highPosition – lowPosition)) words.  We could truncate this to make all contexts the same size (but I opted not to).

Finally, since we’ve been iterating through our results and combining Words to make new contexts, we have to sort the set by score.

Highlight all the things

Highlighting is also a pretty easy add.  When I instantiate a Result and push it into our results array, I have all the information I need to compute whether the result should contain highlighting or not.  Simply by breaking apart the user’s query and using a string replace, we can return an HTML fragment ready to be rendered by the browser.

Sample implementation:

var highlight = function(textIn, context) {
  return context.replace(
    new RegExp("(" + textIn.split(' ').join('|') + ")", "gi"),
    "<mark>$1</mark>");
};

That’s pretty much all it requires with this implementation. Further optimizations:

• As this series is only meant to demonstrate some basic textual search techniques, it still doesn’t optimize for performance.  We are still using extra space in our ngram index (storing each word) and taking O(n^2) time to intersect the arrays of matches.  It would be better to keep our indices sorted so that intersection is faster.
• Distributed indices would be a good idea.  I briefly mentioned this above, but we have all the tools necessary such that we can also create indices of ngrams to documents instead of words.  We can then use these ngram indices to filter out all documents that don’t match a user’s query, then use a pointer to another index to do the textual search within the document.
• Better ranking metrics.  My ranking algorithm is very basic and is based purely on distance and spatial locality.  We could do something more useful, like scoring a Result differently depending on how many words apart each Word was. Also, because we reuse the context variable, our maximum lookup range is small and likely wouldn’t even cover a sentence.
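As a sketch of the first bullet: if each index entry kept its postings (positions) sorted, two arrays could be intersected with two pointers in O(n + m) time instead of O(n·m) (intersectSorted() is a hypothetical helper, not part of the series’ code):

```javascript
// Linear-time intersection of two sorted arrays of positions.
function intersectSorted(a, b) {
  var out = [], i = 0, j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) { out.push(a[i]); i++; j++; }
    else if (a[i] < b[j]) { i++; }
    else { j++; }
  }
  return out;
}

intersectSorted([1, 4, 7, 9], [2, 4, 9, 11]); // [4, 9]
```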

Here’s the full implementation so you can play around with it: Search Engine Series V – Multi-word matches and highlighting

Search Engine Series IV – Fuzzy Matching

In the last part of the series, I introduced the concept of adding context around search results using pointers to the reference in the original document.  While this works well for exact pattern matching, it fails for scenarios where the user may have mistyped, misspelled or doesn’t type out the entire query.  In this entry, I’ll explain how to add some basic fuzzy matching to your full text search.

Why Fuzzy Search?

Essentially, what we’ve implemented in parts I – III of the search engine series is a less powerful version of grep. Cool in some ways, but we’re trying to achieve a different goal here.  Note that grep allows regular expression matching, which we could implement as well: rather than breaking apart the user’s query into ngrams and keying off our hash to merge the matches, we could iterate through the trigrams in our index to see which matched the user’s regular expression, then merge those matches.
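A rough sketch of that grep-style idea, assuming an index shaped like the trigram hash from earlier posts (matchTrigrams() is a hypothetical helper):

```javascript
// Filter the index's trigram keys by the user's regular expression;
// the matching keys' postings could then be merged as before.
function matchTrigrams(index, pattern) {
  var re = new RegExp(pattern);
  return Object.keys(index).filter(function(trigram) {
    return re.test(trigram);
  });
}

var index = { abc: [1], abd: [2], xyz: [3] };
matchTrigrams(index, '^ab'); // ['abc', 'abd']
```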

Instead, we are implementing a basic version of search through documents, so we want to have some fuzziness feature.  Fuzziness naturally introduces imprecision into textual search; think of it as a dial we can turn up or down.  In order to implement it, we also need a weighting function that compares similarities between strings, to determine whether we are under our fuzziness threshold.

I’m going to use a common distance metric called Levenshtein distance (but you could substitute your own).  Levenshtein distance essentially measures the number of transformations (add a character, delete a character, or change a character) to transform string A into string B. Example:

String A: abcdeeefg

String B: accdefg

String B can be converted into String A as follows (shown color-coded in the original post): substitute the first ‘c’ in String B for ‘b’, then add ‘ee’ after the first ‘e’.  In other words, String A and String B have a Levenshtein distance of 3.  Random shoutout: you can use my other website http://www.dockumo.com to diff two strings and see this type of comparison!  Coolness!!  (Note: the implementation is completely different.)

Back to Levenshtein distance.  The slow implementation for computing this distance metric is fairly trivial, but conveys the point:

/* What this is doing:
 * Recursively computing Levenshtein distance on substrings of decreasing size.
 * If the characters in the first position are equivalent, distance is 0, else
 * distance is 1.  Note the prime opportunity for optimizing w/dynamic programming.
 */
function levenshtein(a, b) {
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;

  return Math.min(
    levenshtein(a.substring(1), b) + 1,
    levenshtein(b.substring(1), a) + 1,
    levenshtein(a.substring(1), b.substring(1)) + (a[0] === b[0] ? 0 : 1));
}

How to map distance to fuzziness

This is all well and good, but how do we use Levenshtein distance for matching?  Here’s where the science gets a little imprecise (at least in my mind).  There are really two places where we can use Levenshtein distance to compare the search query with our result set.  1 – as we break apart the user’s query into ngrams, we can find other ngrams with a distance < fuzziness, and 2 – when we return a set of results that may be close but no cigar, we determine whether the matching word and the query have a distance < fuzziness.  In my implementation, I do both.

Fuzzy Ngrams

We have to be careful with approach 1 because, depending on the size of our ngram (trigrams in our case), setting the fuzziness too high may leave us overinclusive of the fuzzy ngram matches against which to find our result set.  Proof: any ngram has a maximum Levenshtein distance of n from another ngram of equal size n.  So in my naive implementation, I only select ngrams where the distance <= 1 (note that since our trigrams form a set, the distance will never be 0).

It behooves us to precompute the hash of trigrams to their fuzzy matches, because the alternative would be iterating through all keys in the trigram hash at query time, something far too expensive to do in real time.

var computeFuzz = function(hash, fuzziness) {
  var ngrams = Object.keys(hash);
  $scope.fuzzMap = ngrams.reduce(function(fuzzMap, key) {
    fuzzMap[key] = select(ngrams, function(k) {
      return getEditDistance(k, key) <= fuzziness;
    });
    return fuzzMap;
  }, {});
};
};

Fuzzy Matches

Then, when merging the results, we need to determine how to sort our results.  It turns out we can also use Levenshtein distance here to rearrange the results in an order that makes sense, but this ranking can be whatever you want.  My implementation simply stores and sorts by Levenshtein distance to the user’s query.  If the distance is greater than some fuzziness factor, we can just throw out the result.  Note: In order to speed up the sorting, since we know exact matches are going to be the most precise (yet may have a large distance because the matches are substrings), we always push those to the front of the results and sort the remainder of the results returned from the fuzzy search.

Additionally, because fuzzy matches may overlap with exact pattern matches, we have to transform the search results into a set (i.e. call unique() on the result set).

var rank = function(exact, fuzzy, textIn) {
  var sorted = unique(select(flatten(fuzzy).map(function(word) {
    return new Word(word.value, word.position, getEditDistance(word.value, textIn));
  }), function(word) {
    return word.distance <= $scope.fuzziness;
  }), Word.prototype.uniqueWord).sort(Word.prototype.sortByDistance);
  return unique(merge(exact).concat(sorted), Word.prototype.uniqueWord);
};

Implementation

In my implementation, I added a few details for precomputation/speed.

• In order to figure out what fuzzy trigram matches there are, after I build the initial trigram index, I iterate through the trigrams creating a second map keyed on the trigram to its fuzzy matches.
• Example: “abc” -> [“abd”, “bbc”, …]
• After splitting the user’s query into trigrams, I get the set of trigrams pertaining to a fuzzy match, then filter out the non-fuzzy trigrams.
• I use a faster implementation of Levenshtein distance that takes advantage of dynamic programming.
• When reordering the results, I map a new distance from each potential fuzzy match to the original search query.  I only do this for fuzzy matches, to reduce computation time.
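The faster dynamic-programming implementation mentioned in the bullets above isn’t shown in the post; a standard two-row formulation might look like this (my own sketch, not necessarily the version used in the sample app):

```javascript
// Levenshtein distance via dynamic programming: O(a.length * b.length)
// time, O(b.length) space, instead of the exponential recursion above.
function levenshteinDP(a, b) {
  var prev = [], curr = [];
  for (var j = 0; j <= b.length; j++) { prev[j] = j; }
  for (var i = 1; i <= a.length; i++) {
    curr[0] = i;
    for (var j = 1; j <= b.length; j++) {
      curr[j] = Math.min(
        prev[j] + 1,      // deletion
        curr[j - 1] + 1,  // insertion
        prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)); // substitution
    }
    prev = curr.slice();
  }
  return prev[b.length];
}

levenshteinDP('abcdeeefg', 'accdefg'); // 3, matching the example above
```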

See the full implementation here: Search Part IV: Fuzzy Search

Disclaimer: As usual, things could be sped up and optimized, but I tried to keep things simple (e.g., using a simple distance calculation and simple ranking algorithm) for readability. Fuzzy matching sample. Try playing around with the fuzziness tolerance. Also: I’m getting fancy with CSS. Watch out!

Considerations

• There is a large variety of distance formulae out there.  You might get better results with one or another.
• Trigrams are pretty standard for full text search, but if fuzziness is a real concern, it might be worth trying quadgrams with a distance of two.  However, I suspect there would just be too many quadgrams and the search would slow down.

Search Engine Series Part III – Contextual Search

In part II of this search series, I covered exact pattern matching.  Through a clever use of array intersection, we were able to eliminate false positives/irrelevant results when matching a user’s query.  However, that ability in itself is not very useful if context cannot be provided around the search.  In part I, I covered generating your ngram index.

In this post, I discuss how to provide context (i.e. words before and after) with each “hit”.

How to do it

Again, as with adding exact pattern matching, adding context to your search isn’t difficult.  Before, our ngram data structure keyed a trigram to a set of words matching the ngram.  However, we lost the place within the original document where the word could be found.  If we follow that approach, for every hit returned from our search, we would have to traverse the entire document for matches, then return those results.  Not good.

Instead, it’s better to keep pointers to the relevant word in the document.  Since words can be repeated per document, we can simply pass in the index of the split dictionary into a new Word object.  Here’s the class:

// Word: has a string word and an int position
var Word = function(value, position) {
  var self = this;
  this.value = value;
  this.position = position;
  this.equals = function(word2) {
    return self.value == word2.value && self.position == word2.position;
  };
};

Then we need to modify the ngramize method to adhere to our new data structure.  Instead of pushing references to the word onto our array, we need to instantiate a new Word() object for each ngram match.

// NGramize a word
function ngramize(hash, word, index) {
  for (var i = 3; i < word.length + 1; i++) {
    var s = word.substring(i - 3, i);
    if (!hash.hasOwnProperty(s)) {
      hash[s] = [new Word(word, index)];
    } else {
      // todo: still eliminate dupes
      hash[s].push(new Word(word, index));
    }
  }
  return hash;
}

Note that this method has remained nearly the same as in the previous implementation; only the stored values have changed from strings to Word objects.
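The `//todo` comment above hints at duplicate elimination: a word like “banana” contains the trigram “ana” twice, so the same Word would be pushed twice for one ngram. One hedged way to handle this (the helper name `pushUnique` is my own, not from the post) is to check for an equal Word before pushing:

```javascript
// Hypothetical helper for the //todo above: only push a Word if no
// equal Word (same value and position) is already in the array.
// Relies on the equals method defined on Word.
function pushUnique(array, word) {
  var isDupe = array.some(function(existing) {
    return existing.equals(word);
  });
  if (!isDupe) array.push(word);
  return array;
}
```

The check is O(n) per push; with the sorted-array improvement discussed later, it could become a binary search instead.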

Next, we need to rewrite merge to intersect all arrays of matches based on Word equality. To do this, I wrote a simple equals method that lets one Word be compared to another. A Word is equivalent to another Word if the value and position are identical (i.e. they refer to the same place within the document).
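The intersect helper itself isn’t shown in the post; a minimal sketch of what it might look like, using the equals method above (the helper’s name and exact shape here are assumptions):

```javascript
// Hypothetical sketch of the intersect helper that merge() relies on:
// keep only the Words in arrayA that have an equal Word in arrayB.
function intersect(arrayA, arrayB) {
  return arrayA.filter(function(wordA) {
    return arrayB.some(function(wordB) {
      return wordA.equals(wordB);
    });
  });
}
```

Since intersect returns a plain array of Words, it can be folded over the remaining arrays with reduce, which is exactly how merge uses it.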

Finally, our merge method is rewritten to take advantage of this intersect method.

var merge = function(arrays) {
  if (!arrays || !arrays.length) return [];
  var base = arrays.pop();
  return arrays.reduce(intersect, base);
};

See how the merge method has been simplified? Awesome!

End result

The end result is that instead of getting an array of strings (i.e. word matches) for the user’s query, we now get an array of Words. Since each Word holds a pointer to its position in the original document, all we need to do is take a slice of the original document around that position, sized by how much context the user wants, and join the words together again with spaces. For a sample implementation, see here: Search Part III: Contextual Search
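That slicing step can be sketched roughly as follows (the function and parameter names are my own, and I assume the original document has been split on spaces into an array of words):

```javascript
// Hedged sketch: given the original document split on spaces, pull
// `radius` words on either side of a matched Word's position.
function getContext(documentWords, position, radius) {
  var start = Math.max(0, position - radius);                    // clamp at document start
  var end = Math.min(documentWords.length, position + radius + 1); // clamp at document end
  return documentWords.slice(start, end).join(' ');
}
```

Here `radius` is the number of context words the user wants on each side of the hit.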

Disclaimer: Again, this is just a toy implementation! To productionize properly, you may want to implement a few of the other features mentioned below.

Improving Search Further

If you’re still with me, you’ll find a lot of opportunities for optimization. I haven’t added them to my toy implementation, but they wouldn’t be hard to do. As the original document grows, some of these enhancements become more and more necessary, but I’ve kept things simple for readability.

A few potential areas for improvement:

1. Space efficiency.  Notice how we’re storing the value of the word (i.e. the string) in each Word.  That’s really wasteful!  Words can be long (e.g. “supercalifragilisticexpialidocious”), so it’s much more efficient to store a pointer to the word, just as we store a pointer to its position within the original document.
2. Smarter precomputation of the ngram index. We can sort the arrays for faster intersection!  Though insertion into an array during the initial computation of the ngram hash is O(1) right now (we just push to the end), we can trade slower initial computation for faster recombination later on.  Read: keeping the ngram arrays sorted requires binary insertion of each new Word, which takes O(log n) time per insert.
3. Faster array intersection.  Example: if we store pointers to words instead of the actual strings, and keep each array sorted, all the arrays can be intersected in O(n) time instead of O(n^2).  That’s an awesome speedup!
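Improvements 2 and 3 work together: if each ngram’s array holds sorted word positions rather than strings, two arrays can be intersected in a single two-pointer pass. A hedged sketch of the idea (not part of the toy implementation):

```javascript
// Linear-time intersection of two sorted arrays of word positions.
// A single two-pointer pass replaces the nested scan: O(n + m)
// comparisons instead of O(n * m).
function intersectSorted(a, b) {
  var result = [], i = 0, j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      result.push(a[i]); // present in both arrays
      i++; j++;
    } else if (a[i] < b[j]) {
      i++; // advance the pointer behind the smaller value
    } else {
      j++;
    }
  }
  return result;
}
```

For example, `intersectSorted([1, 3, 5, 7], [3, 4, 5, 8])` yields the positions common to both arrays.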

Note: I’ll implement these improvements at some point in the future, so you’ll have something to look at.

And of course, on top of all these things, we could add run-of-the-mill features like caching, keeping the index in memory, distributing the indices, etc.

Next Up:

Finally, what else can we do to improve search?  Well, right now our search is case-sensitive, which isn’t good.  We’ll want to fix that.  Furthermore, our search requires exact matching and doesn’t allow for “fuzziness” in the query.  I don’t know about you guys, but I have a hard time spelling things correctly.  My fingers move too fast for the keyboard to keep up; it’s a problem stemming from the AIM era, so forgive me 🙂

We can also add contextual highlighting within our results.  This is definitely a big visual help that lets a user know where within the search result the query was found.

See you next time!

Search Engine Series Part II – Exact pattern matching

In Part 1 of the series (here: https://codeandcodes.com/2015/10/04/search-engine-series-basic-ngram-indexing/), I described how to construct ngram indices for a body of text.  We saw, however, that ngram matching, while quick, is imprecise because it matches too greedily.  Because the query was also parsed into ngrams, we ended up with matches the user was not looking for.

In this part II, I discuss how to achieve exact pattern matching.  Relevance is more than a function of matching a query to a list of results; there is a psychological component.  The gist of surfacing relevant data is returning results that are within the realm of expectation for a user.  In other words, even if a result set is a technically correct match for a query, it’s no good if the results fall outside that realm of expectation.  For example, a simple weighting function like Levenshtein distance may yield a technically high-scoring result, but if the minimal string transformations produce an unexpected result (e.g. “and” vs. “end”), the result is no good.  We can minimize this effect by naively enforcing exact pattern matching.

Algorithm

In part I, parsing the user’s query and piping it through the ngram map yielded an array of words matching any ngram. We stripped the array of duplicates and returned the entire set to the user. However, we saw many false positives returned along with the good results. How do we eliminate those false positives?  It turns out that achieving exact pattern matching is pretty simple and elegant. Each ngram of the query (e.g. “governmen”) returned an array of potential matches. In order to eliminate extraneous non-matches (like “fundamentally” or “settlement”, which match on “men”), we simply need to take the intersection of all the result arrays.  This yields a precise match against the string.

var merge = function(arrays) {
  if (!arrays || !arrays.length) return [];
  var matches = {};
  arrays.forEach(function(array) {
    array.reduce(function(matches, word) {
      if (!matches.hasOwnProperty(word)) {
        matches[word] = 1;
      } else {
        matches[word] += 1;
      }
      return matches;
    }, matches);
  });

  // If the word intersects all arrays
  return Object.keys(matches).filter(function(word) {
    return matches[word] === arrays.length;
  });
};

Proof

Why does this work?  With our prior greedy approach, as we parsed the user’s query and filtered our hash, keying on each trigram, we were left with an array of arrays of all potential matches.  While this represented the complete set of matches, it also meant that to qualify as a “result”, a word only needed to contain any one trigram of the query.  Thus, each word within our filtered results represented a partial match of the user’s query.  Ergo, to match a user’s query exactly, each word must match every ngram within the query.  How do we find the list of words matching all trigrams?  Simple: just intersect the arrays.
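To make the argument concrete, here’s a toy walk-through. The candidate lists below are invented for illustration (not taken from the actual corpus): three trigrams of the query “government” each map to the words containing them, and only a word present in every list survives the intersection. This sketch assumes each list stores unique words, as the ngram index does.

```javascript
// Count how many candidate lists each word appears in; a word is an
// exact match iff it appears in every list (count === arrays.length).
// Assumes each list holds unique words.
function intersectAll(arrays) {
  if (!arrays || !arrays.length) return [];
  var counts = {};
  arrays.forEach(function(array) {
    array.forEach(function(word) {
      counts[word] = (counts[word] || 0) + 1;
    });
  });
  return Object.keys(counts).filter(function(word) {
    return counts[word] === arrays.length;
  });
}

var gov = ["government", "governors"];                           // candidates for "gov"
var men = ["government", "fundamentally", "settlement", "men"];  // candidates for "men"
var ent = ["government", "settlement", "entitle"];               // candidates for "ent"
// intersectAll([gov, men, ent]) → ["government"]
```

Only “government” appears in all three lists, so it is the only exact match returned.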

See the implementation here: Part II: Exact Pattern Matching

Usual disclaimer applies: Keeping it simple for the sake of explanation, could be refactored, yadda yadda yadda.

Now we’re getting somewhere.  However, our search is still incomplete.  It’s not useful to simply return the word that matches the user’s query.

• We want to return some context (i.e. the words before and after) the exact pattern match so the user can determine whether the result is useful or not.
• We can also optimize our data structure for computeNGrams.  Right now we store only unique matches per ngram, which was OK for determining whether there is an exact match.  However, in order to store context, we can no longer squash duplicates.
• We can optimize performance as well in order to still provide fast matches in the above bullet.

Stay tuned for part III.

Search Engine Series Part I – Basic ngram indexing

At Twitter, I’ve been spending a lot of time thinking about search engines and how to effectively index and surface data.  When it comes down to it, indexing methods mean a lot, but so do some other “brute force” factors like:

1. the quality of the data indexed
2. the amount of data indexed
3. transforming search queries through an effective DSL
4. surfacing and ranking search results

Search has no silver bullet

Perhaps I’ve been learning more and more that there’s no silver bullet to search.  There are several methods of indexing volumes of data (e.g. string tokenization, ngram tokenization); levels of strictness of search (e.g. exact matches, fuzzy matches, partial matches); and transformations of the search query itself (e.g. which indices to search, which parts of a query must be included in the results, fuzzy matching and misspellings).

In this next series of posts, I’ll be describing some search basics that may be useful to the reader.

Computing NGram indices

Ngram indices are a good starting point for creating a searchable index of words.  I’ve included an example implementation here: Part 1: Constructing a Trigram index

Disclaimer: In this example, I use a toy implementation and trigrams for the sake of simplicity.  For the dictionary “corpus”, I use the Declaration of Independence.  The code examples are meant to only illustrate the basic implementation details and aren’t meant for production use (however, the code will be refined over time).

Constructing ngram indices is very simple – essentially one constructs a map of n-character length substrings from a dictionary of words, where each ngram maps onto the array of matching full words.  Different techniques can then be applied on top of this, including

1. Scoring functions – for example, for the number of matching words in a document, take a basic distance b (where b is the # characters that match)
2. Fuzzy matching – replace one character for each character of the trigram to generate a list of other potential matches
3. Context – keep pointers to place in document to determine the contextual sentence

// Compute NGrams
var computeNGrams = function(dict) {
  return dict.split(' ').reduce(function(hash, word) {
    return ngramize(hash, word);
  }, {});
};

// NGramize a word
function ngramize(hash, word) {
  for (var i = 3; i < word.length + 1; i++) {
    var s = word.substring(i - 3, i);
    if (!hash.hasOwnProperty(s)) {
      hash[s] = [word];
    } else {
      if (hash[s].indexOf(word) == -1) {
        hash[s].push(word);
      }
    }
  }
  return hash;
}

Algorithm

You can see that creation is simple:  you pass in each word and traverse through it, creating all substrings of length N, where N is the “N” in “NGram”.  In this case, I’ve chosen trigrams, or NGrams of length 3.

As the user types in his/her query, we break that query apart into ngrams as well, and match against our ngram map (i.e. the corpus of data).  This gives us an array of arrays of matching words per query.  Then we simply flatten the array, remove duplicates and present the results to the user.
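The query side described in that paragraph can be sketched roughly as follows (the `search` name is my own; the index argument is the hash produced by computeNGrams above):

```javascript
// Hedged sketch of the query side: break the query into trigrams,
// look each one up in the precomputed index, flatten the matching
// arrays, and remove duplicates before presenting results.
function search(index, query) {
  var results = [];
  for (var i = 3; i < query.length + 1; i++) {
    var gram = query.substring(i - 3, i);   // same trigram windowing as ngramize
    if (index.hasOwnProperty(gram)) {
      results = results.concat(index[gram]); // flatten matches for this trigram
    }
  }
  // remove duplicates: keep only the first occurrence of each word
  return results.filter(function(word, pos) {
    return results.indexOf(word) === pos;
  });
}
```

Note that this is the greedy version: any word matching any trigram of the query is returned, which is exactly the imprecision Part II fixes with array intersection.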

Example: