Search Series Part VII – Autocomplete with Wild Cards

In the last post, I discussed using prefix tries for fast autocomplete of words within a dictionary.  In this post, I take it one step further and share some insights that we can glean from prefix tries on how to expand the autocomplete functionality by allowing wildcard search.  You can see the sample implementation here:

Search Part VII – Autocomplete with wild cards

If you recall, prefix trie construction required linear time for construction, O(n) time where n is the number of words inserted into the trie.  We read in each word from the dictionary, and inserted each character starting from the root.  From the prefix trie, we were able to traverse the trie for any prefix, then with DFS or BFS collect all substrings stemming from the node to which we had traversed.  In other words, we were able to quickly find all strings with a prefix of m.

What if we flipped our intuition around and wanted to find all words that contained a particular suffix?  E.g. “ing”, “tion”, “lion”?  We can use the same approach for building the prefix trie to build a suffix trie.

Suffix Trie

A suffix trie follows the same structure as a prefix trie, with nodes each containing a value of one character and a set of child nodes each pointing to a subtrie or to null.  Taking the insight from the prefix trie, we can intuitively see that word’s suffix is essentially a prefix if the word is reversed:  Hence:

“abcdefg” -> “g”, “fg”, “efg”, “defg” etc are all suffixes.  If we reverse the word,

“gfedcba”, we can construct the prefix trie out of the map of reversed words.  So in this case, “g” becomes the root, followed by “f”, “e”, “d”, and so on.

Construction of the suffix trie is trivial, once we have the code in place for constructing the prefix trie.  From our dictionary, all that is needed is to generate a dictionary using each word in reverse, then pass it into our trie constructor.

/* Example: $dictionary starts as  
* $scope.dictionary = {
*     'aardvark': '...',
*     'actor':  '...',
*     'adulation': '...'
* }
*/
        $scope.inverseDictionary = Object.keys(response.data).reduce(function(memo, key) {
            memo[key.reverse()] = response.data[key];
            return memo;
        },{});
/* $dictionary becomes:
* {
*    'kravdraa': '...',
*    'rotca': '...',
*    'niotaluda': '...'
* }
*/

    	$scope.root = toTrie(Object.keys($scope.dictionary));
        $scope.reverseRoot = toTrie(Object.keys($scope.inverseDictionary));

Suffix/Prefix Trie Insights

Note that a suffix trie is essentially a subset of a “suffix tree” which contains pointers to every suffix of every word inserted into the data structure.  Compare this to the simpler approach I’ve taken here which just contains each word in the dictionary.

Now when the user types in a query into the searchbox, we can scan our prefix trie to get a set of all words in the dictionary where the query is a prefix, and simultaneously we can also retrieve a set of all words in the dictionary where the query is a suffix as well (our set of results will be reversed, so we need to reverse them again).  This extends our base functionality with allowing users to specify a wildcard (e.g. ‘*’) in the query string which we can use to scan both tries.  I won’t talk about the trie traversal here, which you can see an implementation of in my past post: Search Series Part VI – Autocomplete & Prefix Tries

Example:

Prefix trie traversal - "act*", would yield ["actor", "acting", etc.]
Suffix trie traversal - "*act", would yield ["exact", "enact", etc.]

Finally, by maintaining both a prefix trie and a suffix trie, we can implement basic intermediate wildcard functionality (up to 1 wildcard).  All we need to do is find the set of prefix results and suffix results and intersect the arrays.

As an aside: why doesn’t this approach support multiple wildcards (e.g. int*me*tion, which would yield [“intermediation”, “intermention”, etc.])?  The answer is the implementation of the suffix trie, which only contains the entire word (as opposed to all suffixes).  If we tokenize int*me*tion into [“int”, “me”, “tion”], we see that me will not return any prefix/suffix results that we expect because with our given prefix and suffix trie, we can only identify words that either begin with “me” or end with “me”, not have “me” somewhere in the middle.  To add this functionality, the implementation would need to be expanded to full blow suffix trees.  (Alternatively, we could also use something more powerful like ternary search trees).

Example:

Prefix & Suffix Tree intersection: "inter*tion", yields ["intersection", "interaction", etc]

(i.e. all matching words were inter is a prefix and tion is a suffix)

Here’s a sample implementation:

var wildcard = function(textIn) {
        if (!textIn) { return; }
        var substrs = textIn.split('*');
        var p = textIn[0] === '*' ? 1 : 0;
        var s = textIn[textIn.length-1] === '*' ? 1 : 0;
        var prefixes = substrs.length > 1 ? substrs.slice(p, substrs.length - 1 + s) : substrs;
        var suffixes = substrs.length > 1 ? substrs.slice(1 - p, substrs.length - s) : substrs;
        
        $scope.results = _(prefixes).map(function(prefix) {
            return $scope.autocomplete(prefix, $scope.root);
        }).flatten().value();
        $scope.reverseResults = reverse(_(suffixes).map(function(suffix) {
            return $scope.autocomplete(suffix.reverse(), $scope.reverseRoot);
        }).flatten().value());
        $scope.intersectedResults = _.intersection($scope.results, $scope.reverseResults);
    };
Example wildcard search
Example wildcard search (inter*tion)
example wildcard search
Example wildcard search (am*ing)
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s