Search Engine Series IV – Fuzzy Matching

In the last part of the series, I introduced the concept of adding context around search results using pointers to the reference in the original document.  While this works well for exact pattern matching, it fails for scenarios where the user may have mistyped, misspelled or doesn’t type out the entire query.  In this entry, I’ll explain how to add some basic fuzzy matching to your full text search.

Why Fuzzy Search?

Essentially what we’ve implemented in parts I – III of the search engine series is a less powerful version of grep. Cool in some ways, but we’re trying to achieve a different goal here.  Note that grep allows regular expression matching which we could implement as well.  The way this would work is rather than breaking apart the user’s query into ngrams then keying off our hash to merge the matches, we could iterate through the trigrams we found to see which matched the user’s regular expression, then merge the matches.

Instead, we are implementing a basic version of search through documents, so we want to have some fuzziness feature.  Fuzziness naturally introduces imprecision into textual search.  If we consider fuzziness a dial which we can turn up or down, In order to implement it, we also need to have a weighting function that will compare similarities between strings to determine whether we are under our fuzziness threshold.

I’m going to use a common distance metric called Levenshtein distance (but you could substitute your own).  Levenshtein distance essentially measures the number of transformations (add a character, delete a character, or change a character) to transform string A into string B. Example:

String A: abcdeeefg

String B: accdefg

String B could be converted into String A by doing the following:

abcdeeefg

Where red = deletion, blue = addition; in other words, String A and B have a Levenshtein distance of 3 (substitute first ‘c’ for ‘b’, add ‘ee’ after the first ‘e’ in String B).  Random Shoutout: you can use my other website http://www.dockumo.com to diff two strings and see this type of comparison!  Coolness!!  (note, the implementation is completely different)

Back to Levenshtein distance.  The slow implementation for computing this distance metric is fairly trivial, but conveys the point:

/* What this is doing:
* Recursively computing levenshtein distance on substrings of decreasing size.
* If the characters in the first position are equivalent, distance is 0, else 
* distance is 1.  Note the prime opportunity optimizing w/dynamic programming.
*/
function levenshtein(a, b) {
    if (a.length === 0) return b.length;
    if (b.length === 0) return a.length;
    
    return Math.min(
        levenshtein(a.substring(1), b) + 1,
        levenshtein(b.substring(1), a) + 1,
        levenshtein(a.substring(1), b.substring(1)) + (a[0] === b[0] ? 0 : 1));
}

How to map distance to fuzziness

This is all well and good, but how do we use Levenshtein distance for matching?  Here’s where the science gets a little imprecise (at least in my mind).  There are really two places where we can use Levenshtein distance to compare the search query with our result set.  1 – as we break apart the user’s query into ngrams, we can find other ngrams with a distance < fuzziness, and 2 – when we return a set of results that may be close but no cigar, we determine whether the matching word and the query have a distance < fuzziness.  In my implementation, I do both.

Fuzzy Ngrams

We have to be careful with 1, because depending on the size of our ngram (trigrams in our case), if the fuzziness is set too high, we may just end overinclusive of the fuzzy ngram matches against which to find our result set.  Proof: Any ngram has a max Levenshtein distance with another ngram of equal size of n.  So in my naive implementation, I only select ngrams where the distance <= 1 (note that since our trigrams form a set, the distance will never == 0).

It behooves us to precompute the hash of trigrams to their fuzzy matches, because the alternative would be iterating through all keys in the trigram hash -something very expensive to do in real time.

    var computeFuzz = function(hash, fuzziness) {
        var ngrams = Object.keys(hash);
	$scope.fuzzMap = ngrams.reduce(function(fuzzMap, key) {
            fuzzMap[key] = select(ngrams, function(k) {
                return getEditDistance(k, key) <= fuzziness;
            });
            return fuzzMap;
        }, {});
    };

Fuzzy Matches

Then, when merging the results, we need to determine how to sort our results.  It turns out we can also use Levenshtein distance here to rearrange the results in an order that makes sense, but this ranking can be whatever you want.  My implementation simply stores and sorts by Levenshtein distance to the user’s query.  If the distance is greater than some fuzziness factor, we can just throw out the result.  Note: In order to speed up the sorting, since we know exact matches are going to be the most precise (yet may have a large distance because the matches are substrings), we always push those to the front of the results and sort the remainder of the results returned from the fuzzy search.

Additionally, because fuzzy matches may overlap with exact pattern matches, we have to transform the search results into a set (i.e. call unique() on the result set).

var rank = function(exact, fuzzy, textIn) {
        var sorted = unique(select(flatten(fuzzy).map(function(word) {
            return new Word(word.value, word.position, getEditDistance(word.value, textIn));
        }), function(word) {
            return word.distance <= $scope.fuzziness;
        }), Word.prototype.uniqueWord).sort(Word.prototype.sortByDistance);
        return unique(merge(exact).concat(sorted), Word.prototype.uniqueWord);
    };

Implementation

In my implementation, I added a few details for precomputation/speed.

  • In order to figure out what fuzzy trigram matches there are, after I build the initial trigram index, I iterate through the trigrams creating a second map keyed on the trigram to its fuzzy matches.
    • Example: “abc” -> [“abd”, “bbc”, …]
  • After splitting the user’s query into trigrams, I get the set of trigrams pertaining to a fuzzy match, then filter out the non-fuzzy trigrams.
  • I use a faster implementation of Levenshtein distance that takes advantage of dynamic programming.
  • When reordering the results, I map a new distance from each potential fuzzy match to the original search query.  I only do this for fuzzy matches, to reduce computation time.

See the full implementation here: Search Part IV: Fuzzy Search

Disclaimer: As usual, things could be sped up and optimized, but I tried to keep things simple (e.g., using a simple distance calculation and simple ranking algorithm) for readability.

Fuzzy Matching example
Fuzzy matching sample. Try playing around with the fuzziness tolerance. Also: I’m getting fancy with CSS. Watch out!

Considerations

  • There is a large variety of distance formulae out there.  You might get better results with one or another.
  • Trigrams are pretty standard for full text search, but if fuzziness is a real concern, then it might be worth trying with quadgrams and a distance of two, however, I suspect that there would just be too many quadgrams and the search would slow down.
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s