Search Engine Series Part II – Exact pattern matching

In Part 1 of the series (here: https://codeandcodes.com/2015/10/04/search-engine-series-basic-ngram-indexing/), I described how to construct ngram indices for a body of text.  We saw however, that ngram matching while quick, is imprecise because it matches too greedily.  By parsing the query also into ngrams, we ended up getting matches that the user would not have been looking for.

In this part II, I discuss how to achieve exact pattern matching.  Relevance is more than a function of matching a query to a list of results.  There is a psychological component.  The gist of surfacing relevant data thus means returning results that are in the realm of expectations for a user.  In other words, while a result set can be filtered in a technical sense to a set of results to return, it’s no good if a technical match is out the realm of expectation for the user.  For example, using a simple weighting function like Levenshtein distance may yield a technically high scoring result, but if the minimal string transformations yield an unexpected result, the result is no good (e.g. “and” and “end”).  We can minimize the effect by naively enforcing exact pattern matching.

Algorithm

In part I, parsing the users query and piping it through the ngram map yielded an array of matching words to any ngram. We stripped the array of duplicates and returned the entire set to the user. However, we saw many false positives returned along with the good results. How do we eliminate those false positives?  It turns out that achieving exact pattern matching is pretty simple and elegant. Each ngram array returned (e.g. for the query “governmen”) included an array of results that could be potential matches. In order to eliminate extraneous non-matches (like “fundamentally” or “settlement”) which match on “men”, we simply need to take the intersection of all result arrays.  This yields a precise match against the string.

var merge = function(arrays) {
        var returnArray = [];
        if (!arrays) return [];
        var matches = {};
        arrays.forEach(function(array) {
            array.reduce(function(matches, word) {
                if (!matches.hasOwnProperty(word)) {
                    matches[word] = 1;
                } else {
                    matches[word] += 1;
                }
                return matches;
            }, matches);
        });
        
        //If the word intersects all arrays
        returnArray = select(Object.keys(matches), function(word) {
            return matches[word] === arrays.length;
        });
        return returnArray;
    };

Proof

Why does this work?  Essentially, with our prior greedy approach, as we parsed the user’s query and filtered our hash, keying on each trigram that was created, we were left with an array of arrays of all potential matches.  While this represented the complete set of matches, this also created situations where to qualify as a “result”, the word only needed to contain any trigram in the query.  Thus, each word within our filtered results represented in a partial match of the user’s query.  Ergo, to match a user’s query exactly, each word must match every ngram within the user’s query.  How did we find the list of words matching all trigrams within the user’s query?  Simple, just intersect the arrays.

See the implementation here: Part II: Exact Pattern Matching

Usual disclaimer applies: Keeping it simple for the sake of explanation, could be refactored, yadda yadda yadda.

Much better matching
Much better matching

Now we’re getting somewhere.  However, our search is still incomplete.  It’s not useful to simply return the word that matches the user’s query.

  • We want to return some context (i.e. the words before and after) the exact pattern match so the user can determine whether the result is useful or not.
  • We can also optimize our data structure for computeNGram.  Right now we store only unique matches per ngram, which was OK when determining if there is an exact match.  However, in order to store context, we can no longer quash duplicates.
  • We can optimize performance as well in order to still provide fast matches in the above bullet.

Stay tuned for part III.

Advertisement

One thought on “Search Engine Series Part II – Exact pattern matching

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s