Can strings be compared fuzzily?
It’s Thanksgiving week and you’re putting the finishing touches on your new e-commerce site, just in time for Black Friday. You’ve just gotten the product search functionality working, so you kick back in your chair and relax a little, dreaming of turkey and mashed potatoes.
When suddenly, you remember that turkey makes people sleepy! How’re your customers supposed to search for products on your site if they’re half-awake, groggily misspelling and mis-remembering the names of the things they want?
Fuzzy-compare to the rescue!
String compare functions (SCFs) are the core of any search function, allowing you to find matches to an arbitrary query. SCFs take two strings as arguments:
- a
needle
, and - a
haystack
A normal SCF will return true if and only if the needle
is exactly identical to the haystack
. This type of SCF is very unforgiving; a single spelling error in the needle
will render your code blind to the user’s intended quarries.
In contrast, a fuzzy SCF will return true even if the needle
and haystack
are not exactly identical; they will return true as long as the two strings are more similar than a certain similarity threshold. This type of SCF will forgive spelling mistakes and typos, allowing even the most drowsy of users to find what they seek. Unfortunately, they are also inherently heavier than a normal SCF, and if you are not careful, will slow your site’s search down to a unbearable crawl.
There are many ways to define how “similar” two strings are to each other. Most methods try to calculate the edit distance between them, but these usually carry heavy computation costs and sometimes involve advanced information theory. You need something fast and light. A particular programmer called Vyacheslav Egorov once encountered the same exact problem, and the following algorithm is derived from his genius solution:
// Mike Wang | 2017 | MIT license
function strcmpFuzzy(needle, haystack, threshold) {
var h_len = haystack.length;
var n_len = needle.length;
if (n_len > h_len) {
return false;
}
else if (n_len === h_len) {
return (needle === haystack);
}
else {
var h = 0;
outer: for (var n = 0; n < n_len; n++) { // loop through each char in needle
var n_ch = needle.charCodeAt(n);
var j = 1;
while (h < h_len) { // loop through each char in haystack, picking up where last outer loop left off
if (haystack.charCodeAt(h) === n_ch) {
continue outer; // break inner loop and start new outer loop
}
if (j > threshold) {
return false; // only occurs if threshold or more haystack characters were skipped at once
}
j++; // increments occur only if chars do not match
h++;
}
return false; // only occurs once inner loop has reached end of haystack
}
return true; // only occurs if each needle char found a haystack match
}
}
This function will return true
if and only if:
- each character in the
needle
also exists in thehaystack
, AND - the characters in the
needle
exist in the same order in thehaystack
, AND - no more than
threshold
characters are skipped in thehaystack
at once
So, for example:
strcmpFuzzy('hard drives', 'hard drives', 3); // true
strcmpFuzzy('had drives' , 'hard drives', 3); // true
strcmpFuzzy('had dives' , 'hard drives', 3); // true
strcmpFuzzy('had dies' , 'hard drives', 3); // true
strcmpFuzzy('had des' , 'hard drives', 3); // false
strcmpFuzzy('had des' , 'hard drives', 4); // true
strcmpFuzzy('lard hives' , 'hard drives', 3); // false
strcmpFuzzy('dard hrives', 'hard drives', 3); // false
There! With a fuzzy SCF added to your site’s product search, you finally can rest easy that your customers will be able to find the products they desire, even after a third helping of turkey. At least, until they break out the wine…