Note: This post is based on part 1 which you should read first, otherwise you might be lost.
Fuzzy search
We talked about how to create Wicket autocomplete dropdowns in the previous post. Now we want to improve the returned suggestions by adding fuzzy search. Returning the expected results even when the user has made a slight spelling mistake in their search term is known as fuzzy search. For example, suppose the user enters “Wicked” as the search term, we still want to return “Wicket” as a result if it’s in our data set.
You will see this kind of behavior all the time in search engines, however, the approach we will discuss here is much less complex 馃檪
Levenshtein distance
One such algorithm for doing this is called Levenshtein distance. The “distance” part of the name is referring to the distance between the search term and the string you want to match to. The distance is defined by how many insertions/deletions/substitutions of characters you need to make to the search term in order to reach the target string.
In our earlier example, the distance between the search string “Wicked” and the target string “Wicket” is 1 because we only need to make one substitution, replace “d” with “t”. So we can see that a distance of 1 is a very close match, and the greater the distance, the worse of a match it is.
MySQL implementation
A big thanks to Jon LaBelle for providing the implementation, it was used only with minor changes. You can find the script for creating the functions here.
You will find 3 MySQL function definitions:
- levenshtein_match_all – This will take in our search term, the column to search against, and the maximum distance.聽 If the distance is above the max, the result will not be returned.聽 It will take our search term, split it into words, and for each word, try to match it.聽 This is the function that will actually be called by our app.
- levenshtein_match – This will take in a single word from the above function and try to match it to each word in the target string.
- levenshtein – This is the main function which calculates the distance between two words.
Changes to autocomplete component
The main change to our SearchService is going to be how to retrieve the suggestions. We first try to search using our normal string matching approach. If there are no results, than we try the fuzzy approach. The reason we do this is because fuzzy search can be much slower than doing a regular LIKE string search.
We also add a max distance parameter to our AutocompleteFilters class. This way we can adjust how strict/loose the fuzzy matching will be.
One thing to note in relation to Hibernate: In order to use our MySQL functions from HQL, we need to define a custom dialect. In the class CustomMysqlDialect, we register a Hibernate function which maps to the MySQL function.
Performance limitations
Unfortunately now for the bad news… The approach presented here is somewhat slow and should be used for relatively small data sets. The good news is there are ways to improve it:
- Jozef Jarosciak has written about using the SOUNDEX function.聽 The idea would be to filter out potential matches by first using the fast soundex function, and then apply Levenshtein distance.
- A few people have talked about using native compiled user-defined functions written in C++ that perform much faster than SQL functions.
Wicket test page
As always you can find a full working app to play around with on GitHub. Enjoy!