While waiting to find out if my hosting provider will change the minimum fulltext word length for MySQL, here’s what I’ve done in the meantime to deal with viable three-character search terms.
First, I split the search string into the component words (an array). I subtract any stopwords (I’ve got a big list) and for any remaining words that are under four characters long, I add to the SQL query I’m running.
Here’s the basic form of the query that I’m running, say searching for “porter”:
SELECT *, MATCH(body) AGAINST('porter') AS relevance FROM content WHERE MATCH(body) AGAINST('porter') AND [additional conditions] ORDER BY relevance DESC LIMIT 10
This uses fulltext indexing to search for “porter” with weighted relevance, and returns the appropriate content and its relevance score. Pretty straightforward, and it works really well.
Here’s what the modified query looks like, if there’s short words present, for the search “porter php”:
SELECT *, MATCH(body) AGAINST('porter') + (1 / INSTR(body, 'php') + 1 / 2[position of word in string]) AS relevance FROM content WHERE ( MATCH(body) AGAINST('porter') OR body REGEXP '[^a-zA-Z]php[^a-zA-Z]' ) AND [additional conditions] ORDER BY relevance DESC LIMIT 10
Two new things are happening. First, in the WHERE clause, I’m using both the fulltext system to find “porter” and using a regular expression search for “php.” Why REGEXP and not LIKE? Because if I write LIKE '%cow%'
for instance, I’ll not only get “cow” but also “coworker” and other wrong matches. A regular expression lets me filter those scenarios out.
That takes care of finding the words, but I also wanted to tie them into relevance, somehow. The solution I hit upon in the above SQL is relatively simple, and does the trick well enough for my tastes. Basically, the sooner the word appears in the content, the higher its relevance, which is reflected in the inverse of the number of characters “deep” in the content it appears. And I wanted to fudge the number a bit more by weighting the position of the keyword in the search string; the sooner the keyword appears, the higher the relative score it gets.
It’s not perfect, and I definitely wouldn’t recommend using this method on a sufficiently large dataset, but for my short-term needs it works just fine. The only thing really missing in the relevance factoring is how many times the keyword appeared in the content, but I can live without that for now.