Searching and Minimum Word Length

Mike Boone, in the comments section of yesterday’s entry on searching (“Updated Search”), correctly points out that searching my site for a word shorter than four characters (like “php” or “cow”) does not work: no results are returned. Obviously, since I write about PHP on occasion, this is untenable.

The problem is that MySQL’s fulltext indexing, by default, only indexes words that are at least four characters long (the ft_min_word_len server setting), and I don’t think I have any way to change this, despite my initial reply to Mike’s comment. This site runs on a shared server at pair.com, and I have no control over the MySQL server configuration. I might post a question to their tech support, but I’m not overly optimistic about the response. So, what to do?

Short term, here’s my solution (though it’s not implemented yet): examine each word in the search string, throw out stopwords (like “the,” “and,” “so,” etc.), and for any remaining word shorter than four characters, do a LIKE search against the content for it. No, it’s not ideal, but it’s a patch. Comments?
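
To make the idea concrete, here is a minimal sketch of how the search string might be split, written in Python rather than PHP purely for illustration. It is not the code running on this site. The table name `entries`, the column name `content`, the tiny stopword list, and the choice to combine clauses with AND are all assumptions; the `%s` placeholders follow the parameter style common to Python MySQL drivers.

```python
import re

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "so", "a", "an", "of", "to", "in", "is", "it", "or"}
MIN_FULLTEXT_LEN = 4  # MySQL's default minimum indexed word length


def split_terms(query):
    """Return (long_words, short_words) with stopwords dropped."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    words = [w for w in words if w not in STOPWORDS]
    long_words = [w for w in words if len(w) >= MIN_FULLTEXT_LEN]
    short_words = [w for w in words if len(w) < MIN_FULLTEXT_LEN]
    return long_words, short_words


def build_search_sql(query, table="entries", column="content"):
    """Build a parameterized query: fulltext for long words, LIKE for short ones.

    The table and column names are assumed for this sketch.
    """
    long_words, short_words = split_terms(query)
    clauses, params = [], []
    if long_words:
        # One fulltext clause covers every word the index can see.
        clauses.append(f"MATCH({column}) AGAINST (%s)")
        params.append(" ".join(long_words))
    for word in short_words:
        # Words contain only letters and digits, so no LIKE wildcards to escape.
        clauses.append(f"{column} LIKE %s")
        params.append(f"%{word}%")
    if not clauses:
        return None, []
    return f"SELECT id FROM {table} WHERE " + " AND ".join(clauses), params
```

One caveat with this approach: a LIKE pattern such as `%cow%` also matches “coward,” and it cannot use the fulltext index, so it scans every row.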

Comments

5 responses to “Searching and Minimum Word Length”

  1. Mike Boone

    I am building a search engine for a job website and ran into this problem last summer. They were on a shared hosting arrangement at the time, and I couldn’t adjust the character length below 4. I tried doing LIKE searches for = 4.0.1. Most ISPs I know are still running 3.23.x.

    Most of what I’ve learned so far has been trial-by-fire, so if you go the custom route, I’d be interested in any techniques or resources you dig up.

  2. Mike Boone

    Hmmm…I wrote a long comment and somehow the middle got sucked out. 🙁

    Maybe your comment code doesn’t like one of the characters I put in? I had a greater-than character, which came right before the = sign where the comment got truncated in the middle.

  3. Jon

    Well, I’m doing a PHP strip_tags() on the comment text, so maybe if it saw a less-than and then a greater-than sign, it thought it was an HTML tag and stripped it…? Sorry about that.
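
The truncation described in this exchange can be reproduced with a rough sketch. The regex below is only a Python approximation of tag stripping, not PHP’s actual strip_tags() algorithm, and the sample string is invented for the demonstration. Escaping the special characters, shown alongside it, keeps the text intact.

```python
import html
import re


def naive_strip_tags(text):
    # Approximation only: delete everything from a "<" up to the next ">".
    return re.sub(r"<[^>]*>", "", text)


# Made-up sample text containing a less-than and a later greater-than sign.
sample = "keep x<5 and y>2 here"

stripped = naive_strip_tags(sample)  # the middle is swallowed as a fake "tag"
escaped = html.escape(sample)        # the characters survive as HTML entities

print(stripped)
print(escaped)
```

With escaping, the browser displays the original characters, so nothing between the two signs is lost.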

  4. Mike Boone

    I must have used a less-than sign earlier in the text and it all got sucked out as a big HTML tag. Maybe use htmlspecialchars or htmlentities instead of strip_tags?

    Anyway, what I wrote was that I also had the problem of the 4 character word size last summer while a client’s database was on a shared server. I decided to build my own search engine, loosely using a method described here:

    http://jeremy.zawodny.com/blog/archives/000576.html#comment-1613

    (I also used your stemmer class instead of indexing the actual words)

    The nice thing about this method is that it gives you total control over the search. You define what the words look like (e.g., are C++ or C# searchable words?).

    The bad thing is that you have to code everything: code to add items to the index, update them, or remove them, and code to take the search string and put it into the proper SQL for the search.

    After doing all this I have a lot of respect for the people who built and maintain the Google search engine.

    As I said above, most of what I’ve learned so far has been trial-by-fire, so if you go the custom route, I’d be interested in any techniques or resources you dig up.

    (BTW, the BOOLEAN MODE stuff only works in MySQL 4, so if your web host uses 3.23, you can’t use it).

  5. Jon

    MySQL on the server is 4, so I can use IN BOOLEAN MODE.

    Thanks for the pointer, it’s an interesting approach. I’ve been thinking about rolling my own search functionality using similar methods: stem keywords from content, store them into a table with pointers back to the content, and search against them. I’d also calculate the frequency of keywords and match that against all the rest, to determine relevance… but it’s a lot of work, as you pointed out. I’ve done similar stuff before.

    As it is, I worked up a patch for the problem last night, and it works well enough for right now… I’ll blog about that solution a bit later. In the meantime, I’ll see if I can get my provider to lower the word length to 3.
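
The roll-your-own approach discussed in this thread (stem the keywords, store them with pointers back to the content, rank by frequency) might look roughly like the sketch below. It is an in-memory Python illustration under several assumptions, not anyone’s actual implementation: `crude_stem` is a placeholder for a real stemmer, the `KeywordIndex` class and its method names are hypothetical, and a real version would keep the postings in a database table.

```python
import re
from collections import Counter, defaultdict


def crude_stem(word):
    # Placeholder for a real stemmer: strips a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    # "+" and "#" are allowed so that terms like "c++" and "c#" stay searchable.
    return [crude_stem(w) for w in re.findall(r"[a-z0-9+#]+", text.lower())]


class KeywordIndex:
    def __init__(self):
        # keyword -> {doc_id: number of occurrences in that document}
        self.postings = defaultdict(dict)

    def add(self, doc_id, text):
        """Index a document, replacing any earlier version of it."""
        self.remove(doc_id)
        for word, count in Counter(tokenize(text)).items():
            self.postings[word][doc_id] = count

    def remove(self, doc_id):
        for word in list(self.postings):
            self.postings[word].pop(doc_id, None)
            if not self.postings[word]:
                del self.postings[word]

    def search(self, query):
        """Return doc ids ranked by total keyword frequency, highest first."""
        scores = Counter()
        for word in tokenize(query):
            for doc_id, count in self.postings.get(word, {}).items():
                scores[doc_id] += count
        return [doc_id for doc_id, _ in scores.most_common()]
```

Because the index defines its own tokens, three-letter words such as “php” are indexed like any other keyword, with no minimum word length in play.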