Author: Jon

  • Search Patch

    While waiting to find out if my hosting provider will change the minimum fulltext word length for MySQL, here’s what I’ve done in the meantime to deal with viable three-character search terms.

    First, I split the search string into the component words (an array). I subtract any stopwords (I’ve got a big list) and for any remaining words that are under four characters long, I add to the SQL query I’m running.

    Here’s the basic form of the query that I’m running, say searching for “porter”:

    SELECT *,
    MATCH(body) AGAINST('porter') AS relevance
    FROM content
    WHERE MATCH(body) AGAINST('porter')
    AND [additional conditions]
    ORDER BY relevance DESC
    LIMIT 10

    This uses fulltext indexing to search for “porter” with weighted relevance, and returns the appropriate content and its relevance score. Pretty straightforward, and it works really well.

    Here’s what the modified query looks like, if there’s short words present, for the search “porter php”:

    SELECT *,
    MATCH(body) AGAINST('porter') +
      (1 / INSTR(body, 'php') + 1 / 2[position of word in string])
    AS relevance
    FROM content
    WHERE ( MATCH(body) AGAINST('porter')
      OR body REGEXP '[^a-zA-Z]php[^a-zA-Z]'
      )
    AND [additional conditions]
    ORDER BY relevance DESC
    LIMIT 10

    Two new things are happening. First, in the WHERE clause, I’m using both the fulltext system to find “porter” and using a regular expression search for “php.” Why REGEXP and not LIKE? Because if I write LIKE '%cow%' for instance, I’ll not only get “cow” but also “coworker” and other wrong matches. A regular expression lets me filter those scenarios out.

    That takes care of finding the words, but I also wanted to tie them into relevance, somehow. The solution I hit upon in the above SQL is relatively simple, and does the trick well enough for my tastes. Basically, the sooner the word appears in the content, the higher its relevance, which is reflected in the inverse of the number of characters “deep” in the content it appears. And I wanted to fudge the number a bit more by weighting the position of the keyword in the search string; the sooner the keyword appears, the higher the relative score it gets.

    It’s not perfect, and I definitely wouldn’t recommend using this method on a sufficiently large dataset, but for my short-term needs it works just fine. The only thing really missing in the relevance factoring is how many times the keyword appeared in the content, but I can live without that for now.

  • Searching and Minimum Word Length

    Mike Boone, in the comments section of yesterday’s entry on searching (“Updated Search“), correctly points out that searching my site for a word that is less than four characters in length (like “php” or “cow”) does not work—no results are returned. Obviously, since I write about PHP on occasion, this is untenable.

    The problem is that MySQL‘s fulltext indexing, by default, only indexes words greater than three characters long, and I don’t think I have any way to change this, despite my initial reply to Mike’s comment. This site is running on a shared server setup on pair.com, and I have absolutely zero control over the MySQL server configuration. I might post a question to their tech support, but I’m not overly optimistic about the response. So, what to do?

    Short term, here’s my solution (though it’s not implemented yet): examine each word in the search string, throwing out stopwords (like “the,” “and,” “so,” etc.), and for any word shorter than four characters long, do a LIKE search against the content for them. No, it’s not ideal, but it’s a patch. Comments?

  • Updated Search

    I’ve been vastly updating the search functionality on my site. I’m still using MySQL‘s built-in FULLTEXT indexing to perform searches, but I’ve made the results page look a lot more (okay, almost exactly like) Google‘s. The main differences are that I’m not paginating search results (yet)—all searches limit to 10 results—and that I’m showing a relevance percentage, the first result being arbitrarily determined to be a 100% relevant.

    To determine relevance, I’m relying on MySQL: a fulltext MATCH(field) AGAINST('search string') directive will return the relevance number that MySQL computes when used in the SELECT part of a query. (See MySQL Full-text Search in the online manual for detailed info on this.)

    Further plans for searching that I haven’t implemented yet: utilizing MySQL’s IN BOOLEAN MODE parameter with searching to allow advanced things like phrase searches (with quotes), required word matching (using the plus sign), and subexpressions using parentheses. It’s pretty cool stuff. Oh, and I want to be smarter about presenting excerpts: Google tries to show you content excerpts with your search terms in them, I want to be able to do the same; currently I’m just showing the first 250 or so characters of the text with HTML stripped out of it.

    And since I’m developing my whole Personal Publishing System in an open process, I’ll write up a detailed technical article soon on how to effectively use MySQL fulltext searching and show Google-like results. All real-world; the code will be cribbed right out of my search.php file.

  • St. Patrick’s Day

    Just a quick note to wish everyone (Irish or not) a happy St. Patrick’s Day today.

    Enjoy some Guinness!

  • PHP Development Hint

    Here’s a general hint for PHP development: A quick and easy way to check for syntax or compile errors without uploading the PHP script to the Web server and testing online through a browser is via the command line. It’s obvious, and I don’t know why I didn’t think of this sooner, but I’ve been doing more and more of it lately.

    I develop primarily under Windows (with PHP installed) and upload to a Unix-variant server, and this what I’ve been doing to run a PHP script on the command line on my Windows system:

    php-cli -l filename.php

    You could omit the -l option (it’s a syntax check option only) to parse and run the code, if you like. Either way, it’s an easy way to check your code without uploading it and potentially breaking your site.

  • New (Old) Design

    Just flipped the switch on the site design I wrote about (see “Everything Old is New Again“). So far things are looking good, but there might be some bugs still lurking. And right now the changes only apply to the blog pages; I haven’t reworked the ebooks page or others, yet.

    And there’s two new pages available: What is Syndication? and My Projects. The Syndication page is a sort-of FAQ on syndicating a site and RSS—a helper page, or primer page, as it were, to anyone who sees my RSS link and wonders what the hell that is. Consider it a draft, but I will be updating and maintaining that page, and aim to make it a good landing page for syndication/RSS questions.

  • Button Sites

    My post about the buttons from the other day (“Those small web buttons…“) yielded up some excellent links:

    Taylor McKnight has an amazing archive of all the known buttons, 2025 at current count. Nice. I’m not even bothering to “collect” any more.

    Kalsey Button Maker is an online app that automagically creates the buttons for you. Though with all the buttons available at the other site, this is kind of redundant—but it’s fun to fool around with.

  • Google Image Search

    Playing around with Google‘s image search, I’ve thought of some advanced search features they need to implement. Hopefully someone at Google is reading this and will get right on it ;)

    You need to be able to search by specific image dimensions (in pixels); for example, I’d like to be able to type “width:80 height:15” or maybe “dimensions:80x15” and have Google return all the images that are 80 by 15 pixels (yes, this idea is directly related to my last post on the 80×15 images). This can’t be hard; Google’s already caching the size of the image and displaying that on the search results pages, so why not be able to search them?

  • Those small web buttons…

    I’ve been noticing recently the proliferation (mostly on blog sites) of those small image files that are 80 pixels wide by 15 pixels high, are generally two-tone in color and use a simple old-school-looking font. Like these:

    RSS 2.0 button ORBlogs.com button

    And I’m wondering, what’s the story? What are they called, exactly? (I’m thinking either “buttons” or “badges.”) Who’s making them? I think they’re pretty cool, actually; clever, simple, and elegant, and a damn good graphical meme that’s working it’s way around my brain. I just haven’t been able to find out anything about them online, and I’m getting really curious.

    So I’ve started “collecting” them, saving any news ones I come across into a “badges” directory on my computer. I’ve got 42 already in two days. (Hmmm, 42. Coincidence?)

    So, what’s the scoop? Anyone know?