Category: Computers

  • Search Patch

    While waiting to find out if my hosting provider will change the minimum fulltext word length for MySQL, here’s what I’ve done in the meantime to deal with viable three-character search terms.

    First, I split the search string into its component words (an array). I subtract any stopwords (I’ve got a big list), and for any remaining words that are under four characters long, I add a condition to the SQL query I’m running.
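
    A minimal sketch of that first step, assuming a tiny placeholder stopword list and a hypothetical helper name (`split_search_terms`); neither comes from the actual search code. It records each word's 1-based position after stopwords are removed, which is one plausible reading of "position of word in string" in the query further down.

```php
<?php
// Placeholder stopword list (an assumption; the real list is much bigger).
$stopwords = ['the', 'and', 'so', 'for', 'a', 'of'];

/**
 * Split a search string into words long enough for fulltext indexing
 * and words that are too short for it. Hypothetical helper.
 */
function split_search_terms(string $search, array $stopwords, int $minLength = 4): array
{
    // Break on anything that is not a letter or digit.
    $words = preg_split('/[^a-z0-9]+/', strtolower($search), -1, PREG_SPLIT_NO_EMPTY);

    // Drop stopwords, then reindex so positions are contiguous.
    $words = array_values(array_diff($words, $stopwords));

    $result = ['fulltext' => [], 'short' => []];
    foreach ($words as $index => $word) {
        $key = strlen($word) < $minLength ? 'short' : 'fulltext';
        // Key each word by its 1-based position in the cleaned search string.
        $result[$key][$index + 1] = $word;
    }
    return $result;
}

$terms = split_search_terms('the porter php', $stopwords);
// $terms['fulltext'] is [1 => 'porter'], $terms['short'] is [2 => 'php']
```

    The `fulltext` words would go into MATCH ... AGAINST as before, and each `short` word would get its own extra condition.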

    Here’s the basic form of the query that I’m running, say searching for “porter”:

    SELECT *,
    MATCH(body) AGAINST('porter') AS relevance
    FROM content
    WHERE MATCH(body) AGAINST('porter')
    AND [additional conditions]
    ORDER BY relevance DESC
    LIMIT 10

    This uses fulltext indexing to search for “porter” with weighted relevance, and returns the appropriate content and its relevance score. Pretty straightforward, and it works really well.

    Here’s what the modified query looks like if there are short words present, for the search “porter php”:

    SELECT *,
    MATCH(body) AGAINST('porter') +
      (1 / INSTR(body, 'php') + 1 / 2[position of word in string])
    AS relevance
    FROM content
    WHERE ( MATCH(body) AGAINST('porter')
      OR body REGEXP '[^a-zA-Z]php[^a-zA-Z]' )
    AND [additional conditions]
    ORDER BY relevance DESC
    LIMIT 10

    Two new things are happening. First, in the WHERE clause, I’m using both the fulltext system to find “porter” and using a regular expression search for “php.” Why REGEXP and not LIKE? Because if I write LIKE '%cow%' for instance, I’ll not only get “cow” but also “coworker” and other wrong matches. A regular expression lets me filter those scenarios out.
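
    To make the LIKE versus REGEXP difference concrete, here is a rough PHP-side imitation of the two conditions. The function names are made up for illustration, and the sketch assumes case-insensitive matching, which is what a typical non-binary MySQL column gives you.

```php
<?php
// Roughly what LIKE '%word%' does: a plain substring test.
function like_match(string $body, string $word): bool
{
    return stripos($body, $word) !== false;
}

// Roughly what REGEXP '[^a-zA-Z]word[^a-zA-Z]' does: the word must have
// a non-letter character on both sides.
function regexp_match(string $body, string $word): bool
{
    $pattern = '/[^a-zA-Z]' . preg_quote($word, '/') . '[^a-zA-Z]/i';
    return preg_match($pattern, $body) === 1;
}

var_dump(like_match('My coworker left.', 'cow'));   // bool(true): the wrong match
var_dump(regexp_match('My coworker left.', 'cow')); // bool(false)
var_dump(regexp_match('The cow left.', 'cow'));     // bool(true)
var_dump(regexp_match('cow', 'cow'));               // bool(false): nothing on either side
```

    The last case shows a limit of this pattern: because it requires a character before and after the word, content that begins or ends with the keyword is not matched.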

    That takes care of finding the words, but I also wanted to tie them into relevance, somehow. The solution I hit upon in the above SQL is relatively simple, and does the trick well enough for my tastes. Basically, the sooner the word appears in the content, the higher its relevance: the score is the inverse of the number of characters “deep” into the content at which the word first appears. I also wanted to fudge the number a bit more by weighting the position of the keyword in the search string; the sooner the keyword appears, the higher the relative score it gets.
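
    The same arithmetic, worked through in PHP as a sketch. The function name and the choice to return zero for a missing word are assumptions made for illustration; the real calculation happens inside the SQL.

```php
<?php
/**
 * Bonus added to the fulltext relevance for one short keyword.
 * $termPosition is the keyword's 1-based position in the search string.
 * Hypothetical helper mirroring 1 / INSTR(body, word) + 1 / position.
 */
function short_word_bonus(string $body, string $word, int $termPosition): float
{
    $offset = stripos($body, $word);   // 0-based offset, or false when absent
    if ($offset === false) {
        return 0.0;                    // assumption: no bonus when the word is missing
    }
    $depth = $offset + 1;              // INSTR() counts characters from 1
    return 1 / $depth + 1 / $termPosition;
}

echo short_word_bonus('PHP is great', 'php', 2), "\n"; // 1.5 (1/1 + 1/2)
echo short_word_bonus('No match here', 'php', 2), "\n"; // 0
```

    One difference worth knowing: in MySQL, INSTR() returns 0 when the word is absent, and dividing by zero yields NULL, so a row matched only by the fulltext term may end up with a NULL relevance unless that case is handled separately.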

    It’s not perfect, and I definitely wouldn’t recommend using this method on a sufficiently large dataset, but for my short-term needs it works just fine. The only thing really missing in the relevance factoring is how many times the keyword appeared in the content, but I can live without that for now.

  • Searching and Minimum Word Length

    Mike Boone, in the comments section of yesterday’s entry on searching (“Updated Search”), correctly points out that searching my site for a word that is less than four characters in length (like “php” or “cow”) does not work: no results are returned. Obviously, since I write about PHP on occasion, this is untenable.

    The problem is that MySQL’s fulltext indexing, by default, only indexes words greater than three characters long, and I don’t think I have any way to change this, despite my initial reply to Mike’s comment. This site is running on a shared server setup, and I have absolutely zero control over the MySQL server configuration. I might post a question to the host’s tech support, but I’m not overly optimistic about the response. So, what to do?

    Short term, here’s my solution (though it’s not implemented yet): examine each word in the search string, throwing out stopwords (like “the,” “and,” “so,” etc.), and for any word shorter than four characters, do a LIKE search against the content for it. No, it’s not ideal, but it’s a patch. Comments?

  • Updated Search

    I’ve been vastly updating the search functionality on my site. I’m still using MySQL’s built-in FULLTEXT indexing to perform searches, but I’ve made the results page look a lot more like (okay, almost exactly like) Google’s. The main differences are that I’m not paginating search results yet (all searches are limited to 10 results) and that I’m showing a relevance percentage, with the first result arbitrarily treated as 100% relevant.
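
    One way the relevance percentage could be computed, as a sketch: divide every score by the top score. The function name is hypothetical, and it assumes the scores arrive sorted highest first, as ORDER BY relevance DESC would return them.

```php
<?php
/**
 * Convert raw relevance scores (highest first) into whole percentages,
 * treating the first result as 100%. Hypothetical helper.
 */
function relevance_percentages(array $scores): array
{
    if ($scores === []) {
        return [];
    }
    $top = $scores[0];
    if ($top <= 0) {
        // Nothing meaningful to scale against.
        return array_fill(0, count($scores), 0);
    }
    return array_map(
        function ($score) use ($top) {
            return (int) round($score / $top * 100);
        },
        $scores
    );
}

print_r(relevance_percentages([4.0, 3.0, 1.0])); // 100, 75, 25
```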

    To determine relevance, I’m relying on MySQL: a fulltext MATCH(field) AGAINST('search string') directive will return the relevance number that MySQL computes when used in the SELECT part of a query. (See MySQL Full-text Search in the online manual for detailed info on this.)

    Further plans for searching that I haven’t implemented yet: utilizing MySQL’s IN BOOLEAN MODE modifier with searching to allow advanced things like phrase searches (with quotes), required word matching (using the plus sign), and subexpressions using parentheses. It’s pretty cool stuff. Oh, and I want to be smarter about presenting excerpts: Google tries to show you content excerpts with your search terms in them, and I want to be able to do the same; currently I’m just showing the first 250 or so characters of the text with HTML stripped out of it.
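
    A rough sketch of what a smarter excerpt might look like, falling back to the current first-250-characters behaviour when no term is found. The function name, the one-third offset, and the "..." markers are all assumptions, and the code works on bytes, so it is not safe for multibyte text as written.

```php
<?php
/**
 * Build a plain-text excerpt of up to $length characters that starts
 * shortly before the first search term found, or at the beginning of
 * the text when no term is found. Hypothetical helper.
 */
function search_excerpt(string $html, array $terms, int $length = 250): string
{
    // Strip HTML and collapse runs of whitespace.
    $text = trim(preg_replace('/\s+/', ' ', strip_tags($html)));

    $start = 0;
    foreach ($terms as $term) {
        $offset = stripos($text, $term);
        if ($offset !== false) {
            // Back up so the term sits about a third of the way in.
            $start = max(0, $offset - intdiv($length, 3));
            break;
        }
    }

    $prefix = $start > 0 ? '...' : '';
    $suffix = $start + $length < strlen($text) ? '...' : '';
    return $prefix . substr($text, $start, $length) . $suffix;
}

echo search_excerpt('<p>Hello <b>world</b></p>', ['world']); // Hello world
```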

    And since I’m developing my whole Personal Publishing System in an open process, I’ll write up a detailed technical article soon on how to effectively use MySQL fulltext searching and show Google-like results. All real-world; the code will be cribbed right out of my search.php file.

  • PHP Development Hint

    Here’s a general hint for PHP development: A quick and easy way to check for syntax or compile errors without uploading the PHP script to the Web server and testing online through a browser is via the command line. It’s obvious, and I don’t know why I didn’t think of this sooner, but I’ve been doing more and more of it lately.

    I develop primarily under Windows (with PHP installed) and upload to a Unix-variant server, and this is what I’ve been doing to run a PHP script on the command line on my Windows system:

    php-cli -l filename.php

    You could omit the -l option (it’s a syntax check option only) to parse and run the code, if you like. Either way, it’s an easy way to check your code without uploading it and potentially breaking your site.

  • Computer Languages History Timeline

    From the Computer Languages History site comes an impressive computer languages timeline chart. It’s as much a language family tree as it is a timeline. Very nice, though a little hard to read.

  • Disposable Paperboard Computer

    Routed via Slashdot comes the story of the disposable paperboard computer, which “can collect, process, and exchange several pages of encrypted data.” It even has a generous 32KB of memory.

    After reading about this, I couldn’t help thinking that we’ve already had disposable, paper-based computers around since, well, forever. They’re called pen and paper.

    And hey, if you throw in one of those sweet old-school PeeChee folders (why the hell can’t I find a web page for those things?? Other than online school supplies lists, I mean), you’ve instantly upgraded: not only your storage capacity, but processing power because you’ve got all those conversion and multiplication tables and various references at your fingertips!

  • Rasmus is the Man

    Rasmus Lerdorf, that is, the creator and godfather of PHP. He’s got an article on the Oracle Technology Network titled “Do You PHP?” that’s definitely worth a read. Here’s a sample:

    What it all boils down to is that PHP was never meant to win any beauty contests. It wasn’t designed to introduce any new revolutionary programming paradigms. It was designed to solve a single problem: the Web problem. That problem can get quite ugly, and sometimes you need an ugly tool to solve your ugly problem. Although a pretty tool may, in fact, be able to solve the problem as well, chances are that an ugly PHP solution can be implemented much quicker and with many fewer resources. That generally sums up PHP’s stubborn function-over-form approach throughout the years….

    Despite what the future may hold for PHP, one thing will remain constant. We will continue to fight the complexity to which so many people seem to be addicted. The most complex solution is rarely the right one. Our single-minded direct approach to solving the Web problem is what has set PHP apart from the start, and while other solutions around us seem to get bigger and more complex, we are striving to simplify and streamline PHP and its approach to solving the Web problem.

    The guy just oozes common sense. Here’s another bit about PHP that he wrote on the PHP-DEV mailing list about two years ago, one of my favorites that just sums up beautifully the philosophy of PHP:

    The golden rules of PHP are to keep the WTF(*) factor low and the POTFP(**) factor high.

    (*) What The Fuck
    (**) Piss Off The Fewest People

    No two ways about it: he’s one of my heroes.

  • CMS Ranting

    Gadgetopia has a good rant on content management that I’m just getting around to posting about. (CMS’s Should Manage Content, Not Display It)

    My solution was to write a function library to make raw database calls to get everything out in a nice, big, nested PHP array. I essentially built an API for the CMS to make pulling content easy, but I do all the HTML processing in PHP, abandoning completely the display side of this CMS. I still use it for administration, workflow, etc. (which it excels at), but when PHP is such a fantastic, mature language, why reinvent the wheel?

    I really don’t have anything to add to this, other than that this is largely why I favor developing my own PHP software rather than using pre-built systems—I have absolute control over the way the software works and I don’t have to rely on clunky, awkward front-end architecture and programming that I disagree with. Give me the data, and let me decide what to do with it.

  • Formatting changes

    I love templates. I was able to make some changes to the site formatting in mere minutes thanks to templates. Change two files, and it all propagates throughout the site. Lovely.

    I use a modified version of the Template class from the PHP Base Library for just about any PHP programming project I work on any more. I’ve looked into other, similar classes for PHP but haven’t really found anything that comes close to the PHP Base Library Template.

    I’ve never gotten into using Smarty largely because from what I know of it, it doesn’t fit my needs—it’s overkill for a templating system. (Caveat emptor. I could very well be wrong here.) Here’s a hint: not everything you use a template for needs to be/should be/can be compiled into PHP, which is what Smarty does. I can use my hacked Template class to build any kind of files, like my RSS file—not just PHP and HTML. Plus it’s very easy to use and it’s not burdened down with all the additional template scripting code (yeah, code) that Smarty allows.

    For my money, if you’re working with Smarty, you might as well just forgo it entirely and code in native PHP. But that’s just me.

  • PHP: Best of Breed

    I’ve been meaning to write this article for a while now, mainly to point to some really good PHP applications and spread some kudos.

    There are many good applications and classes out there, but I’m limiting this to those that I’ve had hands-on experience with. Even so, this is hardly a comprehensive list; I may do some follow-up articles highlighting more good PHP.