Author: Jon

  • Data Mining the Web

    An interesting article today on MSNBC titled “Online search engines lift cover of privacy”, and the “InfoPorn” section of February’s Wired (can’t find a link, sorry) highlighting identity theft motivated me to write about a topic I’ve been thinking about for a while now: data mining the Web.

    The article talks about the absurd amount of information that is freely available on the Web, and how much of it is accessible through Google—and then calls using Google to find this data “Google hacking.” I think a more accurate term would be Google mining—there’s really no mad hacker v00d00 ski11z involved, and let’s face it, being able to run a realtime query against a massive database containing billions of pieces of information is really the essence of data mining.

    What got me thinking about mining the Web? Most recently, social networking software, and the data such software collects from its users. As I’ve written before, a useful social networking system will (among other things) let you crawl the relationships among people and drill down by varying degrees into their data/life/online platform. But you know, you can already essentially do this with nothing more than a Web browser; it all goes back to the fact that there is an absurd amount of information freely and publicly available on the Web, much of it cheerfully self-published by people who should know better.

    Example? Resumes. You’ve all seen them; half the personal sites out there have an online resume page, and you can find at least 45,300 more by searching Google for “resume.doc”. On average, they contain a shocking amount of personal information: what schools you went to, and when; who employed you, and when; your address and phone number; your skills; sometimes your Social Security number. Tip of the iceberg.

    You can find out a lot about someone simply by reading their blog. My own is no exception, I’m sure, but sometimes even I’m amazed at how much personal detail people will reveal online.

    And did you know you can search for wishlists at Amazon.com and often a user’s wishlist will also contain their birthday and the city and state in which they live? If that doesn’t work, try finding someone’s birthday on Anybirthday.com—they boast having over 130 million entries gleaned from public records.

    Here’s where it gets tricky. The MSNBC article takes an alarmist tone, and in part it’s right to do so: companies and people that leave sensitive documents published on a crawler-accessible Web page are in danger of having their privacy violated. However, a lot of the information that’s out there is already public information, or information that’s freely volunteered by people and becomes public. Google is merely a tool that aggregates this information into one source. And me? Hell, I love Google, I frankly think it’s amazing. And I’m an information junkie, I salivate over the data mining possibilities—and I’ve got ideas rolling around my head on what could be done with this data, ways it can be manipulated, and linked, and so on.

    We’ve barely scratched the surface when it comes to mining the Web—I think the untapped possibilities we’re sitting on are enormous, potentially dwarfing anything we’ve previously encountered. Google is a first step.

    What’s next?

  • Content Management: Bootstrapping

    I’ve been bootstrapping the code for my Personal Publishing System (nicknamed “Spokane”) that I wrote about here and here, and since I had intended this to be an open process that I’d blog about, I’m writing up some of what I’m doing and my thoughts on how to do it.

  • Quick linking

    Here’s something interesting: my blog entry on Bend WinterFest is now the number 4 result on Google when searching for “Bend winterfest”, less than two weeks after I posted it. Damn, that’s fast.

  • Formatting changes

    I love templates. I was able to make some changes to the site formatting in mere minutes thanks to templates. Change two files, and it all propagates throughout the site. Lovely.

    I use a modified version of the Template class from the PHP Base Library for just about any PHP programming project I work on anymore. I’ve looked into other, similar classes for PHP but haven’t really found anything that comes close to the PHP Base Library Template.

    I’ve never gotten into using Smarty, largely because from what I know of it, it doesn’t fit my needs: it’s overkill for a templating system. (Caveat emptor. I could very well be wrong here.) Here’s a hint: not everything you use a template for needs to be/should be/can be compiled into PHP, which is what Smarty does. I can use my hacked Template class to build any kind of file, like my RSS file, not just PHP and HTML. Plus it’s very easy to use and it’s not weighed down with all the additional template scripting code (yeah, code) that Smarty allows.
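    To make the "any kind of file" point concrete, here is a minimal sketch of plain placeholder substitution. It's Python's standard string.Template, not the PHP Base Library Template class or its API, and the placeholder names and example URL are made up for illustration. The same fill-in step produces an RSS item or an HTML fragment, with no template scripting involved.

```python
from string import Template

# Hypothetical RSS item template; the $placeholders are made-up names.
RSS_ITEM = Template(
    "<item>\n"
    "  <title>$title</title>\n"
    "  <link>$link</link>\n"
    "</item>"
)

# The same substitution step works for an HTML fragment.
HTML_ITEM = Template('<li><a href="$link">$title</a></li>')


def render(template, **values):
    """Fill every $placeholder; raises KeyError if a value is missing."""
    return template.substitute(**values)


# Illustrative post data only.
post = {"title": "Formatting changes", "link": "http://example.com/formatting"}
rss_output = render(RSS_ITEM, **post)
html_output = render(HTML_ITEM, **post)
print(rss_output)
print(html_output)
```

    Note this sketch does no XML or HTML escaping of the values, which a real feed would need.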

    For my money, if you’re working with Smarty, you might as well just forgo it entirely and code in native PHP. But that’s just me.

  • Movable Type Rant

    Great rant on Kuro5hin titled “Why your Movable Type blog must die”. Made me laugh. Worthy of Dennis Miller during his ranting heyday.

    You are all pretentious twats

    Every last one of you. You’re all latte-sipping, iMac-using, suburban-living tertiary-industry-working WASPs who offer absolutely no new insights on anything whatsoever apart from maybe one specialist field if we’re lucky. Most of you think that you’re writing original content and that you’re making a contribution by licensing your spewings under Creative Commons “Some Rights Reserved” licences, just because it’s the hip thing to do. You think you know all there is to say about blogging because you understand the concept of HTML and CSS, but the horrible truth is that 40% of you are all using the same shitty default layout. Then you take pictures of yourselves looking pensive or making vague allusions to mythology.

    Of course, I can’t claim to be much better as a blogger than some of the caricature portraits in this rant, but at least I don’t use Movable Type. :)

  • Subtle Themes

    I’ve noticed some subtle yet interesting themes cropping up in a couple of the blogs I’ve read in the last couple of days. Joi Ito posted about having lunch with Seth Lloyd and discussed, among other things, entropy and information theory. Over on ongoing, Tim Bray published a photo essay on the beauty of decay and entropy.

  • Social software again

    All the hoopla over Orkut last week got me thinking more about this “social software” phenomenon from sites like Orkut and Friendster. You may remember I’ve ranted about Friendster before. My conclusions at the time were that I could see some value to it, but didn’t know what I could actually do with it.

    Several months later, same results. What do I do with this type of software? I don’t need a date. I get bored with searching for people I don’t know when all I can do is search. They’re poor at facilitating communication compared to other technologies. I already have an address book—several, actually—of people that I do know and keep in touch with. So?

    So, all of these social networking sites seem to me to be half-baked: they’re a framework built upon an interesting idea, but they’re not done yet. Honestly, I’m not even sure I can tell what the end goal is—having an interesting idea doesn’t guarantee success.

    The interesting thing about Orkut is that it’s an invitation-only service, meaning that every user is linked to every other user in one big network, unlike Friendster or the others, where there are “pockets” of networks existing independently. Having everyone linked in some way is inherently more valuable to me; stand-alone networks diminish the value of the system.

    But what system? Still a problem. I suppose it would be interesting to be able to crawl or browse the network of people (the big one, like Orkut does) and drill down into user data to varying degrees, based on how close that user is to you in the network. But there would have to be more than just user data; I’d want to drill down into their online presence/identity/platform: the blogs, the photo galleries, the web pages and XML files of metadata, their trail of public interactions across the web (like on forums, or weblog comments)… As an example, a user browsing/crawling me would be able to drill down into chuggnutt.com, which is becoming more and more the platform which defines my online existence. From here they could read my weblog and the archives, follow the links to any projects I’m working on (that I choose to share), see what sites and blogs I read, play with any apps I develop, etc.
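    Here's a rough sketch of what "drill down based on proximity" could look like. Everything in it is an assumption for illustration: the tiny friendship graph, the section names, and the rule that closer people see more. It isn't how Orkut or Friendster work. A breadth-first crawl gives each person's degree of separation, and that degree picks what's visible.

```python
from collections import deque

# Hypothetical friendship graph: each person maps to their direct contacts.
NETWORK = {
    "jon": ["alice", "bob"],
    "alice": ["jon", "carol"],
    "bob": ["jon"],
    "carol": ["alice", "dave"],
    "dave": ["carol"],
}

# Made-up policy: the closer someone is, the more of their platform you see.
VISIBLE_BY_DEGREE = {
    1: ["weblog", "projects", "photos", "blogroll"],
    2: ["weblog", "projects"],
    3: ["weblog"],
}


def degrees_from(network, start):
    """Breadth-first crawl: distance of every reachable person from start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for contact in network.get(person, []):
            if contact not in distance:
                distance[contact] = distance[person] + 1
                queue.append(contact)
    return distance


def visible_sections(network, viewer, target):
    """List what viewer may drill down into on target's platform."""
    degree = degrees_from(network, viewer).get(target)
    if degree is None:
        return []  # not connected at all
    degree = max(degree, 1)  # you see your own platform in full
    return VISIBLE_BY_DEGREE.get(degree, [])  # too distant: nothing


print(visible_sections(NETWORK, "jon", "carol"))
print(visible_sections(NETWORK, "jon", "dave"))
```

    In one big linked network every lookup finds some path; in the "pockets" model, many lookups would simply come back empty.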

    (I realize as I write this I’m also envisioning some of the online experience David Brin wrote into his near-future novel, Earth. But I haven’t read it in a long time, so I may be way off.)

    But I can accomplish a lot of that now anyway, so why another service for it? As far as I’m concerned, the real social software has been around for quite a while now: BBSes, email, IRC, Usenet, instant messaging, weblogs. There’s more, but you get the idea.

  • Office Agoraphobia

    This week the office is being rearranged—furniture, computers, phones, etc. The net effect for me is that I get a significantly larger office space all to myself—and all that extra space is kinda freaking me out. :)

  • Bookish

    Some neat book-related links tonight. First, Locus Online has published a 2003 recommended reading list of science fiction and fantasy novels, novellas, short stories, anthologies, etc. It looks like a good list; I noticed several items coinciding with my reading wishlist.

    Next, the big news: Cory Doctorow’s new book, Eastern Standard Tribe, is out! And, like his previous two books, he is making the novel free to download from his website. Gotta love this. Which means, over the next couple of days, I’ll download the plain text version and convert it to the Palm Reader format for my ebooks page. Don’t worry, though: you can still buy the book if you want a paper version. I sure will.

  • Pink eye

    My daughter came home today with pink eye. Apparently it’s been going around the preschool and sure enough, she’s got it. Should make for an interesting few days as we all try to avoid catching the virus.