Blogdex not updating

On Tuesday afternoon, one of the main storage machines at the Media Lab had a major hard drive failure, and everything had to be restored from tape. Unfortunately, one of the major parts of Blogdex runs from this hard drive and can't be restored until the drive is back up. (I put it on that drive for safekeeping. Ugh.)

The good news is that it should be running again tonight if my sysadmins give me the right information. I'll post again to let you know that it's up and running. Thanks for understanding.

Blogdex adds gzip support

Blogdex uses the Apache::ASP Perl module as a scripting engine to build its various dynamic pages. While browsing the configuration variables for another project I'm working on, I noticed a peculiar setting I hadn't seen before: CompressGzip.

By switching one variable sitewide, most pages on Blogdex will now be sent compressed. It's not such a big issue for me, since MIT provides me with nearly unlimited bandwidth. However, for a nominal amount of CPU, all of Blogdex's content should transfer about 4 times faster than before (I benchmarked the compression using the ISAPIZip online tool). Enjoy.
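For a rough feel of why link-heavy pages compress so well, here is a minimal sketch in Python. It is not how Apache::ASP does the work internally, and the sample page is made up for illustration; the ratio you get on real pages will vary.

```python
import gzip

def compression_ratio(body: bytes) -> float:
    """Return the original size divided by the gzip-compressed size."""
    compressed = gzip.compress(body)
    return len(body) / len(compressed)

# Hypothetical page: a long list of links, similar in shape to an index page.
sample_page = (
    "<html><body><ol>"
    + "".join(
        f'<li><a href="http://example.com/{i}">Link {i}</a></li>'
        for i in range(200)
    )
    + "</ol></body></html>"
).encode("utf-8")

ratio = compression_ratio(sample_page)
print(f"{len(sample_page)} bytes uncompressed, ratio {ratio:.1f}x")
```

Repetitive markup is the best case for gzip, so a synthetic page like this one will likely overstate the gain compared to pages with more free text.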

Downtime

Over the past few days Blogdex has remained silent. I didn't know this because I've been at a conference without (gasp) any sort of network access. After a bit of debugging I realized that the weblog cache table of the database had become corrupt, so nothing was coming in or out. Things should be back to normal now.

On Blogdex exploitability

Peter Caputa has written a little piece on what he believes is a flaw in the Blogdex ranking technique:

www.whizspark.com/blog

Peter is completely right: this technique can be used to drive a post to the top of the index (as he has so aptly shown). Since the day Blogdex debuted, people have been devising ways to exploit the ranking algorithm. And truth be told, most of them work :) As of late, a few weblog widgets have been facilitating a regular sort of Blogdex exploitation. Sites that automatically post various pieces of information to the front of their weblog (comments, referrers, etc.) allow outsiders to manipulate the front of their sites, and in turn, the front page of Blogdex.

Of course, the time-weighted nature of the index means that while any one person can control the index, they can't control it for long. Every link that gets to the top of the index only stays there for a day at most, and furthermore I'm constantly looking for sites that look out of the ordinary. Even though the type of exploit Peter points out has existed for years, and has been taken advantage of a number of times, it hasn't really drawn much attention. I thought about trying to build in some sort of system to detect sites with referral links on their front pages, but truth be told, it hasn't really warranted that much work yet.
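To make the time-weighting idea concrete, here is a small sketch in Python. This is not the actual Blogdex formula, which isn't spelled out here; the exponential decay, the six-hour half-life, and the function name are all assumptions chosen only to illustrate why a burst of manufactured links fades quickly.

```python
from datetime import datetime, timedelta

HALF_LIFE_HOURS = 6.0  # assumed value, chosen only for illustration

def time_weighted_score(link_times, now):
    """Sum one decaying vote per inbound link; older links count for less."""
    score = 0.0
    for t in link_times:
        age_hours = (now - t).total_seconds() / 3600
        score += 0.5 ** (age_hours / HALF_LIFE_HOURS)
    return score

now = datetime(2003, 1, 2, 12, 0)

# Hypothetical data: a burst of 20 links from 30 hours ago,
# versus just 3 links from the last hour.
burst = [now - timedelta(hours=30)] * 20
fresh = [now - timedelta(hours=1)] * 3

burst_score = time_weighted_score(burst, now)
fresh_score = time_weighted_score(fresh, now)
print(burst_score, fresh_score)
```

Under these assumed numbers, the three fresh links outscore the twenty older ones, which is the behavior described above: someone can push a link to the top, but it won't stay there.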

The downside of a completely transparent system is its manipulability, but this is of course also what makes it trustworthy. Sans comment spam (which is largely a product of a weblog software exploit, not Blogdex), there hasn't really been a loophole which has continually affected the index. Blogdex is your sandbox... play around and figure it out. Break it and I'll fix it... then I'll study you and get a Ph.D. ;)

Announcing Tracking Syndication

I've added two new RSS features allowing people to track both sites and weblogs via a simple RSS feed. Check the sidebar for links to various RSS feeds.

Link Diffusion

Any link's diffusion can now be tracked using the page /xml/track.asp. First find the link diffusion page by searching for the link, or entering it manually on the same search page.

Weblog Tracking

Much more exciting than the standard link diffusion is the new weblog tracking page. Instead of tracking links independently, for any weblog currently in Blogdex you can now track all links to any subsite therein. For instance, my weblog:

http://overstated.net/

will show links for anything under that url, including the alias for www.overstated.net. Give it a try and let me know what you think. To find a weblog, as above, search for the url and click "track this weblog."
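The "anything under that url" matching can be sketched roughly like this in Python. It is only an illustration of the idea, not the code Blogdex runs; the function names are hypothetical, and treating a leading "www." as an alias for every host is an assumption.

```python
from urllib.parse import urlsplit

def normalize(url):
    """Lowercase the host, drop a leading 'www.', and return (host, path)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host, parts.path or "/"

def is_under(weblog_url, link_url):
    """True if link_url points at the weblog or at anything beneath it."""
    w_host, w_path = normalize(weblog_url)
    l_host, l_path = normalize(link_url)
    if w_host != l_host:
        return False
    # Compare on a directory boundary so /blog does not match /blogger.
    prefix = w_path if w_path.endswith("/") else w_path + "/"
    return l_path == w_path or l_path.startswith(prefix)

# The archive path below is made up for the example.
print(is_under("http://overstated.net/", "http://www.overstated.net/archives/1.html"))
print(is_under("http://overstated.net/", "http://example.com/overstated.net/"))
```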

Feedback would be greatly appreciated.

Domain name change

As you might notice, every URL on Blogdex is now being rewritten to the new URL for Blogdex:

http://blogdex.net

This is due in large part to a kind gift from Jimmy Wales, the former owner, to whom we are extremely thankful.

The reason that blogdex.net was chosen is that MIT has a policy against the use of .com domain names for servers run on the MIT network. This is an informal policy created to uphold the policies on commercial activity at MIT (as noted in Section 13.2.3 of MIT's policies and procedures). It is my opinion that the .com domain does not connote commercial activity anymore, but after bringing this to the attention of the MIT network administrator, I received no response.

If this policy changes, I might switch over to blogdex.com, but for the time being everything will point to blogdex.net. I'm sure most people will be thankful for the change.

MySQL error

Blogdex was down from 4:30pm until 7:30pm tonight because MySQL had all of its sockets open. Restarting the database solved the problem, but I'm checking to see why the error occurred.

Infrastructure changes

Over the next couple of weeks Blogdex will be undergoing some backend upgrades. Some machines have become free for my use here, so technically speaking we'll finally be in the year 2001. The following transitions will be happening:

Web server: Moving from a dual PIII-500 running Win2K and IIS to a dual PIII-1GHz running Linux/Apache.

DB server: Moving from a dual PII-200 running Linux/MySQL to a dual PIII-500 running Linux/MySQL with faster disks. This should be a significant upgrade.

The upgrade to Apache is already complete, so if you're seeing blogdex.media.mit.edu as 18.85.45.85, you're accessing it from the new server. The transition from IIS PerlScript was amazingly seamless, requiring a few simple query-replaces to change from IIS ASP syntax to Apache ASP syntax. I should be moving the database server sometime tomorrow.

Taking the garbage out

I apologize for the number of useless links today. I have been working on the backend recently, trying to make it more reliable, and have somehow introduced a bug that is causing it to parse old, dead weblogs as if they were new. I'm correcting for the error, and the index should be back to normal momentarily.

RSS and Uniqueness

One of the biggest problems with RSS for a dynamic site like Blogdex is that the content of individual entities will change over time. Over the course of a day, the description for a site in the top 10 may change up to 30 times as different weblogs link to the site and affect the ways in which it is being described.

This means that over many downloads of the Blogdex RSS feed, certain persistent links will be replicated many times. RSS 2.0 provides a facility to specify the identity of an item (its guid), so that despite a changing title and description, RSS readers should only keep one instance (probably the most recent). Some people have suggested a similar element for RDF, but no RSS readers I know of utilize this record:uuid element.
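Here is a minimal sketch, in Python, of how a reader might use the guid to keep a single copy of each item across downloads. The feed snippets, URLs, and function name are made up for illustration, and real readers may handle missing guids or ordering differently.

```python
import xml.etree.ElementTree as ET

def merge_items(seen, feed_xml):
    """Update `seen` (guid -> item dict) with the items in one feed download."""
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        guid = item.findtext("guid")
        if guid is None:
            continue  # without a guid we cannot tell items apart reliably
        # A later download overwrites the earlier copy with the same guid.
        seen[guid] = {
            "title": item.findtext("title", default=""),
            "description": item.findtext("description", default=""),
        }
    return seen

# Two hypothetical downloads of the same feed, a few hours apart.
morning = """<rss version="2.0"><channel><item>
  <guid>http://example.com/story</guid>
  <title>A story</title><description>First blurb</description>
</item></channel></rss>"""

evening = """<rss version="2.0"><channel><item>
  <guid>http://example.com/story</guid>
  <title>A story</title><description>Newer blurb</description>
</item></channel></rss>"""

seen = {}
merge_items(seen, morning)
merge_items(seen, evening)
print(len(seen), seen["http://example.com/story"]["description"])
```

Keying on the guid rather than the title or description is the whole point: the description can change 30 times in a day and the reader still shows one entry.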

I am extremely busy this week, but I have changed the RSS 0.91 feed to RSS 2.0 with the guid element in the meantime. As soon as my exams are over (April 2), I'll provide a facility for RSS 0.91, 2.0 and RDF. Please let me know if you have any suggestions or problems, as I'm excited to start providing more Blogdex RSS features.