Blogdex 2.0
Blogger hacked?
Mastodonte Advertising
Restructuring almost complete
RSS Syndication Fixed
linkInfo temporarily down
More on scalability
Move to MT
Blogdex finally back up
I've been spending most of my time reprogramming old interfaces to fit the new architecture. Here's a list of the changes that have taken place over the past 24 hours:
- The weblog add/verification system is working again
- The RSS 0.91 version of the top 50 is up. It can be found at http://blogdex.media.mit.edu/xml/index.asp.
- Metalinker should be working properly (and much faster as well)
I'll be working today and into the night as well. At the top of this list:
- The Social Network Explorer needs to be recoded
- Search functionality seems to be working great; I just need to build an interface
- Paging on all lists (right now every page is constrained to at most 50 results)
Welcome to the new Blogdex!
So after quite a few late nights I've decided to put up the new version of Blogdex. I hate to release partially complete and buggy code, but I've been lazy about doing the frontend and this is a way to force me to finish things. If you notice any quirky behavior, or have suggestions, please post them here.
I'll do my best to outline all of the changes and things to come:
Done
Back-end:
I spent quite a bit of time over the past few weeks designing and iterating on the back-end. Under the pressure of scaling issues, I've resolved pretty much all of the outstanding storage and memory constraints so that all of the old functionality will come back. As far as I can tell, the back end is elegant and efficient, and shouldn't need much work for the next 10 years :)
Complete database redesign, reducing overall data size from 1.5GB to just around 300MB
New data: history for all previous top 50 pages, an index of all blog-related sites, and an index of social weather for the past year
A data agent architecture, allowing for simultaneous crawling, parsing, and representation updates, which also makes adding new structure quick and easy. As of right now there are 8 agents, including a weblogs.com update crawler, a link parser, a few representational updaters, and a title crawler. This has been more or less in place for a few weeks, and it's running very smoothly. Right now Blogdex trails Weblogs.com in recency by about 5 minutes.
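As a rough illustration of what an agent pipeline like this can look like (a sketch only, not the actual Blogdex code), the example below runs three agents at the same time and connects them with queues. The agent names, the `STOP` sentinel, the regex-based link extraction, and the in-memory `sources` dictionary are all assumptions made for the example; the real agents crawl weblogs.com updates and write to a database.

```python
import queue
import re
import threading

STOP = object()  # sentinel that tells a downstream agent to shut down


def crawler_agent(pages, out_q):
    """Stand-in for an update crawler: emits (weblog, html) pairs."""
    for weblog, html in pages.items():
        out_q.put((weblog, html))
    out_q.put(STOP)


def parser_agent(in_q, out_q):
    """Link parser: pulls href targets out of each crawled page."""
    while True:
        job = in_q.get()
        if job is STOP:
            out_q.put(STOP)  # pass the shutdown signal along
            return
        weblog, html = job
        for url in re.findall(r'href="([^"]+)"', html):
            out_q.put((url, weblog))


def updater_agent(in_q, sources):
    """Representation updater: records which weblogs cite each URL."""
    while True:
        job = in_q.get()
        if job is STOP:
            return
        url, weblog = job
        sources.setdefault(url, set()).add(weblog)


def run_agents(pages):
    """Run the three agents concurrently and return the url -> weblogs map."""
    crawled, parsed, sources = queue.Queue(), queue.Queue(), {}
    agents = [
        threading.Thread(target=crawler_agent, args=(pages, crawled)),
        threading.Thread(target=parser_agent, args=(crawled, parsed)),
        threading.Thread(target=updater_agent, args=(parsed, sources)),
    ]
    for agent in agents:
        agent.start()
    for agent in agents:
        agent.join()
    return sources
```

Because each agent only reads from one queue and writes to the next, adding a new kind of agent (say, a title crawler) would mostly mean adding another queue consumer rather than touching the existing ones.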
Front-end:
New design (ducking for cover)
Templated HTML construction, which makes my life much easier
A new terser top 50 page, with more information hidden in the mouseover of the link
To do
I should take care of a big chunk of this tonight; some of it is mission-critical. These are in relative order of importance:
1. Weblog add page and the RSS 0.91/1.0 pages aren't fully functional. Eek!
2. Search. I was kind of wary of putting this up at all without any search capabilities. I know from my own interests that people won't be happy for long not being able to find themselves in the database.
3. History. The "a year ago today" is a small glimpse into something I am really excited about. I now have archives of every time a site was in the top 50, which gives quite a bit of context to the project.
4. Charts. Thanks to a generous gift from the people at Chart Director, I have a nice, simple graphics package for charting link behavior over time.
5. Top blogs: now that I have an index of all pages related to a given weblog, it's much easier to generate the popularity contest to end all popularity contests. I'm sure this would be priority one if the users were in control :)
6. Social weather: the data exists now to give a social weather "forecast" based on various statistics around weblog linking behavior (how many links, how convergent, how many are to blog sites, etc.).
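To make the social weather idea a little more concrete, here is a minimal sketch of how such statistics could be computed from a day's worth of observed links. The post does not define the measures, so the definitions here are assumptions: "convergence" is taken to be the share of link mentions that point at a URL cited by more than one weblog, and `blog_hosts` is a hypothetical set of hostnames known to be weblogs.

```python
from collections import Counter
from urllib.parse import urlparse


def social_weather(links, blog_hosts):
    """Summarize linking behavior for one period.

    links: iterable of (weblog, url) pairs observed in the period.
    blog_hosts: set of hostnames treated as weblogs (hypothetical input).
    """
    pairs = set(links)  # count each weblog/url pair only once
    total = len(pairs)
    if total == 0:
        return {"links": 0, "convergence": 0.0, "blog_share": 0.0}

    # How many distinct weblogs cite each URL.
    spread = Counter(url for _, url in pairs)

    # Assumed definition: mentions of URLs that more than one weblog cites.
    shared = sum(1 for _, url in pairs if spread[url] > 1)

    # Mentions whose target host is itself a known weblog.
    to_blogs = sum(1 for _, url in pairs if urlparse(url).hostname in blog_hosts)

    return {
        "links": total,
        "convergence": shared / total,
        "blog_share": to_blogs / total,
    }
```

Tracking these three numbers per day would give a simple time series from which a "forecast" could be read off, though which statistics actually make a good forecast is an open question.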
I'm also not content with a lot of the design of the front end, so expect some major tweaking there. Please let me know if you have any suggestions.
Now I'm going to sleep. G'night!
As Anil just pointed out to me, it appears as though Blogger has been hacked. As far as I can tell, this is a very serious situation, as noted by the grave message on plasticbag. Change your server passwords immediately!
With a lot of weblogs wearing their referrers on their sleeves, the Mastodonte Advertising campaign could prove to be a big stumbling block for blogdex. Granted, it remains to be seen whether or not someone will pay $1k for the service, but their initial spam alone shot up blogdex in a matter of hours.
I would imagine that over time, as referrer logs get more clogged with distracting messages (and don't think that Mastodonte is the first group to use this technique), there will be less of an incentive to post your referrers on your front page, and this problem will correct itself. But until then, any suggestions on how to solve this conundrum would be appreciated.
With a few more days to go before our annual research drive here at the Media Lab, I'm nearly done restructuring the backend. There might be some intermittent delays over the weekend while the final touches are put in place and the data set is cleaned up.
The statistics for today should be back up to par, and I apologize for the funny results this morning, which were the result of yet another bug I introduced.
Here's a look at what I'm working on for next week:
1. New design, which will inevitably cause some backlash :)
2. Integrated link statistics, with more emphasis on diffusion (i.e. where a link started, where it went next, and so on), complete with charts (yay graphics!)
3. A social weather index
4. User surveys
5. Historical index of blogdex ("a year ago today...").
Believe you me, all of this stuff really is in the works, and hopefully it will all be done by Monday. I'll try to put it up in bits and pieces so that I can fend off the critiques in waves.
Thanks to a note from a blogdex user, the XML syndication should be working again. I had introduced a bug where link id numbers were not being put into the href.
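For readers curious what this kind of bug looks like, here is a small sketch (not the actual Blogdex ASP code) that builds an RSS 0.91-style feed and refuses to emit an item whose link id is missing. The `detail_url` template, the function name, and the channel values are assumptions for the example, and the output is a bare skeleton that has not been checked against the full RSS 0.91 spec.

```python
import xml.etree.ElementTree as ET


def build_rss091(title, site_url, description, items, detail_url):
    """Build a minimal RSS 0.91-style document.

    items: list of (link_id, item_title) pairs.
    detail_url: hypothetical URL template containing "{id}".
    """
    rss = ET.Element("rss", version="0.91")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = description
    ET.SubElement(channel, "language").text = "en-us"

    for link_id, item_title in items:
        # Guard against the failure mode described above: an item
        # whose URL is emitted without its link id.
        if link_id is None:
            raise ValueError("missing link id for item: %r" % item_title)
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = detail_url.format(id=link_id)

    return ET.tostring(rss, encoding="unicode")
```

Failing loudly on a missing id is one design choice; skipping the item and logging it would keep the feed up at the cost of silently dropping entries.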
Due to some restructuring, the page linkInfo.asp, which was provided for each link (with the text "more info"), has been generating some inefficient database queries. I've taken the page down temporarily to alleviate the stress on the database, but plan on returning the functionality.
In the future, this page and the sources page will be integrated into one information-rich detail page, so the data will be unavailable in the meantime, but back soon enough (my self-imposed deadline is Friday).
I've been grappling with data issues for the past couple of days, and for some reason was overlooking the source of the problem. The following is a plot of the spread of URLs in the blogosphere. The x-axis refers to the number of times a particular URL has been spotted (i.e. the number of weblogs it has been found on) and the y-axis is the number of URLs in the database for that spread.

The plot continues exponentially decreasing*. The number of URLs observed on only one site is 2,224,585, while those found on at least two sites number 252,437, and three, 71,022. Without a doubt, a majority of my data (84% to be exact) lies in the form of URLs that never spread. Simply archiving these URLs that are never mentioned should reduce my data set to a tractable range. That's what I'm going to do tonight :) Hopefully then, all of the features people have come to love in blogdex will return, at least until I run up against this problem again 10 years from now**.
* for those with statistical interest, the link counts plotted on a log-log graph show a pretty convincing power-law distribution, as I have mentioned before
** I ran up against scalability issues one year after blogdex's inception, and judging by the distribution of links, this shift should reduce my dataset by an order of magnitude, meaning I should be running into the same constraints 10 years from today. With any luck, I won't still be in graduate school at that point in time.
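The power-law claim can be sanity-checked against the three counts quoted above. The sketch below fits a straight line to the points on a log-log scale, where a power law shows up as a line whose slope is the exponent. It assumes the figures are the numbers of URLs seen on exactly one, two, and three weblogs (the post says "at least two" for the second figure, so this is an interpretation), and three points are far too few for a rigorous fit. It also backs out the total URL count implied by the 84% figure.

```python
import math

# Counts quoted in the post, keyed by number of weblogs a URL appeared on.
# Assumption: each figure is for exactly that many weblogs.
counts = {1: 2_224_585, 2: 252_437, 3: 71_022}


def loglog_slope(points):
    """Least-squares slope of log(count) against log(spread)."""
    xs = [math.log(k) for k in points]
    ys = [math.log(v) for v in points.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


slope = loglog_slope(counts)

# If single-site URLs are 84% of the data, this is the implied total.
implied_total = counts[1] / 0.84

print(f"log-log slope: {slope:.2f}")
print(f"implied total URLs: {implied_total:,.0f}")
```

On these three points the slope comes out at around -3, and the drop from one weblog to two has nearly the same log-log slope as the drop from two to three, which is what a power law would predict.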
I've moved the Blogdex weblog over to MovableType, just to be more consistent within my life. I'll get around to playing with the design when I start work on redesigning the entire site (sometime this week).
On Thursday evening, one of the fundamental Blogdex database tables acquired a glitch in its index. After spending 1 day and 4 hours repairing the index, I figured this would be a good time to move the database to a Linux box, which should increase the performance considerably (since I'm using MySQL, which tends to be faster on Linux). 1 day and 12 hours later, the database was transferred successfully.
What does this all point to? Don't mess with the databases :) At its current size, doing any reindexing takes an absurdly long time. In the future I'll make sure that I'm running a copy of the database in the meantime. I never expected that data management would be an issue, since all I'm storing is URLs. But after a year and some change, those URLs add up, and now with some 4 million rows, the database is about 600MB.
I'm thinking about data storage now. More updates to come.
I've been spending most of my time hitting the books lately, but for the next couple of weeks I'm hot on the trail of blogdex revisions.
Today I implemented quite a bit of restructuring, namely moving everything over to a multi-agent architecture (from a disorderly set of Perl scripts that ran irregularly). What does this mean? Well, updates now happen on a much more regular basis. The statistics should evolve over the day as websites ping weblogs.com and random blogs are crawled.
More coming tomorrow.. hopefully adding a few crucial features (such as search and more indexes). Oh, and finally another redesign..


