More Updates

I've been spending most of my time reprogramming old interfaces to fit the new architecture. Here's a list of the changes that have taken place over the past 24 hours:

- The weblog add/verification system is working again

- The RSS 0.91 version of the top 50 is up. It can be found at http://blogdex.media.mit.edu/xml/index.asp.

- Metalinker should be working properly (and much faster as well)

I'll be working today and into the night as well. At the top of this list:

- The Social Network Explorer needs to be recoded

- Search functionality seems to be working great; I just need to build an interface

- Paging on all lists (right now every page is constrained to at most 50 results)

Posted by cameron on October 30, 2002 at 02:10 PM
Blogdex 2.0

Welcome to the new Blogdex!

So after quite a few late nights I've decided to put up the new version of Blogdex. I hate to release partially complete and buggy code, but I've been lazy about doing the frontend and this is a way to force me to finish things. If you notice any quirky behavior, or have suggestions, please post them here.

I'll do my best to outline all of the changes and things to come:

Done

Back-end:

I spent quite a bit of time over the past few weeks designing and iterating on the back-end. Under the pressure of scaling issues, I've resolved pretty much all of the outstanding storage and memory constraints so that all of the old functionality will come back. As far as I can tell, the back end is elegant and efficient, and shouldn't need much work for the next 10 years :)

Complete database redesign, reducing overall data size from 1.5GB to just around 300MB

New data: history for all previous top 50 pages, an index of all blog-related sites, and an index of social weather for the past year

A data agent architecture, allowing for simultaneous crawling, parsing, and representation updates, which also makes adding new structure quick and easy. As of right now there are 8 agents including a weblogs.com update crawler, link parser, a few representational updaters, and a title crawler. This has been more or less in place for a few weeks, and it's running very smoothly. Right now Blogdex trails Weblogs.com in recency by about 5 minutes.
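To make the agent idea concrete, here is a minimal sketch of three cooperating agents (a crawler, a link parser, and a representation updater) connected by queues so they can run at the same time. It is not the Blogdex code: the agent names, the regex-based link extraction, and the in-memory "representation" are all assumptions chosen for illustration.

```python
import queue
import re
import threading

STOP = object()  # sentinel that tells a downstream agent to shut down
HREF_RE = re.compile(r'href="([^"]+)"')  # simplistic link extraction (assumption)


def crawler_agent(pages, out_q):
    """Stand-in crawler: emits (blog_url, html) pairs, then signals completion."""
    for blog_url, html in pages:
        out_q.put((blog_url, html))
    out_q.put(STOP)


def parser_agent(in_q, out_q):
    """Pulls pages off the queue and forwards one (blog_url, link) pair per link."""
    while True:
        item = in_q.get()
        if item is STOP:
            out_q.put(STOP)
            break
        blog_url, html = item
        for link in HREF_RE.findall(html):
            out_q.put((blog_url, link))


def updater_agent(in_q, sources, lock):
    """Maintains a derived representation: link -> set of blogs citing it."""
    while True:
        item = in_q.get()
        if item is STOP:
            break
        blog_url, link = item
        with lock:
            sources.setdefault(link, set()).add(blog_url)


def run_pipeline(pages):
    """Starts all agents concurrently and returns the finished representation."""
    raw_q, link_q = queue.Queue(), queue.Queue()
    sources, lock = {}, threading.Lock()
    agents = [
        threading.Thread(target=crawler_agent, args=(pages, raw_q)),
        threading.Thread(target=parser_agent, args=(raw_q, link_q)),
        threading.Thread(target=updater_agent, args=(link_q, sources, lock)),
    ]
    for agent in agents:
        agent.start()
    for agent in agents:
        agent.join()
    return sources
```

Because each agent only talks to its neighbours through a queue, adding a new agent (a title crawler, say) would mean adding one more function and one more queue rather than touching the existing ones.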

Front-end:

New design (ducking for cover)

Templated HTML construction, which makes my life much easier

A new terser top 50 page, with more information hidden in the mouseover of the link

To do

I should take care of a big chunk of this tonight, since some of it is mission-critical. These are in relative order of importance:

1. Weblog add page and the RSS 0.91/1.0 pages aren't fully functional. Eek!

2. Search. I was kind of wary of putting this up at all without any search capabilities. I know from my own interests that people won't be happy for long not being able to find themselves in the database.

3. History. The "a year ago today" is a small glimpse into something I am really excited about. I now have archives of every time a site was in the top 50, which gives quite a bit of context to the project.

4. Charts. Thanks to a generous gift from the people at Chart Director, I have a nice, simple graphics package for charting link behavior over time.

5. Top blogs: now that I have an index of all pages related to a given weblog, it's much easier to generate the popularity contest to end all popularity contests. I'm sure this would be priority one if the users were in control :)

6. Social weather: the data exists now to give a social weather "forecast" based on various statistics around weblog linking behavior (how many links, how convergent, how many are to blog sites, etc.).

I'm also not content with a lot of the design of the front end, so expect some major tweaking there. Please let me know if you have any suggestions.

Now I'm going to sleep. G'night!

Posted by cameron on October 29, 2002 at 09:26 AM
Blogger hacked?

As Anil just pointed out to me, it appears as though Blogger has been hacked. As far as I can tell, this is a very serious situation, as noted by the grave message on plasticbag. Change your server passwords immediately!

Posted by cameron on October 25, 2002 at 11:36 AM
Mastodonte Advertising

With a lot of weblogs wearing their referrers on their sleeves, the Mastodonte Advertising campaign could prove to be a big stumbling block for blogdex. Granted, it remains to be seen whether anyone will pay $1k for the service, but their initial spam alone shot up blogdex in a matter of hours.

I would imagine that over time, as referrer logs get more clogged with distracting messages (and don't think that Mastodonte is the first group to use this technique), there will be less of an incentive to post your referrers on your front page, and this problem will correct itself. But until then, any suggestions on how to solve this conundrum would be appreciated.

Posted by cameron on October 25, 2002 at 11:32 AM
Restructuring almost complete

With a few more days to go before our annual research drive here at the Media Lab, I'm nearly done restructuring the backend. There might be some intermittent delays over the weekend when the final touches are instituted and the data set is cleaned up.

The statistics for today should be back up to par, and I apologize for the funny results this morning, which were the result of yet another bug I introduced.

Here's a look at what I'm working on for next week:

1. New design, which will inevitably cause some backlash :)

2. Integrated link statistics, with more emphasis on diffusion (i.e. where a link started, where it went next, and so on), complete with charts (yay graphics!)

3. A social weather index

4. User surveys

5. Historical index of blogdex ("a year ago today...").

Believe you me, all of this stuff really is in the works, and hopefully it will all be done by Monday. I'll try to put it up in bits and pieces so that I can fend off the critiques in waves.

Posted by cameron on October 10, 2002 at 03:04 PM
RSS Syndication Fixed

Thanks to a note from a blogdex user, the XML syndication should be working again. I had introduced a bug where link id numbers were not being put into the href.
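One way to guard against this kind of regression is to build each feed item through a single function that refuses to emit a link without its id. The sketch below is an assumption-heavy illustration, not the actual Blogdex feed code: the URL template is hypothetical (the post does not give the real link URL format), and it writes the URL into an RSS `<link>` element rather than an HTML href.

```python
import xml.etree.ElementTree as ET

# Hypothetical URL pattern; the real Blogdex link URLs are not given in the post.
LINK_URL_TEMPLATE = "http://example.org/link?id={link_id}"


def build_item(link_id, title):
    """Build one RSS <item>, refusing to emit a link that lacks its id."""
    if link_id is None:
        raise ValueError("link_id is required to build the item URL")
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = LINK_URL_TEMPLATE.format(link_id=link_id)
    return item
```

Failing loudly on a missing id means a broken feed shows up as an error on the server instead of as a silently malformed link in someone's aggregator.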

Posted by cameron on October 10, 2002 at 02:40 PM
linkInfo temporarily down

Due to some restructuring, the page linkInfo.asp which was provided for each link (with the text "more info") has created some inefficient database queries. I've taken the page down temporarily to alleviate the stress on the database, but plan on returning the functionality.

In the future, this page and the sources page will be integrated into one information-rich detail page. The data will be unavailable in the meantime, but it should be back soon enough (my self-imposed deadline is Friday).

Posted by cameron on October 08, 2002 at 11:26 PM
More on scalability

I've been grappling with data issues for the past couple of days, and for some reason was overlooking the source of the problem. The following is a plot of the spread of URLs in the blogosphere. The x-axis refers to the number of times a particular URL has been spotted (i.e. the number of weblogs it has been found on) and the y-axis is the number of URLs in the database for that spread.

The plot continues to decrease steeply*. The number of URLs observed on only one site is 2,224,585, while those found on at least two sites number 252,437, and three, 71,022. Without a doubt, a majority of my data (84% to be exact) lies in the form of URLs that never spread. Simply archiving these URLs that are never mentioned a second time should reduce my data set to a tractable range. That's what I'm going to do tonight :) Hopefully then, all of the features people have come to love in blogdex will return, at least until I run up against this problem again 10 years from now**.

* for those with statistical interest, the link counts plotted on a log-log graph show a pretty convincing power-law distribution, as I have mentioned before
** I ran up against scalability issues one year after blogdex's inception, and judging by the distribution of links, this shift should reduce my dataset by an order of magnitude, meaning I should be running into the same constraints 10 years from today. With any luck, I won't still be in graduate school at that point in time.
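The arithmetic behind these claims can be sketched in a few lines. The functions below compute the share of URLs seen on a single weblog and estimate a power-law exponent as the least-squares slope on a log-log scale. The input treats the three counts quoted above as exact spreads of 1, 2, and 3, which is an assumption (the post says "at least two"), and the rest of the distribution is not given.

```python
import math


def singleton_share(hist):
    """Fraction of URLs seen on exactly one weblog; hist maps spread -> URL count."""
    total = sum(hist.values())
    return hist.get(1, 0) / total


def power_law_exponent(hist):
    """Least-squares slope of log(count) against log(spread)."""
    points = [(math.log(k), math.log(n)) for k, n in hist.items() if k > 0 and n > 0]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Counts quoted in the post, read here as exact spreads of 1, 2, and 3 (an assumption).
observed = {1: 2_224_585, 2: 252_437, 3: 71_022}
print(f"singleton share: {singleton_share(observed):.1%}")
print(f"log-log slope:   {power_law_exponent(observed):.2f}")
```

Because the tail beyond three sites is missing from this input, the share it prints is higher than the 84% reported above, and a slope fitted to three points is only a rough indication of the exponent.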

Posted by cameron on October 07, 2002 at 05:02 PM
Move to MT

I've moved the Blogdex weblog over to MovableType, just to be more consistent within my life. I'll get around to playing with the design when I start work on redesigning the entire site (sometime this week).

Posted by cameron on October 06, 2002 at 09:30 PM
Blogdex finally back up

On Thursday evening, one of the fundamental Blogdex database tables acquired a glitch in its index. After spending 1 day and 4 hours repairing the index, I figured this would be a good time to move the database to a Linux box, which should increase the performance considerably (since I'm using MySQL, which tends to be faster on Linux). 1 day and 12 hours later, the database was transferred successfully.

What does this all point to? Don't mess with the databases :) At its current size, doing any reindexing takes an absurdly long time. In the future I'll make sure that I'm running a copy of the database in the meantime. I never expected that data management would be an issue, since all I'm storing is URLs. But after a year and some change, those URLs add up.. and now with 4 million some odd rows, the database is about 600MB.

I'm thinking about data storage now. More updates to come.

Posted by cameron on October 06, 2002 at 05:23 PM
back on the trail

I've been spending most of my time hitting the books lately, but for the next couple of weeks I'm hot on the trail of blogdex revisions.

Today I implemented quite a bit of restructuring, namely moving everything over to a multi-agent architecture (from a disorderly set of Perl scripts that were run irregularly). What does this mean? Well, updates now happen on a much more regular basis. The statistics should evolve over the day as websites ping weblogs.com and random blogs are crawled.

More coming tomorrow.. hopefully adding a few crucial features (such as search and more indexes). Oh, and finally another redesign..

Posted by cameron on October 02, 2002 at 08:12 PM