new press
timeliness and recency
more features! more features! (ok, coming)
doubles begone!
userland sites added!
userland weblogs
backlogged (oh no)
new data server
fluctuations in the number of sites
the media lab has been experiencing some major network problems over the past 24 hours, which made blogdex slow to load and kept the crawler from doing its job last night. this has something to do with the fact that the lab maintains more networking hardware from more vendors than an isp. thank god it's not my job to make sure everything is up and running.
network services here have told me that they've identified the problem and should have it resolved shortly. in the meantime, i'm planning on moving the machine to a new location while things are bad, so the currently shoddy service might turn into currently non-existent service for a few minutes. after that, you shouldn't notice a difference.
sorry for all of the confusion. i was just ready to roll out a few new features when i was hit with this yesterday. now i have a migraine instead.
a couple of articles were published yesterday about blogdex:
one in usa today generally about weblogs which mentioned blogdex. i think that it's a pretty good introduction to weblogs which will hopefully spur some more mass media coverage. it was written by janet kornblum who runs a column called ebriefing.
another in search day profiled blogdex in the context of the search engine community. i don't typically think of blogdex as a search engine per se, so it's interesting to see it covered from that perspective. the author is chris sherman, an associate editor at search engine watch.
i know that some people are not happy with the amount of publicity that blogdex has gotten, since it's only a research project. at the risk of bothering those people more, i might make a page detailing both the mass media and weblog coverage, since i think the juxtaposition is interesting.
for those of you that looked at blogdex over the weekend, i was very impressed with the timeliness provided for the breaking news of this cnn story about aaliyah's fatal plane crash. considering that the story was posted at 1am EDT, and blogdex starts its crawl at 3am EDT, i find it pretty amazing that the blogger community linked to this story quickly enough to make it number 4 on blogdex by 5am EDT. very alert indeed.
i'll be working hard this week to bring two new features to blogdex:
1. more description: lots of people have expressed an interest in looking at something other than urls. i'll be adding two levels of description, one being the actual title of pages that are linked to, and the other will be a contextual passage from each linker's site.
2. finally some search: i'd say that half of the traffic here is from people looking for their webpage among the many thousands indexed by blogdex. biz stone's blogdexter is a simple service built on my current system that does roughly this. i'll be adding a little bit more functionality, hopefully something that can bring you directly to the page in totals where your website is. i've been trying to figure this problem out for a while, and the answer has been evading me. there has to be a sneaky sql query to do exactly what i want.. i just need to find it.
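one way the "which page of totals am i on" lookup could be phrased, as a minimal sketch rather than the query blogdex ended up using: count how many sites sort ahead of yours, then divide by the page size. the `totals` table, its `url` and `links` columns, the tie-break on url, and `PAGE_SIZE` are all assumptions made up for illustration.

```python
import sqlite3

# hypothetical schema: one row per indexed site with its all-time link total
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE totals (url TEXT PRIMARY KEY, links INTEGER)")
conn.executemany(
    "INSERT INTO totals VALUES (?, ?)",
    [
        ("http://a.example/", 40),
        ("http://b.example/", 25),
        ("http://c.example/", 25),
        ("http://d.example/", 3),
    ],
)

PAGE_SIZE = 2  # assumed number of rows shown per page of the totals listing


def page_for(url):
    """Return the 1-based page of the totals listing holding `url`, or None."""
    me = conn.execute(
        "SELECT links FROM totals WHERE url = ?", (url,)
    ).fetchone()
    if me is None:
        return None
    # rows that sort ahead of this one (links descending, url ascending on ties)
    (ahead,) = conn.execute(
        "SELECT COUNT(*) FROM totals "
        "WHERE links > ? OR (links = ? AND url < ?)",
        (me[0], me[0], url),
    ).fetchone()
    return ahead // PAGE_SIZE + 1
```

the count-the-rows-ahead trick works on databases without ranking functions, at the cost of a second query per lookup.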
yesterday i realized that all of my trials and tribulations looking for duplicate sites come down to a very easily solved problem. all that needed to be done was to calculate a checksum for each site, and compare them to each other. i installed the md5 algorithm into the crawling process, and this morning, lo and behold, there were 212 sites with duplicate checksums!
this means that in the long run, doing site security will be much simpler than it currently is. instead of manually picking through each new site, i should be able to detect duplicates automatically. now all i need is a "blog detector" and i can take a vacation.
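a minimal sketch of the checksum comparison, in python rather than whatever the crawler itself is written in. the input shape (a dict of url to raw page bytes) and the function names are assumptions for illustration.

```python
import hashlib
from collections import defaultdict


def checksum(page_bytes):
    """MD5 hex digest of a page's raw bytes."""
    return hashlib.md5(page_bytes).hexdigest()


def find_duplicates(pages):
    """Group urls whose page content is byte-for-byte identical.

    `pages` maps url -> raw page bytes (a hypothetical input shape).
    Returns a list of url groups, each group sorted alphabetically.
    """
    by_sum = defaultdict(list)
    for url, body in pages.items():
        by_sum[checksum(body)].append(url)
    return [sorted(urls) for urls in by_sum.values() if len(urls) > 1]
```

note that an exact checksum only catches pages that are identical down to the byte, so two copies of a site that differ by a timestamp or a rotating ad would slip through.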
after communications with the userland webmaster, the crawler of blogdex (blogbot) is now allowed to crawl userland sites daily.
however, this introduced a bug in the statistics (note that manila.userland.com is #1 today with some 80 points). this is because these sites have never been crawled, as blogbot has been banned since the beginning of the project. this problem can be easily fixed, so the stats will be adjusted as soon as i get to work.
by the time you read this, hopefully it won't make any sense.
for those that read wesly felter's post on his weblog hack the planet, he raised the issue of his weblog not being included in the count here at blogdex. i wasn't aware of the problem, but i pursued it as soon as i read his mention.
the problem resulted from my interaction with the servers at userland. apparently, i've been hitting their machines too rapidly/too much, and triggering some automatic response. i've emailed them, and hope to resolve things soon.
i've been a very very very bad administrator.
somehow i've let the tally of sites-added-but-not-yet-checked get up above 500 (usually it hovers around 100). i really need to engineer a more robust and automatic system for adding new sites.
but in the meantime, i've got to pound on the list--i'm making it my task today to get through as many as possible. if you've been waiting to be crawled for quite some time now, never fear. i'll get to you. as long as my will holds up, people's voices will be counted.
i've just moved the data server to a much speedier box--what was a pIII with 128 mb ram is now a dual pIII w/ 512 mb ram. this should bring the speed up to a level at which i'm comfortable adding new features. plus the old features should be faster and more fun. definitely more fun.
you may have noticed a large fluctuation in the number of total sites indexed, namely a huge jump last week from around 8,500 to 13,000. this resulted from a re-adding of sites that were marked as 'offline.' every time the crawler receives an error related to a site, that site is marked as temporarily offline.
i had forgotten to re-crawl these sites since, well, the beginning of the project (oops). surprise, surprise, when i crawled them, most were actually still up. this just goes to show how tenuous web data is. i've started checking the offline sites every two days or so, which should stabilize the number of sites.
due to a bug in my crawler, the all-time totals were not being updated as new links were added. they have been corrected, and now the source statistics and the all-time totals should match.. to the link.
today on the front of blogdex, i've introduced a new index. temporarily, it will act as the standard index, but soon i should have the old one up and running again.
two effects in the former listings prompted me to do some development:
1. sampling over a 24 hour period tended to lose sites that may have been popular, but not common enough to break into the range of 3 links/day.
2. "common," or all-time top links are not being filtered out by the current system. this is because only the top 100 are being looked at, an algorithm which doesn't scale and isn't good at all (quick hacks don't always work best).
to solve these issues, the new mass index looks over a longer period (6 days) and considers the percentage of each link that occurs in the given time period. for instance, if 95% of google.com's links occur before the past week, it shouldn't be anywhere near the top of the list.
of course memes will tend to linger if this model is used. if a meme really takes off, it could be on the top of the chart for over a week. to account for this, i am weighting today more than yesterday, yesterday more than the day before, and so on.
that's about it: more days, weighting the recency within that timeframe, and looking at the history of the link outside that timeframe. i'm tired, and it's late, so i'll forgo a more formal analysis for the time being. i'll make sure and put up real live mathematical equations and stuff later.
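until the real equations go up, here is one possible reading of those three ingredients as a sketch. the post does not give the actual formula, so the linear day weights and the choice to multiply by the recent fraction are assumptions, not the mass index itself.

```python
WINDOW_DAYS = 6  # the window length mentioned in the post


def mass_score(daily_counts, all_time_total):
    """Score one link from its recent and all-time link counts.

    daily_counts[0] is today's new links, daily_counts[1] yesterday's, and
    so on, one entry per day in the window. all_time_total counts every
    link ever seen to the url, including those inside the window.
    """
    if len(daily_counts) != WINDOW_DAYS:
        raise ValueError("expected one count per day in the window")
    recent = sum(daily_counts)
    if all_time_total <= 0 or recent == 0:
        return 0.0
    # assumed linear decay: today weighs 6, yesterday 5, ... oldest day 1
    weighted = sum(
        (WINDOW_DAYS - age) * count for age, count in enumerate(daily_counts)
    )
    # share of this url's links that fall inside the window
    recent_fraction = recent / all_time_total
    return weighted * recent_fraction
```

with these assumptions, a url with 9 links that all arrived this week scores far above a google.com-like url with the same 9 recent links out of 180 all-time, since only 5% of the latter's links are recent.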
good night.. enjoy the links.
wacky brit rolled out his own meta blog tool yesterday, which is crawling some 1000 blogger sites. blogdex is a simple idea, and a simple system, so i'm sure that he built it in pretty much the same fashion as i did.
my interest is sparked not by the system itself, but by what was mentioned about it on tim thompson's site -- that this new blogdex-like system will have the source released as soon as it is finished.
at the risk of sounding defensive, i'm not sure that this is such a good idea. instead of open-sourcing the system, what needs to happen is an open-sourcing of the content. i personally would release my code if anyone were interested in it, though the whole thing could be explained easily in a few sentences. however, if everyone started running their own crawlers to produce the same content as mine, we would run up against a problem with the bandwidth consumed by these spiders.
all of this replication is unnecessary if all we are interested in is the same content. originally i had conceived of blogdex as a sort of "newswire" for blogger-related content. if one source were finding all of this content, it could be provided to any number of systems that wanted to use it -- and the network traffic required to find it would be minimal.
in short, i'm working towards being able to provide the content to anyone that wants to work with it (including the wacky brit). as soon as i have some fun playing with different sorts of indexes, i'll begin working on that task. if anyone else is interested in using this content, let me know and i'll tell you when it's available.
as you may have noticed, there has been an influx of links to brazilian sites over the past few days. this comes from a surge in the number of brazilian blogs, thanks to some strange, unexpected popularity there.
this raises the question: what is the international future of blogdex?
i'm currently thinking about the problem, and the solution will probably involve filtering which is partially automatic and partially human-aided. given groups of blogs in different languages, links could be organized into sets based on their originating blog, or by the destination site. of course, i would keep a pointer to the entire international set.
here is a simple but elegant solution i might incorporate: a python language detector.
i'm sorry to report that my trusty server, aka the blogdex machine (flux-capacitor.media.mit.edu), was compromised sometime over the past couple of weeks. i'm not sure exactly how it happened, but it needs to be re-installed. the services have been moved over to a new machine which should be up and running by the time you read this.
if you encounter any inconsistencies in the site, please let me know. the data is served off of another machine, so none of the content was affected in any way. and now back to your regularly scheduled blogdex.
in the ongoing battle of cameron vs. the blogdex machine, i'm making some headway. here's what's in store for this week:
currently, i'm expanding the interface for browsing an index, where an index is an arbitrary ranking of the links. right now i have explicit interfaces through browseTotal.asp, browsePopular.asp, and browseSource.asp. soon i should have something more general so that i can experiment with different kinds of indexes.
two indexes that i've been toying with lately are mass and outward. the latter i mentioned earlier, and the former is aimed at popularity as a function of time. instead of looking at who is in the spotlight in the past 24 hours, i want to look at which links have momentum, namely those that have a majority of their links in the past few days. preliminary results show that this heuristic is better for pulling out memes, whereas the current one is better for locating timely news stories.
some have relayed the concern that the company 'eccentrix.com' has somehow figured out the blogdex algorithm and pushed themselves to the top of the list:
http://blogdex.media.mit.edu/browseSource.asp?url=http://www.eccentrix.com/
which appeared as number 6 on today's list. looking at the sources for this url, it should be noted that all of those crawled yesterday came from angelfire.com. their web servers do not return a standard 404 error page, but instead a regular html page with advertisements, one of which is for eccentrix.com (which i believe is their parent company). anyway, to make a long story short, blogdex was not hacked, it just picked up false links from angelfire.com websites that no longer exist. i'll repair the problem sometime over the weekend.
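one common way to catch this kind of "soft 404" is to ask the same host for a url that cannot exist and compare that response with the suspect page. the sketch below covers only the comparison step and is not necessarily the repair that was made. the function name and the 0.9 threshold are assumptions, and fetching the two pages is left out.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.9  # assumed cutoff, would need tuning on real pages


def looks_like_soft_404(page_body, probe_body):
    """True if `page_body` closely matches the host's reply to a junk url.

    `probe_body` is the html the same host returned for a deliberately
    nonexistent path. a near match suggests `page_body` is an error page
    served with a normal status code, so its links should be ignored.
    """
    matcher = SequenceMatcher(None, page_body, probe_body, autojunk=False)
    return matcher.ratio() >= SIMILARITY_THRESHOLD
```

a similarity ratio is used instead of an exact checksum because the advertisements on such error pages may rotate between requests.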
the crawl was ineffective last night, and no new links were posted into the database. the bug has been fixed, and i'm crawling again right now. today's statistics should be up around 5pm EDT.
it would have been fixed earlier, except that i'm on a night schedule, and was fast asleep.
ok, ok, ok. after reading thousands of weblogs over the past couple of weeks, i'm finding it a bit hard to stomach that i myself do not have a weblog of any sort. i also think that using any other method to disseminate information about blogdex would be terribly hypocritical..
so here it is. raw, uncut information about blogdex. i've decided to use greymatter due to the constraints of the system (namely no ftp access, and a win2k machine), and because it's a good piece of software. i don't want anyone to think that this is the "official" weblog system for blogdex.. i'm em..oh nevermind. i don't think that i am political enough for this job. just don't read the decision one way or another.
as for real news, i've been working on fixing bugs for the past few days. every day after crawling, a number of duplicate sites, parsing bugs, and general database garbage needs to be taken care of. the list is getting shorter, so i should be adding a couple of new features shortly.
first on the development cycle is a separation between popular outward and inward links. outward links would be those that point to sites outside the blogger community, with inward being weblogs pointing to other weblogs. these usually take on very different purposes: the former are meant to comment on the rest of the web as a whole (news stories, memes, timely resources, etc.), and the latter are often to designate credit. separating the two will make the individual lists more homogeneous.
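the inward/outward split could be sketched as a lookup of each link's host against the set of hosts known to be weblogs. the function names and the host-level matching are assumptions for illustration, not the blogdex implementation.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Lowercased host of `url`, with a leading 'www.' dropped."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def classify(link, weblog_hosts):
    """Return 'inward' if `link` points at a known weblog host, else 'outward'.

    `weblog_hosts` is a set of hosts (without 'www.') for crawled weblogs.
    """
    return "inward" if host_of(link) in weblog_hosts else "outward"
```

matching on host alone would misfile weblogs that live under a path on a shared host, so a real version would probably need to compare url prefixes as well.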
finally, for those that are still eager to figure out how many links blogdex has discovered to your site, as many clever people have uncovered, you can hack the following url which is already up:
http://blogdex.media.mit.edu/browseSource.asp?url=
putting the url of your site at the end. i'm only maintaining one version of each site's url, so you may need to try a couple before you find the version that i use. i will put up some kind of search mechanism soon to avoid this nagging hassle.
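until that search exists, the "try a couple" step can be automated by generating the usual spellings of a site's url and building one lookup address for each. this is a sketch: which variants blogdex actually stores is unknown, so the www and trailing-slash variations are guesses, and query strings and fragments are dropped.

```python
from urllib.parse import urlsplit, urlunsplit

# lookup address given in the post above
BROWSE_SOURCE = "http://blogdex.media.mit.edu/browseSource.asp?url="


def url_variants(url):
    """Common spellings of `url`: with/without 'www.' and trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    bare = host[4:] if host.startswith("www.") else host
    path = parts.path.rstrip("/")
    variants = []
    for candidate_host in (bare, "www." + bare):
        for candidate_path in (path + "/", path):
            variants.append(
                urlunsplit((parts.scheme, candidate_host, candidate_path, "", ""))
            )
    return variants


def lookup_urls(url):
    """One browseSource address per variant, to be tried in order."""
    return [BROWSE_SOURCE + variant for variant in url_variants(url)]
```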
that's all.


