titles cometh

anyone who has looked at the all-time index today may have noticed that i've added a few titles to the mix. i'm now crawling the lot of titles, but it will take some time before they are up to date.

Posted by cameron on September 28, 2001 at 04:46 PM
indices converge

this is the first time i've noticed a site with intersecting indices. the onion currently holds the number 9 spot in both the popular and overall indices. go onion.

Posted by cameron on September 27, 2001 at 07:33 PM
daypop top 40

as many may already know, dan chan has started up a service similar to blogdex called the daypop top 40. not surprisingly, the results are pretty similar to those at blogdex.

i have to admit, i knew that someone would come along and reproduce the results that i've had here at blogdex. people have been trolling for links for quite some time (the yale scoop index seems to have been taken down, which was to be the third link), and blogdex wasn't the first weblog indexing tool. it was a simple experiment into social networks which received considerable media attention.

without a doubt, i'm not going to be able to compete with a person or persons who have more time than i do to pursue a similar project, nor do i really like the idea of engaging in competition. this is the reason that i decided to push my advisor to allow blogdex to be open source. i figured that there are enough people in the community who would want to promote a project like this that it would self-propagate and avoid the typical drive to competition.

i'm not filling out any resignations, as blogdex is my current research project, and will continue to be for quite some time. i have enjoyed the publicity afforded by the project, but more so i've enjoyed the opportunity to get to know the weblog community, something i didn't expect would happen so quickly. i hope that blogdex will continue to provide some utility in people's lives, and that the more time i invest in it, the more people enjoy it. that's all.

Posted by cameron on September 26, 2001 at 01:01 AM
media lab open houses

twice a year here at the media lab, we entertain all of our corporate sponsors in a frenzy of talks, workshops and demos (it's so much fun that we have lots of nice affectionate terms for the week). since blogdex is my project, i'll be pushing things into overdrive for a couple of weeks to try and add new features and such. some of the things that i'm going to try and finish are the following:

1. site redesign: since starting the project, i haven't really put much time into the design or general aesthetic. it's starting to bug me, so i'm going to fix it.

2. link activity graphs: as shown in the previous post, i'll be adding a bit of visual stimulus to the process of analyzing a given link. more meme tracking tools.

3. link titles: a big one that i've been embarrassed about for a while, but totally necessary. i need to modify the crawler to go out nightly and crawl the linked-to sites in addition to the weblogs.

4. rss syndication: the most requested feature here, i need to figure rss out, and make it work for you all.

there might be some more sneaky surprises, but as for now, that's my laundry list. let me know if anything else strikes you as necessary, and i'll try to do it along the way.

Posted by cameron on September 25, 2001 at 08:36 PM
graphing link activity

i've been working on a simple visualization of link activity, written in php over the past few hours. i really wanted to have some sort of visual reference for the activity of a particular link. here is an example, looking at cnn.com over the past couple of months:

you can see in this image the power of such a simple visualization (er.. graph). as soon as i get php up and running on the blogdex webserver (tomorrow?) i'll add this to the linkinfo page. another interesting graph i stumbled across is for the snopes urban legends database, which peaks on september 13 (instead of the 12th), after rumors had some time to spread:

Posted by cameron on September 24, 2001 at 07:24 PM
blogdex news... syndicated?

i was not familiar with news is free until i noticed them linking to me in my logs. it's a news portal/syndication site that provides access to many traditional and non-traditional news sources. one of them happens to be this weblog in syndicated fashion.

to a certain extent, newsisfree is a realization of much of what i hoped blogdex would become when i first started creating it. it collects, organizes and connects people to the news that they want, including some personal news content. in using it though, i have come to the conclusion that weblogs and traditional news might not mix. it's not that they cover different subject matter, or even that the writing style/quality is divided. i simply realized that i go to weblogs for a different kind of news, one which is more personally meaningful.

weblogs provide a different sort of context than normal news does, linking events and web content to an individual's personal context. traditional media tries to create the most general context, in order to appeal to the widest audience possible. when i read a weblog of someone who is similar to me in some way, i get a piece of my own context in their writing. i think that this is analogous to the warm fuzzy feeling that i get when i read weblogs, and discover that there are thousands of people out there that think just like me.

in the wake of the tragedy of last week, i think i understand better why i felt strangely more attracted to weblog postings than big corporate news. in trying to create the largest possible context, news sources created an information source that didn't apply to anyone. the people of the weblog community wrote just for themselves and friends, and in doing so still appealed to me at least as a human being. big news seemed to miss that point in much of their reporting, which might be because reporters writing for syndicated news know that they're writing for the world.

what's the point of this rant? i'm not really sure. i'm just glad that i get that warm fuzzy feeling when i read weblogs. that's something i seldom get from even my local newspaper.

Posted by cameron on September 21, 2001 at 11:26 AM
network turbulence

grr. it appears that the media lab is encountering some severe network turbulence, both with our dns's and plain communication. the crawler is having some difficulty, so the stats for today are going to be off, regardless.

this is why i usually crawl at night. oh well.

Posted by cameron on September 20, 2001 at 11:39 AM
more about open source

i just realized that sourceforge does not allow you to add yourself to a project. once you're all signed up there, send me an email with your user name, and i'll add you to the project.

Posted by cameron on September 20, 2001 at 11:03 AM
crawler failure

i made a boo boo.

until yesterday, the security on the blogdex database was wide open. not that anyone would actually try and break in and muck around the data, but they could have. well, no more.

i forgot to update the code on the crawler machine, so it was locked out last night. i'll start it up as soon as i get to work, and it should be done by 1ish.

Posted by cameron on September 20, 2001 at 10:05 AM
blogdex open source

as i've told a few people, i'm in the process of making blogdex open source. i've already gotten the go-ahead from my advisor (which is the difficult part, considering all of the intellectual property constraints encountered in a university, and even more so in the media lab with hundreds of corporate sponsors). all of that garbage aside, you can find the open source home of blogdex at sourceforge:

http://sourceforge.net/projects/blogdex/

right now it's a vacant warehouse, not much really to start a community of developers with. but, people are aware of what blogdex is, and what it could be. if you're interested in working on the blogdex project, register at sourceforge if you haven't already and add yourself to the project. this way, we can begin to discuss what the next implementation of blogdex will feature.

right now, my code is in a bit of a sorry state, the effect of working on a project by yourself for a while. i'm going to start uploading it piece by piece to the repository, so people can download it and take a look at it. my goal is to have the entire project online by the end of october, if all goes well. the project code online will be set to interact with a separate blogdex server, other than the one that is always live, so that people can experiment with the data without bringing the entire system to a grinding halt. i'm hoping that the algorithms, techniques, and data used in blogdex will continue to be iterated upon by the community that uses it. then i can use the data from the system to write a kick-ass ph.d. dissertation :)

so sign up and we'll start talking. my sourceforge username is therac25.

Posted by cameron on September 18, 2001 at 11:47 PM
short web outage

the media lab instituted a new firewall for web access, and forgot to add blogdex to the list of allowed webservers. anyone trying to access blogdex from about 6pm last night until 9am this morning (including me) was met with a 'server not found' message. things have been resolved with networking, so everything should be back to normal, and the media lab is now harder to hit with silly web server viruses.

on another note, i've been having chronic headaches lately that have been making it hard to work. i had an mri on sunday which thankfully came up negative. now i enter the painstakingly long process of trying to diagnose the actual cause. if i'm slow in responding to feedback, this is probably the reason.

Posted by cameron on September 18, 2001 at 01:28 PM
link to yourself

a simple feature that i added to the system is the ability to link to yourself in any arbitrary index. this way, if you link to yourself, and your place changes over time, then the link will still point to you. the url is a bit complicated:

http://blogdex.media.mit.edu/browseIndex.asp?idx=index_name&url=your_url_here

where index_name is either 'popular' or 'total' and your_url_here is the url for your site in the database. so for instance:

http://blogdex.media.mit.edu/browseIndex.asp?idx=popular&url=http://www.redcross.org/

links to the red cross' website in the popular index and

http://blogdex.media.mit.edu/browseIndex.asp?idx=total&url=http://www.redcross.org/

links to its spot in the all-time links index. because these urls are a bit touchy, i added a spot on the link information page that allows you to quickly link to each of these places in blogdex-space.
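
since these urls are easy to get wrong by hand, here is a minimal python sketch of how one might build them. the function name `self_link` and the validation of `idx` are hypothetical, not part of blogdex. the sketch leaves `:` and `/` unescaped so that the output matches the examples above, and it percent-encodes characters such as `&` and `?`; whether the server decodes those escapes is an assumption.

```python
from urllib.parse import urlencode

# base address taken from the examples above
BASE = "http://blogdex.media.mit.edu/browseIndex.asp"
# the two index names mentioned in the post
VALID_INDICES = ("popular", "total")


def self_link(index_name: str, site_url: str) -> str:
    """Build a link to a site's current position in the given index."""
    if index_name not in VALID_INDICES:
        raise ValueError("index_name must be 'popular' or 'total'")
    # safe=":/" keeps the embedded url readable, as in the examples above,
    # while characters such as '&' and '?' are still percent-encoded
    query = urlencode({"idx": index_name, "url": site_url}, safe=":/")
    return BASE + "?" + query


print(self_link("popular", "http://www.redcross.org/"))
print(self_link("total", "http://www.redcross.org/"))
```

the encoding step matters mostly for site urls that carry their own query string, where a bare `&` would otherwise be read as the start of a new parameter.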

Posted by cameron on September 14, 2001 at 07:12 PM
weblog reaction to yesterday's tragic events

i just wanted to comment on the reaction observed today to yesterday's tragic events. i was a bit frustrated that the system currently updates so slowly (once per day) because i would have really liked to see people's opinions and concerns at any given moment. i think that in such a horrific situation, it is comforting to be close to other people experiencing the events, and weblogs definitely provide that closeness.

the response of webloggers to the events yesterday could be described as considerate and charitable, as shown by the links today on blogdex. it makes me feel good to observe this reaction.

to say the least, the events have upset my development cycle, which i'm sure people can understand. at the moment i'm riveted on the coverage of the fbi search and evacuation of the westin hotel here in boston.

Posted by cameron on September 12, 2001 at 02:15 PM
people get your crawl on

for those of you who have been waiting to be crawled for quite some time, i've decided to expedite the situation. i realized that i just don't have time at the moment to check each and every weblog entered into the database. so, i've decided to just crawl them all. i initially had two worries about automatically adding sites:

1. people would masquerade one site under multiple urls to boost their importance. this has roughly been solved by the checksum system that i have in place. with a few adjustments, i should be able to make it near impossible to get around.

2. sites that aren't weblogs might be added to the system. after thinking about this one for a while, i realized that this isn't really an issue. the weblog quality of a website is determined by how dynamic it is. if someone adds a static site, their statistics will never show up in the system, since only the difference over time is considered. i might try and use some feedback system to help determine when sites that aren't typical weblogs are added (e.g. cnn.com or salon.com).
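
the two ideas above can be sketched in a few lines of python. this is only an illustration under stated assumptions: the post does not say how the checksum is computed, so the sha-256 hash over whitespace-normalized, lowercased page text is a guess, and the names `content_checksum`, `find_mirrors` and `new_links` are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set


def content_checksum(html: str) -> str:
    """Hash a page after collapsing whitespace and case (assumed normalization)."""
    normalized = " ".join(html.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_mirrors(pages: Dict[str, str]) -> List[List[str]]:
    """Group urls whose page content produces the same checksum."""
    by_checksum = defaultdict(list)
    for url, html in pages.items():
        by_checksum[content_checksum(html)].append(url)
    # only groups with more than one url are possible masquerades
    return [sorted(urls) for urls in by_checksum.values() if len(urls) > 1]


def new_links(previous: Set[str], current: Set[str]) -> Set[str]:
    """Links seen in the current crawl that were absent from the previous one."""
    return current - previous


# hypothetical sample data: two urls serving the same page, one distinct page
pages = {
    "http://a.example/": "<p>Hello  World</p>",
    "http://b.example/": "<p>hello world</p>",
    "http://c.example/": "<p>other</p>",
}
print(find_mirrors(pages))
# a static site has the same links on every crawl, so nothing new is counted
print(new_links({"http://x.example/"}, {"http://x.example/"}))
```

an exact-match checksum like this one is easy to defeat by changing a single character on the page, which may be why the post says adjustments are still needed.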

with these issues semi-resolved, i'm happy to just get everyone crawled and happy. so i did.

Posted by cameron on September 10, 2001 at 08:14 PM
new features! (finally)

i've been working hard over the past week to roll out some new features. for quite some time people have expressed an interest in two tasks:

1. some search facility to discover urls in the database

2. more information about a given url

for number 1, i've implemented a url search page that searches all of the urls contained in blogdex's databases for a string. it uses a simple regular expression to look for urls containing the given string. i might implement something a bit more intricate, and more related to the structure of urls, but for the moment this works well.
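
for illustration, a substring search of this kind might look like the python sketch below. it is not the code behind the search page: the function name `search_urls` and the sample urls are hypothetical, and the case-insensitive matching is an assumption. the query is escaped first so that characters like `.` are matched literally rather than as regex wildcards.

```python
import re
from typing import Iterable, List


def search_urls(urls: Iterable[str], query: str) -> List[str]:
    """Return the urls that contain the query string, ignoring case."""
    # re.escape turns the user's text into a literal pattern
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    return [url for url in urls if pattern.search(url)]


# hypothetical sample of stored urls
stored = [
    "http://www.theonion.com/",
    "http://www.cnn.com/",
    "http://www.redcross.org/",
]
print(search_urls(stored, "onion"))
print(search_urls(stored, "CNN.com"))
```

a search that understood url structure could instead split each url with `urllib.parse.urlsplit` and match against the host or path separately.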

i experienced some speed problems from home late last night, but when i returned to work today i was surprised to find that they had gone away. i have ascertained that the actual problem lies in the redirection that i perform when you search, not the search itself. i'm doing the redirection to convert forms into url strings. if you're frustrated with the speed, you can just hack the search url and avoid being redirected.

for number 2, i have a new link information page, which can now be accessed from just about every point in the system. any time a url is displayed, it is now accompanied by an 'info' link which will take you to detailed information about that url. i plan on expanding the list of details as time goes on, and adding graphing tools and other analytical systems. if you have any suggestions for different ways of looking at this information, please let me know.

Posted by cameron on September 06, 2001 at 01:08 PM