email verification added
update .. update
english loses dominance
too much cohesion!
blogdex-related sites
aaaaaah! stop!
blogdex in playboy
adding weblogs
for those of you who viewed blogdex this morning, and saw lots of corporate material, i'd like to reiterate that this is a difficult problem to diagnose.
when a set of weblogs using some hosting service (whose urls need not relate to that service) go down for some reason, they are often replaced by a default webpage from the host. last night, a small subset of the tripod weblogs were taken down and replaced with a tripod 404 page, which contains links to lots of other lycos sites. this is what caused these sites to appear. it is an easy problem to fix, but difficult to detect.
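one way such an outage might be spotted is to look for the same page body turning up at many unrelated weblog urls in a single crawl. the sketch below is only an illustration under that assumption, not how blogdex works: `find_placeholder_pages`, the `pages` mapping, and the `min_sites` threshold are all hypothetical, and a real host page that embeds the requested url would need fuzzier matching than an exact hash.

```python
import hashlib
from collections import defaultdict


def find_placeholder_pages(pages, min_sites=3):
    """Flag weblogs whose fetched body is identical to many others.

    pages: dict mapping weblog url -> fetched html body (hypothetical input).
    Returns the set of urls sharing an identical body with at least
    min_sites weblogs in total, a hint that a host swapped in a default page.
    """
    by_digest = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        by_digest[digest].append(url)

    flagged = set()
    for urls in by_digest.values():
        if len(urls) >= min_sites:
            flagged.update(urls)
    return flagged
```

anything flagged this way could be skipped for that crawl rather than counted toward the index.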
i apologize for any confusion this may have created.
in an attempt to subvert those pesky spammers, i've added an email verification loop into the weblog addition system. for all new weblogs, an email is required to validate the site, which results in a message like this being sent:
to complete your registration, please follow this link:
http://blogdex.media.mit.edu/validate.asp?url=http://somesite.com/&val=XXXXXX
if this link does not work, go to http://blogdex.media.mit.edu/validate.asp, using the following information:
url: http://somesite.com/
validation code: XXXXXX
thank you, and happy blogging!
+the blogdex team+
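a verification loop like the one above can be sketched in a few lines. this is a minimal illustration, not the actual validate.asp code: the `SECRET` value, the six-character code length, and the helper names are assumptions, and the link here is url-encoded, unlike the sample message.

```python
import hashlib
import hmac
from urllib.parse import urlencode

SECRET = b"change-me"  # hypothetical server-side secret
VALIDATE_URL = "http://blogdex.media.mit.edu/validate.asp"


def make_code(weblog_url, secret=SECRET):
    """Derive a short validation code from the weblog url and a secret."""
    digest = hmac.new(secret, weblog_url.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:6].upper()


def make_link(weblog_url, secret=SECRET):
    """Build the link that would be emailed to the submitter."""
    query = urlencode({"url": weblog_url, "val": make_code(weblog_url, secret)})
    return VALIDATE_URL + "?" + query


def is_valid(weblog_url, code, secret=SECRET):
    """Check a submitted code without storing anything per weblog."""
    return hmac.compare_digest(make_code(weblog_url, secret), code.upper())
```

deriving the code from the url means nothing has to be stored until the submitter clicks through, at the cost of the code never expiring.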
i think that this is enough stress to stop anyone from maliciously adding non-blogs. if they persist, at least now i have an email address to reply to. i guess even now i could do as michael fraase suggests, and give them a run for their money.
of course, with 5000 dubious websites, i might again have to resort to draconian purges, including probably everything with ".de," "sex," "casino," etc., contained in the url. this should at least get things down to a more manageable set. if you're not sure whether or not your weblog is in the index, please check by searching for the url.
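a purge of that kind amounts to a substring filter over the submitted urls. the sketch below uses the three fragments named above; the function names are hypothetical, and the second sample url shows why such a filter is draconian: ".de" also matches inside "www.design-notes.com".

```python
BLOCKED_FRAGMENTS = (".de", "sex", "casino")


def looks_dubious(url, fragments=BLOCKED_FRAGMENTS):
    """True if any blocked fragment appears anywhere in the url."""
    lowered = url.lower()
    return any(fragment in lowered for fragment in fragments)


def purge(urls):
    """Keep only the urls that contain no blocked fragment."""
    return [url for url in urls if not looks_dubious(url)]


candidates = [
    "http://example.de/",
    "http://www.design-notes.com/",  # false positive: ".de" inside ".design"
    "http://casino-fun.com/",
    "http://somesite.com/",
]
print(purge(candidates))
```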
hopefully this will be the last post about security.
we've been working hard on development for the past few weeks, which has made me lax in my posting. here's a taste of what's coming:
crawling upgrade: the crawler is currently being optimized, so that within a few weeks we hope to have up-to-the-hour statistics on weblogs. we've realized that daily is not good enough, and that prediction will have to be a much bigger part of our crawl scheduling.
social network: for the past month we've been developing and testing a new sort of view of the weblog community based on the social network. i can't say exactly what kind of interface we plan to put on it (since we haven't thought of it yet), but i can say that it's very interesting data. something that has come up often: the weblog community has 5.818 degrees of separation, based on current information. more to come.. soon..
opensource: we're still working on moving blogdex to an opensource environment. we're hoping to have a mirror up for development sometime early in january.
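for readers curious what a "degrees of separation" figure means in practice, one common reading is the average shortest-path length between weblogs in the link graph. the sketch below computes that for a toy graph; treating links as undirected and averaging only over reachable pairs are assumptions, and nothing here is claimed to be the method behind the 5.818 figure.

```python
from collections import deque


def average_separation(links):
    """Mean shortest-path length over all reachable ordered pairs.

    links: dict mapping each weblog to the weblogs it links to
    (hypothetical input). Links are treated as undirected.
    """
    graph = {}
    for src, targets in links.items():
        graph.setdefault(src, set())
        for dst in targets:
            graph.setdefault(dst, set())
            graph[src].add(dst)
            graph[dst].add(src)

    total = 0
    pairs = 0
    for start in graph:
        # breadth-first search gives shortest hop counts from start
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0
```

running one search per weblog is fine for a sketch but would likely need sampling on a graph the size of the full index.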
of the 500 million web users, only 45% are native speakers of english, says a report by the U.S. Internet Council and ITTA (International Technology & Trade Associates Inc).
internationalization is probably the biggest hurdle to overcome for blogdex and other aggregators. and i'm not convinced that just separating people into their respective languages is the best option.
thanks to tools like blogdex and daypop, i think that there is now LOTS of cohesion in the weblog community. for instance, when i first started crawling, the most occurrences of any single link i would find would be on the order of 5 per day. now it is not uncommon to find 30 or 50. this means that the blogdex index now is more like a billboard chart, keeping popular links posted for as long as they hold their place at the top.
this is fine for people who don't check but once a week, but what about those of us who are constantly looking for the fresh and new? i'm thinking about devising a new metric that looks in a much shorter time frame and stresses timeliness. i'll play around with a couple of algorithms, and possibly throw one together this weekend. i miss the old days of blogdex, when every day, 90% of the links were something i had never seen.
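one candidate for a metric that stresses timeliness is to let each citation of a link lose weight as it ages. the sketch below uses an exponential decay with a 12-hour half-life; both the decay shape and the half-life are assumptions chosen for illustration, not an algorithm blogdex has settled on.

```python
def freshness_score(citation_ages_hours, half_life_hours=12.0):
    """Sum of citation weights, each halving every half_life_hours.

    citation_ages_hours: how long ago (in hours) each weblog linked
    to the item (hypothetical input).
    """
    return sum(0.5 ** (age / half_life_hours) for age in citation_ages_hours)


# five links from the past hour outrank thirty links from a week ago
fresh = freshness_score([1, 1, 1, 1, 1])
stale = freshness_score([168] * 30)
print(fresh > stale)
```

a shorter half-life pushes the index toward what is new; a longer one drifts back toward the billboard-chart behavior.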
i've noticed these in my referrers over the past couple of days:
http://blogdex.tripod.co.jp/ appears to be a japanese version of blogdex, none of which i comprehend (if you do understand japanese, this is their faq).
http://blogdex.pitas.com/ is an atomz search of blogdex. not too sure what it is useful for, but it works.
i really wish i knew who is spamming our site additions. i don't think that they realize that IT DOESN'T WORK. i'd rather not ever add anyone again than make the mistake of adding one of their damn sites.
the only solution i can resort to from here is requiring some sort of email feedback to have your weblog added, which isn't exactly guaranteed to stop them, just slow them down.
any other suggestions? right now i'm only allowing one weblog submission per ip address per day.
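a one-per-ip-per-day rule like the one just mentioned can be sketched as a small in-memory limiter. the class name and the rolling 24-hour window are assumptions; a real site would keep this in its database, and a calendar-day cutoff would behave slightly differently.

```python
from datetime import datetime, timedelta


class SubmissionLimiter:
    """Allow one accepted submission per ip address per window."""

    def __init__(self, window=timedelta(days=1)):
        self.window = window
        self.last_seen = {}  # ip -> time of last accepted submission

    def allow(self, ip, now):
        last = self.last_seen.get(ip)
        if last is not None and now - last < self.window:
            return False  # too soon; the rejected attempt is not recorded
        self.last_seen[ip] = now
        return True


limiter = SubmissionLimiter()
start = datetime(2001, 11, 1, 9, 0)
print(limiter.allow("192.0.2.1", start))
print(limiter.allow("192.0.2.1", start + timedelta(hours=5)))
```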
in what might go down as the crowning achievement in my graduate career, blogdex was mentioned in the december issue of playboy. of course, when i bought the issue, i told the sales clerk at the 7-11 that "hey, i'm in it," to which he nodded with a sort of confused look. the author writes:
"I look at Blogdex every day, and it always pays off. When I last checked, Blogdex reported the most popular links were for a CNN story about McDonalds contest scandal, a bizarre Shockwave cartoon featuring man-headed robots and marauding elephants, and a Yahoo article about a protester who chopped off a testicle on the steps of Peru's parliament building (Fortunately, his doctor said he'd be able to enjoy a 'normal sex life')."
sounds like a pretty accurate representation of the sites one might find here.
i've essentially stopped the massive flood of weblog addition spam. in order to prune the list down to a reasonable size, i had to delete the following:
1. any sites added by an ip address that added more than 2 weblogs over time
2. all sites added by .de, .ch, and .at domains
3. some of those with suspicious names
i apologize for having to do this, but it should be good from here. if you suspect that your weblog has been deleted, please add it again and i will validate you asap.
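the first two pruning rules above can be written down roughly as follows. this is a sketch under assumptions: `submissions` is a hypothetical list of (ip, url) pairs, and rule 2 is read here as applying to the weblog's own hostname. rule 3, suspicious names, was a judgment call and is left out.

```python
from collections import Counter
from urllib.parse import urlparse

BLOCKED_TLDS = (".de", ".ch", ".at")


def prune(submissions, max_per_ip=2):
    """Return the urls that survive rules 1 and 2.

    submissions: list of (ip, url) pairs (hypothetical input).
    """
    per_ip = Counter(ip for ip, _ in submissions)
    kept = []
    for ip, url in submissions:
        if per_ip[ip] > max_per_ip:
            continue  # rule 1: this ip added more than max_per_ip weblogs
        host = urlparse(url).hostname or ""
        if host.endswith(BLOCKED_TLDS):
            continue  # rule 2: weblog lives under a blocked domain
        kept.append(url)
    return kept
```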


