Raw blog data dump

A dataset that might be useful for test/experimental purposes - my
personal blog, from early 2003 to the present:


Rapper reckons 289847 triples. Main vocab is RSS 1.0. It includes
visitor comments (with a proportion of spam), Knobot system-specific
statements, some FOAF, various other bits & pieces.

Although it's valid RDF/XML, in other respects it's as rough as can be.
The content is tag soup HTML. Since I started self-hosting the blog
I've switched CMS twice - initially it was Movable Type, then
WordPress, now Knobot. If I remember correctly the MT/WP transition
1970'd a lot of the dates (i.e. reset them to the Unix epoch), but the
raw data is still in there.
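
For anyone wanting to filter out the 1970'd entries, here is a rough Python sketch. It assumes the dates appear as ISO 8601 strings (as dc:date values usually do in RSS 1.0); that is a guess about this dump, not something checked against it, and the function name is made up for illustration.

```python
from datetime import datetime, timezone

# 1 January 1970 (UTC) is where a zeroed Unix timestamp lands.
EPOCH_DAY = datetime(1970, 1, 1, tzinfo=timezone.utc).date()


def looks_epoch_reset(date_text):
    """Return True if an ISO 8601 date string falls on 1 January 1970 (UTC).

    Hypothetical helper: assumes ISO 8601 input and treats dates without
    a timezone as UTC. Unparseable strings are reported as False.
    """
    try:
        # datetime.fromisoformat() does not accept a trailing "Z" on older
        # Python versions, so normalise it to an explicit offset first.
        parsed = datetime.fromisoformat(date_text.strip().replace("Z", "+00:00"))
    except ValueError:
        return False
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).date() == EPOCH_DAY
```

Items flagged this way would still need their real dates recovered from elsewhere in the raw data; the check only identifies suspects.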

One particular challenge re. exposing this via SPARQL or whatever is
that it also contains some email addresses in plain text - these need
to be hidden from spammers' harvesters.
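
One possible way to do that hiding before publishing, sketched in Python. The regular expression is a deliberately simple approximation of an email address (it will miss unusual forms), and the function names and placeholder text are made up for illustration. The second helper follows the FOAF convention for foaf:mbox_sha1sum, which hashes the full mailto: URI.

```python
import hashlib
import re

# Simplified pattern: local part, "@", domain with at least one dot.
# An approximation only; it is not a full RFC 5322 address parser.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text, placeholder="[email hidden]"):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub(placeholder, text)


def mbox_sha1sum(address):
    """Return the SHA-1 hex digest of the mailto: URI for an address,
    as used for foaf:mbox_sha1sum."""
    return hashlib.sha1(("mailto:" + address).encode("utf-8")).hexdigest()
```

For example, mask_emails("Contact bob@example.org now") gives "Contact [email hidden] now". Masking suits free-text content such as comments, while the hashed form keeps FOAF statements usable for matching people without exposing the address itself.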

License - CC Attribution (i.e. link appreciated if you use the stuff)

There's 2002-2003 blog data at:
via Blogger - but I've yet to get a dump. (Too busy blogging ;-)




Received on Friday, 20 April 2007 09:50:17 UTC