Raw blog data dump

From: Danny Ayers <danny.ayers@gmail.com>
Date: Fri, 20 Apr 2007 11:50:07 +0200
Message-ID: <1f2ed5cd0704200250q589f111ckc027557fb71aab98@mail.gmail.com>
To: linking-open-data@simile.mit.edu
Cc: www-archive@w3.org

A dataset that might be useful for test/experimental purposes - my
personal blog, from early 2003 to the present:

http://dannyayers.com:88/data/raw_2007-04-20.rdf.gz

Rapper reckons 289847 triples. Main vocab is RSS 1.0. It includes
visitor comments (with a proportion of spam), Knobot system-specific
statements, some FOAF, various other bits & pieces.
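For a quick sanity check before loading the dump into a store, something like the following might do. It is only a rough sketch using the Python standard library: it streams the gzipped file and counts RSS 1.0 item elements, and it does not count triples (that needs a real RDF parser such as rapper). It assumes items are serialised as typed `<item>` elements in the RSS 1.0 namespace rather than as `rdf:Description` nodes with an `rdf:type`, which may not hold for this dump. The function names are made up for illustration.

```python
import gzip
import xml.etree.ElementTree as ET

# RSS 1.0 namespace plus the local name "item", in ElementTree's {ns}name form.
RSS_ITEM = "{http://purl.org/rss/1.0/}item"

def count_rss_items(fileobj) -> int:
    """Count <item> elements in an RDF/XML stream (hypothetical helper)."""
    count = 0
    # iterparse streams the document, so the whole file is never held at once.
    for _event, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag == RSS_ITEM:
            count += 1
        elem.clear()  # drop content already seen to keep memory use low
    return count

def count_items_in_dump(path: str) -> int:
    """Open a .rdf.gz file and count its RSS items (hypothetical helper)."""
    with gzip.open(path, "rb") as fh:
        return count_rss_items(fh)
```

Usage would be along the lines of `count_items_in_dump("raw_2007-04-20.rdf.gz")` on a local copy of the file.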

Although it's valid RDF/XML, in other respects it's as rough as can be.
The content is tag soup HTML. Since I started self-hosting the blog
I've switched CMS twice - initially it was Movable Type, then
WordPress, now Knobot. If I remember correctly the MT/WP transition
1970'd a lot of the dates, but the raw data is still in there
somewhere.
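The "1970'd" dates (timestamps reset to the Unix epoch) could be flagged with a check along these lines. This is a sketch under assumptions: that dates appear as W3CDTF-style strings beginning with a four-digit year, and that anything earlier than 2003 (the start of this dump) is suspect. Both the helper name and the cut-off parameter are illustrative, not part of the data.

```python
import re

# Leading four-digit year, as in 2005-06-01T12:00:00+02:00.
YEAR_RE = re.compile(r"^\s*(\d{4})-")

def is_suspect_date(value: str, earliest_year: int = 2003) -> bool:
    """Return True if a date string predates the blog or cannot be read.

    Hypothetical helper: earliest_year is an assumption taken from the
    description of the dump (early 2003 onwards).
    """
    match = YEAR_RE.match(value)
    if match is None:
        return True  # no recognisable year, so treat it as suspect
    return int(match.group(1)) < earliest_year
```

Comparing on the year rather than on the literal string 1970-01-01 also catches epoch values that a timezone offset has pushed back into late 1969.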

One particular challenge re. exposing this via SPARQL or whatever is
that it also contains some email addresses in plain text - these need
to be hidden from spammers' harvesters.
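One possible way to do that hiding, sketched here rather than prescribed: replace each plain-text address with the SHA-1 hash of its mailto: URI, which is how FOAF's mbox_sha1sum property is defined. The regular expression is deliberately loose and will miss unusual addresses, and the `mask_emails` helper and the `sha1:` prefix are invented for this example.

```python
import hashlib
import re

# Loose pattern for typical addresses; not a full RFC 5322 parser.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def mbox_sha1sum(address: str) -> str:
    """SHA-1 hex digest of the mailto: URI, as foaf:mbox_sha1sum uses."""
    return hashlib.sha1(("mailto:" + address).encode("utf-8")).hexdigest()

def mask_emails(text: str) -> str:
    """Swap each address found in text for its hash (hypothetical helper)."""
    return EMAIL_RE.sub(lambda m: "sha1:" + mbox_sha1sum(m.group(0)), text)
```

Hashing rather than deleting keeps the ability to tell that two comments came from the same address without exposing the address itself.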

License - CC Attribution (i.e. link appreciated if you use the stuff)
http://creativecommons.org/licenses/by/2.5/

There's 2002-2003 blog data at:
http://semtext.org/semblog
via Blogger - but I've yet to get a dump. (Too busy blogging ;-)

Cheers,
Danny.

-- 

http://dannyayers.com
Received on Friday, 20 April 2007 09:50:17 UTC
