RE: Survey of RDF data on the Web from Andreas Eberhart on 2002-08-19 (www-rdf-interest@w3.org from August 2002)

From: Andreas Eberhart <andreas.eberhart@i-u.de>
Date: Mon, 19 Aug 2002 14:24:54 +0200
To: "Danny Ayers" <danny666@virgilio.it>, <www-rdf-interest@w3.org>
Message-ID: <NDBBJEBLMJJIKHDPNFPCIEKHDFAA.andreas.eberhart@i-u.de>

Hi Danny,

> The paper (2.4) states that "RDF subjects, predicates and most objects are
> URLs themselves..." - errm, not!
> However it's interesting that you did get a good number of links using
> this assumption.

Oops, you're right. It was basically a (desperate) attempt to find more RDF.
The amount of new data obtained by this method actually looks larger than it
really is. For instance, many pages found this way have a URL that starts
with http://xmlns.com/wordnet/1.6/. There are not too many distinct hosts.
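
To make the distinct-hosts point concrete, here is a minimal Python sketch
that groups crawled URLs by host. The function name host_counts and the
sample URLs are made up for illustration; they are not data from the survey.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_counts(urls):
    """Count how many of the given URLs fall on each host."""
    return Counter(urlsplit(u).netloc.lower() for u in urls)


# Hypothetical sample, only to show the effect of one dominant host.
sample = [
    "http://xmlns.com/wordnet/1.6/Dog",
    "http://xmlns.com/wordnet/1.6/Cat",
    "http://xmlns.com/wordnet/1.6/Animal",
    "http://purl.org/dc/elements/1.1/title",
]

counts = host_counts(sample)
print(len(sample), "URLs on", len(counts), "distinct hosts")
```

Reporting the number of distinct hosts next to the raw page count shows how
much of the new data comes from a single site.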


> I would suspect that there is a large number of HTML pages with linked RDF
> data - in fact it might be worth comparing the difference in the number of
> pages using the different embedding/linking techniques described in Sean
> Palmer's paper (URL anyone?).

That's actually not the case. Almost all links are to http://xmlns.com/,
http://purl.org/, http://www.w3.org/, etc. A lot of the links come from
within the same site (e.g. one Wordnet concept having a link to another).
RDF embedded in HTML seems to be the exception.


> I'm a little confused about the crawling strategies (must reread), so
> apologies if you're doing this already. I would imagine that filtering
> sites so that outgoing links are only crawled if the current page has
> associated RDF would significantly reduce the bulk of non-RDF associated
> pages without removing too many good links.

That's a good idea. Currently the search depth for the crawling strategy is
only two for each of the 20 starting points, which is too low. I'll
implement your suggestion for the next run and increase the depth.
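
A rough Python sketch of the suggested filter, assuming a breadth-first
crawl: a page's outgoing links are followed only if the page itself has
associated RDF. The callbacks get_links and has_rdf and the toy link graph
are hypothetical stand-ins for real fetching and RDF detection; this is not
the crawler used for the survey.

```python
from collections import deque


def crawl(seeds, max_depth, get_links, has_rdf):
    """Breadth-first crawl that expands a page's outgoing links
    only when the page itself has associated RDF."""
    seen = set(seeds)
    queue = deque((s, 0) for s in seeds)
    rdf_pages = []
    while queue:
        url, depth = queue.popleft()
        if not has_rdf(url):
            continue  # prune: do not follow links from non-RDF pages
        rdf_pages.append(url)
        if depth >= max_depth:
            continue  # depth limit reached, keep the page but stop here
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return rdf_pages


# Toy link graph (hypothetical): page "c" has no RDF, so "e" is never visited.
LINKS = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
HAS_RDF = {"a", "b", "d", "e"}

found = crawl(["a"], 3, lambda u: LINKS.get(u, []), lambda u: u in HAS_RDF)
shallow = crawl(["a"], 1, lambda u: LINKS.get(u, []), lambda u: u in HAS_RDF)
print(found, shallow)
```

In the toy graph, page "e" has RDF but is only reachable through the non-RDF
page "c", so the filter misses it. That is the trade-off between less bulk
and lost links.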


> It would be nice to see an amalgamated vocabulary containing terms and
> their usage frequency (and clashes).
> Another possible avenue of exploration on your data - looking at the ratio
> of hubs (statements referring to a lot of other docs) to authorities
> (docs with a lot of incoming references).

Section 3.4 talks about this. Besides the W3C namespaces, only purl.org and
ns.adobe.com are frequently used. I guess with those namespaces, literals
are typically used. We were hoping to find more references to the Open
Directory, for example (you called those authorities), which would be very
useful for categorizing online learning modules. However, this is very
uncommon.
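
As an illustration of a namespace usage count, here is a small Python sketch
that tallies predicate URIs by namespace. Splitting at the last "#" or "/"
is a common heuristic, not necessarily how Section 3.4 derives its numbers,
and the helper names and sample predicates are assumptions for this example.

```python
from collections import Counter


def namespace_of(uri):
    """Return the part of the URI up to and including the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[:cut + 1] if cut >= 0 else uri


def namespace_frequency(predicates):
    """Count how often each namespace occurs among the predicate URIs."""
    return Counter(namespace_of(p) for p in predicates)


# Hypothetical predicates, not taken from the survey data.
predicates = [
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "http://purl.org/dc/elements/1.1/title",
    "http://purl.org/dc/elements/1.1/creator",
    "http://www.w3.org/2000/01/rdf-schema#label",
]

freq = namespace_frequency(predicates)
for ns, n in freq.most_common():
    print(n, ns)
```

The same tally over the objects of statements, rather than the predicates,
would show how often resources such as Open Directory categories are
referenced.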

Andreas

Received on Monday, 19 August 2002 08:26:54 UTC