- From: Andreas Eberhart <andreas.eberhart@i-u.de>
- Date: Mon, 19 Aug 2002 14:24:54 +0200
- To: "Danny Ayers" <danny666@virgilio.it>, <www-rdf-interest@w3.org>
Hi Danny,

> The paper (2.4) states that "RDF subjects, predicates and most objects are
> URLs themselves..." - errm, not!
> However it's interesting that you did get a good number of links using this
> assumption.

Oops, you're right. It basically was a (desperate) attempt to find more RDF. The amount of new data obtained by this method actually looks higher than it really is. For instance, many pages found this way have a URL that starts with http://xmlns.com/wordnet/1.6/. There are not too many distinct hosts.

> I would suspect that there is a large number of HTML pages with linked RDF
> data - in fact it might be worth comparing the difference in the number of
> pages using the different embedding/linking techniques described in Sean
> Palmer's paper (URL anyone?).

That's actually not the case. Almost all links are to http://xmlns.com/, http://purl.org/, http://www.w3.org/, etc. A lot of the links come from within the same site (e.g. one Wordnet concept having a link to another). RDF embedded in HTML seems to be the exception.

> I'm a little confused about the crawling strategies (must reread), so
> apologies if you're doing this already, I would imagine that filtering sites
> so that outgoing links are only crawled if the current page has associated
> RDF would significantly reduce the bulk of non-RDF associated pages without
> removing too many good links.

That's a good idea. Currently the search depth for the crawling strategy is only two for each of the 20 starting points, which is too low. I'll implement your suggestion for the next run and increase the depth.

> It would be nice to see an amalgamated vocabulary containing terms and their
> usage frequency (and clashes),
> Another possible avenue of exploration on your data - looking at the ratio
> of hubs (statements referring to a lot of other docs) to authorities
> (docs with a lot of incoming references).

Section 3.4 talks about this. Besides the W3C namespaces, only purl.org and ns.adobe.com are frequently used. I guess with those namespaces, literals are typically used. We were hoping to find more references to, for instance, the Open Directory (you called those authorities), which would be very useful for categorizing online learning modules. However, such references are very uncommon.

Andreas
Received on Monday, 19 August 2002 08:26:54 UTC