- From: Hugh Glaser <hg@ecs.soton.ac.uk>
- Date: Thu, 9 Jul 2009 11:08:00 +0100
- To: Peter Ansell <ansell.peter@gmail.com>, Juan Sequeda <juanfederico@gmail.com>
- CC: Linked Data community <public-lod@w3.org>
On 09/07/2009 07:56, "Peter Ansell" <ansell.peter@gmail.com> wrote:

> 2009/7/9 Juan Sequeda <juanfederico@gmail.com>:
>> On Jul 9, 2009, at 2:25 AM, Hugh Glaser <hg@ecs.soton.ac.uk> wrote:
>> <snip hash URI comments>
>>> Mind you, it does mean that you should make sure that you don't put
>>> too many LD URIs in one document.
>>> If dbpedia decided to represent all the RDF in one document, and then
>>> use hash URIs, it would be somewhat problematic.
>>
>> Could you explain why???
>
> Does it seem reasonable to have to trawl through millions (or billions)
> of RDF triples resolved from a large database that only used one base
> URI with fragment identifiers for everything else, if you don't need to,
> considering that 100 specific RDF triples in a compact document might
> have been all you needed to see?
>
> Peter

As a concrete example: for dblp we split the data into year models before
asserting it into the triplestore, so that we can serve RDF for each URI by
doing a sort of DESCRIBE.

The paper
  http://dblp.rkbexplorer.com/id/journals/expert/ShadboltGGHS04
comes from the model file
  http://dblp.rkbexplorer.com/models/dblp-publications-2004.rdf
which is 155MB.

Using hash URIs would require a file of that size to be served for every
access, although if we were actually doing it that way we would of course
change our model file granularity to avoid it. So there is both a possible
network overhead and a processing overhead, and either can be got wrong.
(There are a couple of rough sketches of what I mean at the end of this
mail.)

In fact, large FOAF files already give you quite a lot of extra stuff if
all you wanted was some personal details. When you want to know about
timbl, and all you wanted was his blog address, you don't necessarily want
to download and process 30-odd KB of RDF, much of it details of the people
he knows (such as Tom Ilube's URI).

Just something to be aware of when serving linked data with hash URIs.

And to add something else to the mix: this is another reason semantic
sitemaps are so important for search engines like Sindice. Sindice can
index our model file, but on receiving a request for a URI in it, without
the sitemap all it could easily do would be to point the requester at the
155MB model file. Because of the sitemap, it can much more easily work out
for itself what it needs to know about the URI and point the user at the
linked data URI - all without spidering our whole triplestore, which would
be unacceptable.

Ah, the rich tapestry of life that is linked data!

Best
Hugh
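P.S. To make the client-side overhead concrete, here is a rough sketch of
what resolving a hash URI over one big model file would entail (illustrative
only, not RKBExplorer's actual code; it assumes Python with rdflib, and the
hash-style identifier is hypothetical):

    from rdflib import Graph, URIRef

    # With a hash URI such as
    #   http://dblp.rkbexplorer.com/models/dblp-publications-2004.rdf#ShadboltGGHS04
    # the client strips the fragment before making the HTTP request (RFC 3986),
    # so the server can only hand back the whole document.
    doc = "http://dblp.rkbexplorer.com/models/dblp-publications-2004.rdf"
    paper = URIRef(doc + "#ShadboltGGHS04")  # hypothetical hash-style identifier

    g = Graph()
    g.parse(doc, format="xml")               # fetches and parses ~155MB of RDF/XML

    # ...all of that work, just to pick out the handful of triples we wanted:
    wanted = list(g.triples((paper, None, None)))

With the slash URI http://dblp.rkbexplorer.com/id/journals/expert/ShadboltGGHS04
the server does that selection itself and returns only the small description,
so the client never has to touch the 155MB model.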
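And here is the server side of the slash-URI arrangement, again only as a
sketch of the idea (the real RKBExplorer setup serves from a triplestore,
not rdflib, and the describe() helper is made up for illustration):

    from rdflib import Graph, URIRef

    # The year model is loaded once, ahead of time.
    store = Graph()
    store.parse("dblp-publications-2004.rdf", format="xml")

    # Each request for an /id/... URI gets back only the triples that mention
    # that resource - the "sort of DESCRIBE" mentioned above.
    def describe(uri):
        uri = URIRef(uri)
        g = Graph()
        for t in store.triples((uri, None, None)):   # URI as subject
            g.add(t)
        for t in store.triples((None, None, uri)):   # URI as object
            g.add(t)
        return g.serialize(format="xml")             # a small document, not 155MB

    describe("http://dblp.rkbexplorer.com/id/journals/expert/ShadboltGGHS04")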
Received on Thursday, 9 July 2009 10:09:00 UTC