Re: Billion Triples Challenge Crawl 2014

Tim Berners-Lee wrote:

> Agree with Hugh. I would encourage you to unzip the data 
> files on your own servers so the URIs will work and your 
> data is really Linked Data.

I think it's possible to make the URIs link properly without 
uncompressing the data on the server.

Suppose the data lives in /download.rdf.gz.  You can make 
the webserver 303 other URIs to /download.rdf, and have 
/download.rdf send download.rdf.gz if the request said 
Accept-Encoding: gzip, or respond with 406 otherwise.
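
To make the intended behaviour concrete, here is a minimal Python sketch of 
that decision logic.  It is only an illustration, not the server 
implementation: the function name respond, the root-relative paths, and the 
returned header dictionaries are assumptions, and the Accept check on the 
303 branch follows the Apache rules in this message rather than the 
paragraph above.

```python
DATA_DOCUMENT = "/download.rdf"  # logical URI of the data; the file on disk is download.rdf.gz


def respond(path, accept="", accept_encoding=""):
    """Return (status, headers) for a request under the scheme described above."""
    if path != DATA_DOCUMENT:
        # Any other URI: send RDF/XML clients to the data document with a 303.
        if "application/rdf+xml" in accept:
            return 303, {"Location": DATA_DOCUMENT}
        return 406, {}
    # The data document itself is only available gzip-encoded.
    # Simplification: a plain substring test that ignores q-values.
    if "gzip" in accept_encoding.lower():
        return 200, {
            "Content-Type": "application/rdf+xml",
            "Content-Encoding": "gzip",
        }
    return 406, {}
```

A request for /alice with Accept: application/rdf+xml yields the 303, and 
the follow-up request for /download.rdf succeeds only if the client also 
sent Accept-Encoding: gzip.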

It's quite feasible to orchestrate this in Apache.  I've 
just done it as follows, but I'm sure there are more elegant 
ways:

   # Serve .rdf files as RDF/XML
   AddType application/rdf+xml .rdf

   Options -MultiViews

   RewriteEngine on
   RewriteBase /~richard/foaf

   # RDF/XML requests for any other URI: 303 to the data document
   RewriteCond %{HTTP_ACCEPT} application/rdf\+xml
   RewriteCond %{REQUEST_FILENAME} !__406
   RewriteCond %{REQUEST_URI} !download.rdf
   RewriteRule (.*) download.rdf [L,R=303]

   # Client accepts gzip: serve the compressed file for download.rdf
   RewriteCond %{HTTP:Accept-Encoding} gzip
   RewriteRule download.rdf download.rdf.gz [L,PT]

   # Everything else is answered with 406 Not Acceptable
   RewriteCond %{REQUEST_FILENAME} !__406
   RewriteRule (.*) __406 [L,PT]
   RedirectMatch 406 /__406

Then all you need is for the client to support gzip content 
encoding, and many of the common HTTP client libraries do. 
For example, if I run:

   curl -s --compressed -L -H 'Accept: application/rdf+xml' \
     http://localhost/~richard/foaf/alice | rapper -q \
     -o turtle - http://localhost/~richard/foaf/alice

I get:

   @base <http://localhost/~richard/foaf/alice> .
   @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
   @prefix foaf: <http://xmlns.com/foaf/0.1/> .

   <>
     a foaf:Person ;
     foaf:knows <bob> ;
     foaf:name "Alice" .

   <bob>
     a foaf:Person ;
     foaf:knows <> ;
     foaf:name "Bob" .

And if you repeat the process fetching <bob>, you'll end up 
with precisely the same triples.

The remaining question is whether it's reasonable to expect 
clients to support gzip content-encoding.  It doesn't seem 
unreasonable to me.
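
As a rough indication of how little a client needs, here is a minimal 
sketch using only the Python standard library.  The function names 
fetch_rdf and decode_body are hypothetical, and the sketch assumes the 
server labels the compressed response with a Content-Encoding header; 
urllib follows the 303 itself but does not decompress the body, so that 
step is done by hand.

```python
import gzip
import urllib.request


def decode_body(body: bytes, content_encoding: str) -> bytes:
    """Undo gzip content-encoding if the response declared it."""
    if (content_encoding or "").strip().lower() in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    return body


def fetch_rdf(uri: str) -> bytes:
    """Fetch RDF/XML from uri, advertising and handling gzip content-encoding."""
    request = urllib.request.Request(
        uri,
        headers={
            "Accept": "application/rdf+xml",
            "Accept-Encoding": "gzip",
        },
    )
    # urlopen follows the 303 redirect and resends these headers.
    with urllib.request.urlopen(request) as response:
        encoding = response.headers.get("Content-Encoding", "")
        return decode_body(response.read(), encoding)
```

Calling fetch_rdf("http://localhost/~richard/foaf/alice") would be the 
rough equivalent of the curl invocation above, assuming the same local 
setup.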

Richard

Received on Sunday, 16 February 2014 20:01:58 UTC