Re: Crawlers need content negotiation, not! was: Re: URL +1, LSID -1 from Eric Jain on 2007-07-16 (public-semweb-lifesci@w3.org from July 2007)

From: Eric Jain <Eric.Jain@isb-sib.ch>
Date: Mon, 16 Jul 2007 11:16:57 +0200
To: Alan Ruttenberg <alanruttenberg@gmail.com>
CC: wangxiao@musc.edu, Michel_Dumontier <Michel_Dumontier@carleton.ca>, public-semweb-lifesci <public-semweb-lifesci@w3.org>, Mark Wilkinson <markw@illuminae.com>, Benjamin Good <goodb@interchange.ubc.ca>, Natalia Villanueva Rosales <naty.vr@gmail.com>
Message-ID: <469B3789.7070801@isb-sib.ch>

Alan Ruttenberg wrote:
> Except this isn't an issue. A link in the html suffices to let them know 
> where the RDF is, and the extra retrieval isn't going to kill them. 

There are something like 30M RDF documents on http://beta.uniprot.org/ 
alone. If for each document you have to retrieve and parse a web page 
first, that more than doubles the number of requests (and data volume)!
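
A rough back-of-the-envelope sketch of the request count, assuming exactly one extra HTML fetch per document and taking the 30M figure above at face value. The function name is hypothetical, and data volume is not modelled:

```python
def requests_needed(documents, fetches_per_document):
    """Total HTTP requests if every document costs a fixed number of fetches."""
    return documents * fetches_per_document


documents = 30_000_000  # approximate figure quoted above

# Assumption: following a link in the HTML costs one HTML fetch plus one RDF fetch.
via_html_link = requests_needed(documents, 2)

# Assumption: asking for RDF directly costs a single fetch per document.
direct = requests_needed(documents, 1)

print(via_html_link, direct)
```

The HTML pages themselves add to the data volume on top of the doubled request count.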

> There are plenty of alternatives for optimization (google's site map 
> file comes to mind, or the LINK: http header) that are not prone to 
> unnecessarily introducing avoidable ambiguity on the semantic web.

The people working on http://www.sindice.com/ have proposed a site map 
extension for optimizing crawling; see http://purl.uniprot.org/sitemap.xml.

The Link header sounds like a good idea (I had never heard of it before), but 
at the moment it seems simpler for someone who wants only RDF documents to 
set an Accept header. This also ensures that you are not redirected, and do 
not waste a request, for a resource that doesn't even have RDF.
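
A minimal sketch of the Accept header approach, using Python's standard urllib. The example URL and both helper names are hypothetical, and application/rdf+xml is assumed to be the media type the server offers for RDF:

```python
import urllib.request

RDF_MEDIA_TYPE = "application/rdf+xml"  # assumed RDF serialization


def build_rdf_request(url):
    """Build a GET request whose Accept header asks only for RDF/XML."""
    return urllib.request.Request(url, headers={"Accept": RDF_MEDIA_TYPE})


def is_rdf(content_type):
    """Return True if a Content-Type header value denotes RDF/XML."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == RDF_MEDIA_TYPE


# Hypothetical resource URL, for illustration only; no request is sent here.
request = build_rdf_request("http://example.org/resource/123")
print(request.get_header("Accept"))
```

A crawler would pass the request to urllib.request.urlopen and check the response's Content-Type with is_rdf before parsing. How a server answers when it has no RDF for a resource (for example with an error status) depends on the server.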

Received on Monday, 16 July 2007 09:17:21 UTC