Re: Think before you write Semantic Web crawlers from Henry Story on 2011-06-21 (public-lod@w3.org from June 2011)

From: Henry Story <henry.story@bblfish.net>
Date: Tue, 21 Jun 2011 13:06:48 +0200
To: Kingsley Idehen <kidehen@openlinksw.com>
Cc: public-lod@w3.org
Message-Id: <D8F7DEA5-2C14-45D5-AADF-FE7BE2AA2157@bblfish.net>

On 21 Jun 2011, at 12:23, Kingsley Idehen wrote:

> On 6/21/11 10:54 AM, Henry Story wrote:
>> 
>> Then you could just redirect him straight to the n3 dump of graphs of your site (I say graphs because your site not necessarily being consistent, the crawler may be interested in keeping information about which pages said what)
>> Redirect may be a bit harsh. So you could at first link him to the dump
> 
> Only trouble with the above, is that many don't produce graph dumps anymore, they just have SPARQL endpoints, then you pound the endpoints and hit timeouts etc..

I would say it is even more important to place SPARQL endpoints behind WebID authentication. If you don't, you are open to horrendous queries that would be better answered by downloading the dump. There is also no way of distinguishing good customers from bad ones, so you end up serving everyone badly.
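
To make that concrete, here is a minimal sketch of the kind of gate I have in mind. It assumes the TLS layer has already verified the client's WebID and hands us the URI. The class name, the budget numbers, the status codes and the dump URL are all my own placeholders, not part of any WebID library.

```python
import time
from collections import defaultdict, deque

# Hypothetical location of the site's dump; replace with your own.
DUMP_URL = "http://example.org/dumps/all.n3"


class EndpointGate:
    """Per-agent sliding-window query budget for a SPARQL endpoint."""

    def __init__(self, max_queries=5, window_seconds=60.0, clock=time.monotonic):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self.clock = clock
        # WebID URI -> timestamps of that agent's recent accepted queries
        self.history = defaultdict(deque)

    def decide(self, webid):
        """Return (http_status, location) for a query from agent `webid`.

        `webid` is assumed to be already verified; None means the client
        did not authenticate at all.
        """
        if webid is None:
            return (401, None)  # assumption: anonymous agents must authenticate first
        now = self.clock()
        recent = self.history[webid]
        # Forget queries that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_queries:
            # Budget used up: send the agent to the dump instead.
            return (303, DUMP_URL)
        recent.append(now)
        return (200, None)
```

Because every decision is keyed on the WebID, a well-behaved customer is never penalised for what a greedy crawler does, and the greedy one gets pointed at the dump rather than being cut off.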

The closest thing on the web to SPARQL endpoints is the query interface of a search engine. But search engines purposefully limited the queries they had to answer to simple + - logic. And for an engine like AltaVista, which was owned by Digital Equipment Corporation (DEC), a hardware manufacturer, the point in 1995 was to show off the power of their 64-bit chips and hardware. The more load those servers could take, the stronger the marketing for that hardware.
So their business model was to sell hardware. Unless you want everyone to deploy huge numbers of machines for every SPARQL endpoint - and support the construction of a large number of nuclear power stations to feed that need - you need to control access more carefully at the source. The best policy is to allow all access, but keep an eye open for abuse. Here too, a WebID pointing to an e-mail address or pingback endpoint could be very useful.

:spider a web:Crawler;
   foaf:mbox <mailto:crawler@open-uni.edu>;
   doap:project <http://gitub.org/rdf-crawler/> .

Information like that could be very useful of course.
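
As a rough illustration of how an operator might use such a profile when a crawler misbehaves, the sketch below pulls the contact address out of the profile text. It is a naive regular-expression scan for one predicate, not a Turtle parser, and it only handles the prefixed `foaf:mbox` form shown above. The function name and the example profile values are made up for the illustration.

```python
import re

# Naive pattern for the single predicate we care about; NOT a Turtle parser.
MBOX_PATTERN = re.compile(r"foaf:mbox\s+<mailto:([^>\s]+)>")


def contact_addresses(profile_text):
    """Return the e-mail addresses given as foaf:mbox in a profile document."""
    return MBOX_PATTERN.findall(profile_text)


# Hypothetical crawler profile, in the same shape as the snippet above.
EXAMPLE_PROFILE = """
:spider a web:Crawler;
   foaf:mbox <mailto:ops@crawler.example>;
   doap:project <http://crawler.example/project/> .
"""

if __name__ == "__main__":
    print(contact_addresses(EXAMPLE_PROFILE))
```

A real deployment would want a proper RDF parser so that full IRIs and other serialisations are handled too.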
> 
> A looong time ago, very early LOD days, we (LOD community) talked about the importance of dumps with the heuristic you describe in mind (no WebID then, but it was clear something would emerge). Unfortunately, SPARQL endpoints have become the first point of call re. Linked Data, even though SPARQL endpoint only == asking for trouble if you can't self-protect the endpoint and re-route agents to dumps.

Yes, SPARQL in unwise hands can lead to serious explosions.

> Maybe we can use WebID and recent troubles as basis for reestablishing this most vital of best practices re. Linked Data publication. Of course, this is also awesome dog-fooding too!

The WebID community (née foaf+ssl) is really keen to help, I am sure. We have libs in all languages ready to go. WebID is especially easy to implement for server-to-server communication, btw.

Henry

> 
> -- 
> 
> Regards,
> 
> Kingsley Idehen	
> President & CEO
> OpenLink Software
> Web: http://www.openlinksw.com
> Weblog: http://www.openlinksw.com/blog/~kidehen
> Twitter/Identi.ca: kidehen

Social Web Architect
http://bblfish.net/

Received on Tuesday, 21 June 2011 11:07:30 UTC