Re: Think before you write Semantic Web crawlers

On 6/21/11 12:06 PM, Henry Story wrote:
> On 21 Jun 2011, at 12:23, Kingsley Idehen wrote:
>
>> On 6/21/11 10:54 AM, Henry Story wrote:
>>> Then you could just redirect him straight to the n3 dump of the graphs of your site (I say graphs because, your site not necessarily being consistent, the crawler may be interested in keeping information about which pages said what).
>>> Redirect may be a bit harsh. So you could at first link him to the dump.
>> The only trouble with the above is that many don't produce graph dumps anymore; they just have SPARQL endpoints, so you pound the endpoints and hit timeouts, etc.
> I would say it is even more important to place SPARQL endpoints behind WebID authentication. If you don't do that you
> are open to horrendous queries being asked that would be better solved by downloading the dump.

Yes, but even WebID alone doesn't protect against inadvertent DoS. This 
is why the SPARQL engine needs server-side capabilities that control 
(a sample configuration follows this list):

1. Result set size
2. Query fulfillment timeouts
3. Granular query cost optimization.

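In Virtuoso, for instance, these three knobs roughly correspond to 
settings in the [SPARQL] section of virtuoso.ini (the values below are 
illustrative, not recommendations -- check the docs for your version):

[SPARQL]
ResultSetMaxRows           = 10000 ; cap on rows returned per query
MaxQueryCostEstimationTime = 400   ; reject plans the optimizer prices above this (seconds)
MaxQueryExecutionTime      = 60    ; wall-clock budget per query (seconds)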

>   Also there is no way of distinguishing good customers from bad ones, and so you end up serving everyone badly.

Solved when you can apply fine-grained QoS based on the features above. 
This is really the starting point for serious SPARQL endpoints. It's 
been in Virtuoso forever; otherwise DBpedia wouldn't have been possible. 
Ditto the LOD cloud cache, ditto Sindice's endpoint, and lots of other 
heavy-duty endpoints.

> The closest similar thing to SPARQL endpoints on the web is search engine query interfaces. But those purposefully limited the queries they would answer to simple +/- logic. And for engines like AltaVista, which was owned by Digital Equipment Corporation (DEC), a hardware manufacturer, the point was to show off the power of their 64-bit chips and hardware in 1995. The more load those servers could take, the stronger their marketing for their hardware could be.
> So their business model was to sell hardware. Unless you want everyone to deploy huge numbers of machines for every SPARQL endpoint - and support the construction of a large number of nuclear power stations to feed that need - you need to control access more carefully at the source. The best policy is to allow all access, but keep an eye open for abuse. Here also a WebID pointing to an e-mail address or pingback endpoint could be very useful.
>
> :spider a web:Crawler ;
>     foaf:mbox <mailto:crawler@open-uni.edu> ;
>     doap:project <http://github.org/rdf-crawler/> .
>
> Information like that could be very useful of course.

It can be more granular than that.

Example rules (a Turtle sketch follows the list):

1. Henry (verified via the WebID carried by his HTTP User Agent) can 
execute queries with higher fulfillment costs than "Joe SemWeb Project 
Researcher" (who doesn't have a WebID)
2. A User Agent from a given domain, presenting a WebID, can execute 
queries that are allowed N milliseconds for fulfillment plan 
construction
3. Ditto but for actual time
4. Ditto but for partial results
5. etc...

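To make this concrete, here is one way such rules could be written down 
in Turtle. The qos: vocabulary is made up purely for illustration (as is 
the WebID URI); the point is the shape of the data, not the terms:

@prefix qos: <http://example.org/ns/qos#> .    # hypothetical vocabulary

<#henry> a qos:Policy ;
    qos:agent <https://example.org/henry#me> ; # WebID from the client cert
    qos:maxPlanMillis   500 ;     # budget for fulfillment plan construction
    qos:maxExecMillis   120000 ;  # budget for actual execution time
    qos:maxRows         100000 .

<#anonymous> a qos:Policy ;       # applies when no WebID is presented
    qos:maxExecMillis   5000 ;
    qos:maxRows         1000 ;
    qos:partialResults  true .    # return whatever is ready at the deadline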

>> A looong time ago, in very early LOD days, we (the LOD community) talked about the importance of dumps, with the heuristic you describe in mind (no WebID then, but it was clear something would emerge). Unfortunately, SPARQL endpoints have become the first point of call re. Linked Data, even though SPARQL-endpoint-only == asking for trouble unless you can self-protect the endpoint and re-route agents to dumps.
> Yes, SPARQL in unwise hands can lead to serious explosions.

Yes, and in the InterWeb jungle you have to assume everyone is unwise :-)
>> Maybe we can use WebID and the recent troubles as a basis for re-establishing this most vital of best practices re. Linked Data publication. Of course, this is also awesome dog-fooding!
> The WebID community (née foaf+ssl) is really keen to help, I am sure. We have libs in all languages ready to go. WebID is especially easy to implement for server-to-server communication, btw.
>

Yep!

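For the curious, the heart of a server-to-server WebID check is small: 
take the WebID URI from the client certificate's subjectAltName, 
dereference it, and run one ASK against the fetched profile graph to 
confirm that it lists the certificate's public key. A sketch using the 
W3C cert ontology (the WebID URI and key values here are placeholders; 
older foaf+ssl profiles use the rsa: ontology instead):

PREFIX cert: <http://www.w3.org/ns/auth/cert#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

ASK FROM <https://example.org/henry> {          # the dereferenced profile
  <https://example.org/henry#me> cert:key [
      cert:modulus  "a1b2c3d4"^^xsd:hexBinary ; # modulus from the presented cert
      cert:exponent 65537                       # public exponent from the cert
  ] .
}

If the ASK returns true, the client controls both the private key and 
the profile document, and the WebID is verified.
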
Links:

1. http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtAuthPolicyFOAFSSL 
-- Virtuoso and WebID protection of SPARQL endpoints
2. http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtOAuthSPARQL 
-- Virtuoso and OAuth-based protection of SPARQL endpoints


Kingsley
> Henry
>
> Social Web Architect
> http://bblfish.net/


-- 

Regards,

Kingsley Idehen	
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Received on Tuesday, 21 June 2011 11:39:13 UTC