Re: Think before you write Semantic Web crawlers from Sebastian Schaffert on 2011-06-22 (public-lod@w3.org from June 2011)

From: Sebastian Schaffert <sebastian.schaffert@salzburgresearch.at>
Date: Wed, 22 Jun 2011 22:33:36 +0200
To: Martin Hepp <martin.hepp@ebusiness-unibw.org>
Cc: public-lod <public-lod@w3.org>
Message-Id: <A56C5BA2-5356-45EB-8997-6B034476CA36@salzburgresearch.at>
Martin,

I followed the thread a bit, and I have just a small and maybe naive question: what use is a Linked Data Web that does not even scale to the access of crawlers? And how do we expect agents to use Linked Data if we cannot provide technology that scales?

Your complaint sounds to me a bit like "help, too many clients access my data". I think worse things could happen. What we need to do is improve our technology, not whine about people trying to use our data. Even though it is not good if people stop *providing* Linked Data, it is also not good if people stop *using* Linked Data. And I find your plan to stop sending pings counter-productive.

My 2 cents to the discussion ... :-)

Greetings,

Sebastian


Am 22.06.2011 um 20:57 schrieb Martin Hepp:

> Jiri:
> The crawlers causing problems were run by Universities, mostly in the context of ISWC submissions. No need to cast any doubt on that.
> 
> All:
> As a consequence of those events, I will not publish sitemaps etc. of future GoodRelations datasets on these lists, but just inform non-toy consumers.
> If you consider yourself a non-toy consumer of e-commerce data, please send me an e-mail, and we will add you to our ping chain.
> 
> We will also stop sending pings to PTSW, Watson, Swoogle, et al., because they will just expose sites adopting GoodRelations and related technology to academic crawling.
> 
> In the meantime, I recommend the LOD bubble diagram sources for self-referential research.
> 
> Best
> M. Hepp
> 
> 
> 
> On Jun 22, 2011, at 4:03 PM, Jiří Procházka wrote:
> 
>> I understand that, but I doubt your conclusion that those crawlers are
>> targeting the semantic web, since, as you said, they don't even properly
>> identify themselves, and as far as I know universities also research
>> regular web search and crawling. Maybe a lot of them are targeting the
>> semantic web, but we should look at all measures to conserve bandwidth,
>> from avoiding regular web crawler interest and aiding infrastructure like
>> Ping the Semantic Web to optimizing delivery and even distribution of
>> the data among resources.
>> 
>> Best,
>> Jiri
>> 
>> On 06/22/2011 03:21 PM, Martin Hepp wrote:
>>> Thanks, Jiri, but the load comes from academic crawler prototypes firing from broad University infrastructures.
>>> Best
>>> Martin
>>> 
>>> 
>>> On Jun 22, 2011, at 12:40 PM, Jiří Procházka wrote:
>>> 
>>>> I wonder, are there ways to link RDF data so that conventional crawlers do
>>>> not crawl it, but only the semantic web aware ones do?
>>>> I am not sure how the current practice of linking by link tag in the
>>>> html headers could cause this, but it may be the case that those heavy loads
>>>> come from crawlers having nothing to do with the semantic web...
>>>> Maybe we should start linking to our rdf/xml, turtle, ntriples files and
>>>> publishing sitemap info in RDFa...
>>>> 
>>>> Best,
>>>> Jiri
>>>> 
>>>> On 06/22/2011 09:00 AM, Steve Harris wrote:
>>>>> While I don't agree with Andreas exactly that it's the site owner's fault, this is something that publishers of non-semantic data have to deal with.
>>>>> 
>>>>> If you publish a large collection of interlinked data which looks interesting to conventional crawlers and is expensive to generate, conventional web crawlers will be all over it. The main difference is that a greater percentage of those are written properly, to follow robots.txt and the guidelines about hit frequency (maximum 1 request per second per domain, no parallel crawling).
>>>>> 
>>>>> Has someone published similar guidelines for semantic web crawlers?
>>>>> 
>>>>> The ones that don't behave themselves get banned, either in robots.txt, or explicitly by the server. 
>>>>> 
>>>>> - Steve
>>>>> 
>>>>> On 2011-06-22, at 06:07, Martin Hepp wrote:
>>>>> 
>>>>>> Hi Daniel,
>>>>>> Thanks for the link! I will relay this to relevant site-owners.
>>>>>> 
>>>>>> However, I still challenge Andreas' statement that the site-owners are to blame for publishing large amounts of data on small servers.
>>>>>> 
>>>>>> One can publish 10,000 PDF documents on a tiny server without being hit by DoS-style crazy crawlers. Why should the same not hold if I publish RDF?
>>>>>> 
>>>>>> But for sure, it is necessary to advise all publishers of large RDF datasets to protect themselves against hungry crawlers and actual DoS attacks.
>>>>>> 
>>>>>> Imagine if a large site was brought down by a botnet that is exploiting Semantic Sitemap information for DoS attacks, focussing on the large dump files. 
>>>>>> This could end LOD experiments for that site.
>>>>>> 
>>>>>> 
>>>>>> Best
>>>>>> 
>>>>>> Martin
>>>>>> 
>>>>>> 
>>>>>> On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
>>>>>> 
>>>>>>> 
>>>>>>> Hi Martin,
>>>>>>> 
>>>>>>> Have you tried putting a Squid [1] as a reverse proxy in front of your servers and using delay pools [2] to catch hungry crawlers?
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Daniel
>>>>>>> 
>>>>>>> [1] http://www.squid-cache.org/
>>>>>>> [2] http://wiki.squid-cache.org/Features/DelayPools
>>>>>>> 
>>>>>>> On 21.06.2011, at 09:49, Martin Hepp wrote:
>>>>>>> 
>>>>>>>> Hi all:
>>>>>>>> 
>>>>>>>> For the third time in a few weeks, we had massive complaints from site-owners that Semantic Web crawlers from Universities visited their sites in a way close to a denial-of-service attack, i.e., crawling data with maximum bandwidth in a parallelized approach.
>>>>>>>> 
>>>>>>>> It's clear that a single, stupidly written crawler script, run from a powerful University network, can quickly create terrible traffic load. 
>>>>>>>> 
>>>>>>>> Many of the scripts we saw
>>>>>>>> 
>>>>>>>> - ignored robots.txt,
>>>>>>>> - ignored clear crawling speed limitations in robots.txt,
>>>>>>>> - did not identify themselves properly in the HTTP request header or lacked contact information therein, 
>>>>>>>> - used no mechanisms at all for limiting the default crawling speed and re-crawling delays.
>>>>>>>> 
>>>>>>>> This irresponsible behavior can be the final reason for site-owners to say farewell to academic/W3C-sponsored semantic technology.
>>>>>>>> 
>>>>>>>> So please, please - advise all of your colleagues and students NOT to write simple crawler scripts for the billion triples challenge or whatever without familiarizing themselves with the state of the art in "friendly crawling".
>>>>>>>> 
>>>>>>>> Best wishes
>>>>>>>> 
>>>>>>>> Martin Hepp
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
> 

Sebastian
-- 
| Dr. Sebastian Schaffert          sebastian.schaffert@salzburgresearch.at
| Salzburg Research Forschungsgesellschaft  http://www.salzburgresearch.at
| Head of Knowledge and Media Technologies Group          +43 662 2288 423
| Jakob-Haringer Strasse 5/II
| A-5020 Salzburg
Received on Wednesday, 22 June 2011 20:34:15 UTC
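
The "friendly crawling" rules named in the thread (obey robots.txt, honour its crawl-speed limits, identify the crawler with contact information, and keep to roughly one request per second per domain with no parallel fetching) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not code from any participant: the class name PoliteGate, the User-Agent string, and the example.org addresses are hypothetical, and the one-second default simply restates the rule of thumb quoted above. The sketch only decides whether and when a URL may be fetched; the actual HTTP request, which would send USER_AGENT in its header, is left to the caller.

```python
import time
import urllib.robotparser
from urllib.parse import urlsplit

# Hypothetical identity: a real crawler would name itself and give a contact.
USER_AGENT = "ExampleLDCrawler/0.1 (+mailto:crawler-admin@example.org)"
DEFAULT_DELAY = 1.0  # seconds between requests to one host (rule of thumb)


class PoliteGate:
    """Decides whether and when a URL may be fetched, host by host."""

    def __init__(self, user_agent=USER_AGENT, default_delay=DEFAULT_DELAY,
                 clock=time.monotonic):
        self.user_agent = user_agent
        self.default_delay = default_delay
        self.clock = clock
        self.rules = {}         # host -> parsed robots.txt
        self.next_allowed = {}  # host -> earliest time of the next request

    def load_robots(self, host, robots_txt):
        """Parse the text of a host's robots.txt (fetched by the caller)."""
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def allowed(self, url):
        """True only if robots.txt for the host is loaded and permits the URL."""
        parser = self.rules.get(urlsplit(url).netloc)
        return parser is not None and parser.can_fetch(self.user_agent, url)

    def delay_for(self, host):
        """The larger of the default delay and any Crawl-delay in robots.txt."""
        parser = self.rules.get(host)
        declared = parser.crawl_delay(self.user_agent) if parser else None
        return max(self.default_delay, declared or 0)

    def wait_time(self, url):
        """Seconds to sleep before fetching url; reserves the host's next slot."""
        host = urlsplit(url).netloc
        now = self.clock()
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.delay_for(host)
        return start - now
```

A single-threaded crawler would call allowed(url) first, then time.sleep(gate.wait_time(url)) before each request. Because each host has one shared schedule, requests to the same site are serialized rather than fired in parallel.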