Re: The Power of Virtuoso Sponger Technology from Hugh Glaser on 2009-10-18 (public-lod@w3.org from October 2009)

From: Hugh Glaser <hg@ecs.soton.ac.uk>
Date: Sun, 18 Oct 2009 15:33:39 +0100
To: "giovanni.tummarello@deri.org" <giovanni.tummarello@deri.org>
CC: "public-lod@w3.org" <public-lod@w3.org>, Sindice general discussions list <sindice-general@lists.deri.org>
Message-ID: <EMEW3|b422b1038d57b21ea5b18364abb72f87l9HFXp02hg|ecs.soton.ac.uk|128%hg@ecs.sot>
Hi.

On 18/10/2009 14:56, "Giovanni Tummarello" <g.tummarello@gmail.com> wrote:

> Hi Hugh, thanks for your contribution
> 
> 
> .. it turns out this discussion is in fact very very important and
Agreed.
> such feedback is indeed very useful
> 
> if i just get a sitemap from sponger (which is wrapping a sitemap from
> another site)
> then all i can do is really just crawl that sitemap, which would
> cause the sponger to be banned from the remote site which is being
> wrapped.
> 
> a way to avoid that is to implement a mechanism by which the sitemap
> tells me what it is doing "hey i wrap amazon, so i can be invoked
> with names of people and tell you books they might have written or
> names of books and give you prices or something else" and then sindice
> could use that on the fly when a request comes.
> 
> ... this makes sense.. but we're back to semantic web services are we not? :-)
> 
> i mean to be able to express the above sentence to the point where
> sindice or whoever else knows when to invoke that wrapper we'd have to
> come up with such a complicated description language that it would
> simply.. never be adopted. (see the SWS lesson)
My suggestion was that the wrapper should also wrap the Sitemap, as well as
the data itself.

I see a couple of ways of doing this, but there are probably more:

A) The wrapper's Semantic Sitemap points you at the original Sitemap, and
says how it is doing the wrapping. And because you know how the wrapper is
behaving, you can process the standard Sitemap to get the information you
want about what the wrapping site provides.
Actually, the "slicing" in the current spec is something similar to this -
my Linked Data site is a wrapper around my SPARQL endpoint, and I provide a
description of this along with dumps of the contents of the RDF store.

B) Another way is for the wrapper to actually process the Sitemap and data
dumps to produce a Semantic Sitemap and RDF dumps. Really wrapping the whole
site, not just the data. This would require no extra facilities at the
Sindice end.

Neither of these requires you to do any crawling of the data.
I guess you might be saying that these specs would be too complicated, but
it must be worth a shot? We have a lot to gain.
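
To make (B) a little more concrete, here is a minimal sketch in Python of
the Sitemap half of the idea: read the wrapped site's ordinary Sitemap and
re-publish it with every <loc> rewritten to the wrapper's URI for that page.
The wrapper prefix, the function names and the example URLs are all
hypothetical, made up for illustration. The Semantic Sitemap extension
elements and the RDF dumps are left out, so this shows only the URL
rewriting, not a full implementation.

```python
import xml.etree.ElementTree as ET

# Namespace of the standard sitemaps.org Sitemap protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical wrapper URI pattern; a real wrapper would define its own.
WRAPPER_PREFIX = "http://wrapper.example.org/about/rdf/"


def wrap_sitemap(original_xml, wrapper_prefix=WRAPPER_PREFIX):
    """Return a Sitemap whose <loc> entries point at the wrapper's URIs
    instead of the original pages."""
    # Serialise the sitemap namespace as the default namespace.
    ET.register_namespace("", SITEMAP_NS)
    root = ET.fromstring(original_xml)
    for loc in root.iter("{%s}loc" % SITEMAP_NS):
        # Prefix each original page URL with the wrapper's URI pattern.
        loc.text = wrapper_prefix + (loc.text or "").strip()
    return ET.tostring(root, encoding="unicode")


def list_locations(sitemap_xml):
    """Return the text of every <loc> element in a Sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.iter("{%s}loc" % SITEMAP_NS)]
```

An indexer could then read the re-published Sitemap and learn the wrapper's
URIs without crawling either site. For (A) the same rewriting would happen
at the Sindice end instead, driven by whatever description of the wrapping
the Semantic Sitemap provides.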

Cheers
Hugh
> 
> Search engines, on the other hand, are indeed allowed to crawl Amazon
> and other sites. so i could do my own sponging, why not, google does it.
> 
> so i guess it could boil down to
> 
> a) got native RDFa? ok we crawl you .. but we can't be that updated
> b) got native RDFa and understand that it is of value to you for
> engines to be very updated? then provide a dump. But we should not
> forget what the actual web is doing either :-) e.g. so probably we
> better start implementing asap this
> http://www.readwriteweb.com/archives/real-time_web_protocol_pubsubhubbub_explained.php
> PubSubHubbub thingie
> c) got nothing? well we might crawl you normally with some spongers on
> top? (but this is still something that puzzles me. One thing in
> sindice is that everything that's in there has been explicitly stated
> by data producers.. if i started to do this i'd be losing the ability
> to say this. Again i am not sure this really matters but you might
> lose the ability to claim fair use by saying the entire system is
> automated (AFAIU the main defense google has for collecting and showing
> (e.g. in the previews) all the material is that they.. collect it all
> automatically, no human intervention))
> 
> 
> cheers
> 
> Giovanni
> 
> 
> 
> On Sun, Oct 18, 2009 at 7:57 AM, Hugh Glaser <hg@ecs.soton.ac.uk> wrote:
>> Hi Guys,
>> I am puzzled by the whole discussion, so will try to summarise to find out
>> if I have some misunderstanding.
>> 
>> It really is "just" about finding where the URIs are, and search engines are
>> the game in town. We need to make it really easy for people to find the
>> Linked Data URIs they need. Wrappers make things a bit harder.
>> 
>> Juan asked if "Sindice crawled the whole regular web and checked the
>> Spongers for each URL (sic!)".
>> I read this as: "Can I use Sindice to find Linked Data URIs provided by the
>> Spongers?" Or to put it yet another way, "Does Sindice index the part of the
>> Semantic Web provided by the Spongers?"
>> 
>> One way to do this would be to do what Juan suggests - model what the
>> Spongers are doing, and then infer what the Linked Data URIs would be, based
>> on the URLs of the underlying web pages, having crawled them.
>> 
>> But there seems to me a much simpler and more principled way - the Sponger
>> should do it.
>> Spongers should provide Semantic Sitemaps (and of course voiD descriptions),
>> so that Sindice can index (not *crawl*, which I think has led to some of
>> the confusion) the sites.
>> 
>> How might this be done?
>> Well, certainly where the Sponger is connected to a particular site which
>> has an ordinary Sitemap, it could/should process it as part of the
>> connection with a site, and then re-publish the Semantic Sitemap. For sites
>> that don't have Sitemaps, it may/will be somewhat harder.
>> I may be misunderstanding Spongers as well, but it all seems pretty clean
>> and straightforward to me.
>> 
>> Great stuff, of course.
>> 
>> Best
>> Hugh
>> 
>>>> 
>>>> On Sat, Oct 17, 2009 at 3:32 PM, Juan Sequeda <juanfederico@gmail.com>
>>>> wrote:
>>>> 
>>>>> But Sindice could at least crawl Amazon.
>>>>> It would be great to use sig.ma to create a "meshup" with the amazon data.
>>>>> 
>>>>> 
>>>>> Juan Sequeda, Ph.D Student
>>>>> Dept. of Computer Sciences
>>>>> The University of Texas at Austin
>>>>> www.juansequeda.com
>>>>> www.semanticwebaustin.org
>>>>> 
>>>>> 
>>>>> On Sat, Oct 17, 2009 at 9:28 AM, Martin Hepp (UniBW)
>>>>> <hepp@ebusiness-unibw.org> wrote:
>>>>> 
>>>>>> I don't think so, because this would require that Sindice crawled the
>>>>>> whole regular web and checked the Spongers for each URL (sic!).
>>>>>> 
>>>>>> Juan Sequeda wrote:
>>>>>> 
>>>>>> Does Sindice crawl this (or any other semantic web search engines)?
>>>>>> Juan Sequeda, Ph.D Student
>>>>>> Dept. of Computer Sciences
>>>>>> The University of Texas at Austin
>>>>>> www.juansequeda.com
>>>>>> www.semanticwebaustin.org
>>>>>> 
>> 
>> 
>>
Received on Sunday, 18 October 2009 14:34:23 UTC