
Re: The Power of Virtuoso Sponger Technology

From: Giovanni Tummarello <g.tummarello@gmail.com>
Date: Sun, 18 Oct 2009 14:56:17 +0100
Message-ID: <210271540910180656v76a0ce06uac3689863afd763@mail.gmail.com>
To: Hugh Glaser <hg@ecs.soton.ac.uk>
Cc: "public-lod@w3.org" <public-lod@w3.org>, Sindice general discussions list <sindice-general@lists.deri.org>
Hi Hugh, thanks for your contribution


It turns out this discussion is in fact very important, and such
feedback is indeed very useful.

If I just get a sitemap from the Sponger (which is wrapping a sitemap
from another site), then all I can really do is crawl that sitemap,
which would cause the Sponger to be banned from the remote site that is
being wrapped.

A way to avoid that is to implement a mechanism by which the sitemap
tells me what it is doing: "hey, I wrap Amazon, so I can be invoked
with names of people and tell you books they might have written, or
with names of books and give you prices, or something else", and then
Sindice could use that on the fly when a request comes in.
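
To make the idea concrete, here is a minimal sketch of what such a
self-description might look like once an engine has read it. Everything
in it is an assumption made for illustration: the WrapperCapability
record, the "person-name" / "book-title" kinds, and the
wrapper.example URL templates are hypothetical and are not part of the
Sponger, Sindice, or any sitemap format.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class WrapperCapability:
    """Hypothetical description of one thing a wrapper can be invoked with."""
    input_kind: str    # what the wrapper accepts, e.g. "person-name"
    output_kind: str   # what it returns, e.g. "books-written"
    url_template: str  # hypothetical; "{value}" is replaced by the encoded input


# Hypothetical advertisement published by a wrapper around a bookshop site.
CAPABILITIES = [
    WrapperCapability("person-name", "books-written",
                      "http://wrapper.example/books-by?author={value}"),
    WrapperCapability("book-title", "price",
                      "http://wrapper.example/price-of?title={value}"),
]


def urls_to_invoke(input_kind, value, capabilities=CAPABILITIES):
    """Return the wrapper URLs worth calling for one incoming request."""
    encoded = quote(value, safe="")
    return [c.url_template.format(value=encoded)
            for c in capabilities
            if c.input_kind == input_kind]


print(urls_to_invoke("person-name", "Jane Austen"))
```

The engine only calls the wrapper when a request of a matching kind
arrives, so the wrapped site is never crawled in bulk. Note that the
two flat "kinds" already hide most of the hard part: saying precisely
what a wrapper accepts and returns.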

... this makes sense, but then we're back to Semantic Web Services, are we not? :-)

I mean, to be able to express the above sentence to the point where
Sindice, or whoever else, knows when to invoke that wrapper, we'd have
to come up with such a complicated description language that it would
simply never be adopted (see the SWS lesson).

Search engines, on the other hand, are indeed allowed to crawl Amazon
and other sites, so I could do my own sponging. Why not? Google does it.

So I guess it could boil down to:

a) Got native RDFa? OK, we crawl you, but we can't be that up to date.
b) Got native RDFa and understand that it is valuable for you that
engines are very up to date? Then provide a dump. But we should not
forget what the actual web is doing either :-) so we had probably
better start implementing this PubSubHubbub thingie ASAP:
http://www.readwriteweb.com/archives/real-time_web_protocol_pubsubhubbub_explained.php
c) Got nothing? Well, we might crawl you normally with some spongers on
top? (But this is still something that puzzles me. One thing about
Sindice is that everything in there has been explicitly stated
by data producers; if I started to do this I'd lose the ability
to say that. Again, I am not sure this really matters, but you might
lose the ability to claim fair use by saying the entire system is
automated. AFAIU the main defense Google has for collecting and showing
all the material (e.g. in the preview) is that they collect it all
automatically, with no human intervention.)
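
The three cases above can be sketched as a small decision function.
This is only an illustration under stated assumptions: the Site and
Plan records, their field names, and the step labels are hypothetical
and do not describe how Sindice actually works.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Site:
    """Hypothetical summary of what a site offers to an engine."""
    has_native_rdfa: bool
    provides_dump: bool = False
    supports_push: bool = False  # e.g. the site announces a PubSubHubbub hub


@dataclass
class Plan:
    steps: List[str]
    producer_stated: bool  # True if all indexed data was published by the site itself


def ingestion_plan(site: Site) -> Plan:
    """Pick an ingestion strategy following cases a), b) and c)."""
    if not site.has_native_rdfa:
        # Case c): crawl and sponge; the data is no longer producer-stated.
        return Plan(["crawl", "apply-sponger"], producer_stated=False)
    steps = []
    if site.provides_dump:
        steps.append("load-dump")             # case b): dump keeps us up to date
    if site.supports_push:
        steps.append("subscribe-to-updates")  # case b): push notifications
    if not steps:
        steps.append("crawl")                 # case a): plain, slower crawl
    return Plan(steps, producer_stated=True)


print(ingestion_plan(Site(has_native_rdfa=True, provides_dump=True)))
```

The producer_stated flag is there to keep the concern in c) visible:
once spongers are applied on top of a normal crawl, the index can no
longer claim that everything in it was explicitly stated by the data
producers.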


cheers

Giovanni



On Sun, Oct 18, 2009 at 7:57 AM, Hugh Glaser <hg@ecs.soton.ac.uk> wrote:
> Hi Guys,
> I am puzzled by the whole discussion, so will try to summarise to find out
> if I have some misunderstanding.
>
> It really is "just" about finding where the URIs are, and search engines are
> the game in town. We need to make it really easy for people to find the
> Linked Data URIs they need. Wrappers make things a bit harder.
>
> Juan asked if "Sindice crawled the whole regular web and checked the
> Spongers for each URL (sic!)".
> I read this as: "Can I use Sindice to find Linked Data URIs provided by the
> Spongers?" Or to put it yet another way, "Does Sindice index the part of the
> Semantic Web provided by the Spongers?"
>
> One way to do this would be to do what Juan suggests - model what the
> Spongers are doing, and then infer what the Linked Data URIs would be, based
> on the URLs of the underlying web pages, having crawled them.
>
> But there seems to me a much simpler and more principled way - the Sponger
> should do it.
> Spongers should provide Semantic Sitemaps (and of course voiD descriptions),
> so that Sindice can index (not *crawl*, which I think has led to some of
> the confusion) the sites.
>
> How might this be done?
> Well, certainly where the Sponger is connected to a particular site which
> has an ordinary Sitemap, it could/should process it as part of the
> connection with a site, and then re-publish the Semantic Sitemap. For sites
> that don't have Sitemaps, it may/will be somewhat harder.
> I may be misunderstanding Spongers as well, but it all seems pretty clean
> and straightforward to me.
>
> Great stuff, of course.
>
> Best
> Hugh
>
>>>
>>> On Sat, Oct 17, 2009 at 3:32 PM, Juan Sequeda <juanfederico@gmail.com> wrote:
>>>
>>>> But Sindice could at least crawl Amazon.
>>>> It would be great to use sig.ma to create a "meshup" with the amazon data.
>>>>
>>>>
>>>> Juan Sequeda, Ph.D Student
>>>> Dept. of Computer Sciences
>>>> The University of Texas at Austin
>>>> www.juansequeda.com
>>>> www.semanticwebaustin.org
>>>>
>>>>
>>>> On Sat, Oct 17, 2009 at 9:28 AM, Martin Hepp (UniBW)
>>>> <hepp@ebusiness-unibw.org> wrote:
>>>>
>>>>> I don't think so, because this would require that Sindice crawled the
>>>>> whole regular web and checked the Spongers for each URL (sic!).
>>>>>
>>>>> Juan Sequeda wrote:
>>>>>
>>>>> Does Sindice crawl this (or any other semantic web search engines)?
>>>>> Juan Sequeda, Ph.D Student
>>>>> Dept. of Computer Sciences
>>>>> The University of Texas at Austin
>>>>> www.juansequeda.com
>>>>> www.semanticwebaustin.org
>>>>>
>
>
>
Received on Sunday, 18 October 2009 13:57:12 UTC
