Re: The Power of Virtuoso Sponger Technology from Martin Hepp (UniBW) on 2009-10-18 (public-lod@w3.org from October 2009)

From: Martin Hepp (UniBW) <martin.hepp@ebusiness-unibw.org>
Date: Sun, 18 Oct 2009 09:37:14 +0200
To: giovanni.tummarello@deri.org
CC: Juan Sequeda <juanfederico@gmail.com>, hepp@ebusiness-unibw.org, "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <4ADAC5AA.4050906@ebusiness-unibw.org>
Guys,
the Web of Data cannot rely on mass data crawling of the whole Web but 
must combine cached data with federated on-demand queries. Structured 
data requires much faster update cycles than typical text-based Web 
indices. For example, Google and Yahoo can rely on the fact that 
"http://www.cnn.com" is relevant for "news". That will not change within 
minutes. Yet both Google and Yahoo may take up to several weeks to visit 
your page again.

When it comes to structured price and availability information, your 
data may become outdated within hours, if not seconds. Think of eBay 
auctions, hotel or flight availability, etc.

So it will boil down to technology that combines (1) crawling and 
caching rather stable data sets with (2) distributing queries and parts 
of queries among the right SPARQL endpoints (whatever actual DB 
technology they expose).
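
To make the combination of (1) and (2) a bit more concrete, here is a 
minimal Python sketch, not an actual implementation. It assumes each 
lookup is tagged as "stable" or "volatile"; the names HybridStore, 
MAX_AGE, and fetch_live, as well as the freshness budgets, are 
hypothetical and chosen only for illustration.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical freshness budgets in seconds (assumed values, not measured ones).
MAX_AGE = {"stable": 7 * 24 * 3600, "volatile": 60}


class HybridStore:
    """Serve data from a cache while it is fresh enough, else ask the source."""

    def __init__(self, fetch_live: Callable[[str], str],
                 clock: Callable[[], float] = time.time):
        self._fetch_live = fetch_live  # stands in for an on-demand query
        self._clock = clock            # injectable so the sketch can be tested
        self._cache: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str, kind: str) -> str:
        now = self._clock()
        entry = self._cache.get(key)
        # Reuse the cached copy only if it is within the budget for this kind.
        if entry is not None and now - entry[0] <= MAX_AGE[kind]:
            return entry[1]
        # Otherwise go to the live source and remember when we asked.
        value = self._fetch_live(key)
        self._cache[key] = (now, value)
        return value
```

With budgets like these, a weekly crawl would be enough for the 
"stable" entries, while an auction price would be re-fetched from its 
source after about a minute.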

You can keep a text index of the whole Web if crawling cycles on the 
order of weeks are fine. For structured, linked data that exposes 
dynamic database content, "dumb" crawling and caching will not scale.

If the DB technology is able to involve the right set of endpoints for 
parts of the query, why would you need a complete replication of all 
databases in the world inside one huge repository?

That repository will be a million-node cluster anyway. Why not directly 
use the millions of nodes that provide the data and cache just the 
endpoint meta-data?
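
As a rough illustration of caching only the endpoint meta-data, the 
following Python sketch assigns the triple patterns of a query to 
endpoints by predicate namespace. The endpoint URLs, the 
ENDPOINT_METADATA table, and the route function are all made up for 
this example; a real system would need richer meta-data and would also 
have to join the partial results.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

TriplePattern = Tuple[str, str, str]

# Hypothetical cached meta-data: predicate namespaces each endpoint can answer.
ENDPOINT_METADATA: Dict[str, List[str]] = {
    "http://shop.example.org/sparql": ["http://purl.org/goodrelations/v1#"],
    "http://people.example.org/sparql": ["http://xmlns.com/foaf/0.1/"],
}


def route(patterns: List[TriplePattern]
          ) -> Tuple[Dict[str, List[TriplePattern]], List[TriplePattern]]:
    """Assign each triple pattern to every endpoint whose meta-data matches."""
    plan: Dict[str, List[TriplePattern]] = defaultdict(list)
    unrouted: List[TriplePattern] = []
    for pattern in patterns:
        predicate = pattern[1]
        targets = [
            endpoint
            for endpoint, namespaces in ENDPOINT_METADATA.items()
            if any(predicate.startswith(ns) for ns in namespaces)
        ]
        if not targets:
            # E.g. a variable predicate: the meta-data cannot narrow it down.
            unrouted.append(pattern)
        for endpoint in targets:
            plan[endpoint].append(pattern)
    return dict(plan), unrouted
```

Only the small table of endpoint meta-data is held centrally here; the 
data itself stays on the nodes that provide it.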

Martin



Giovanni Tummarello wrote:
> With respect to crawling and "scraping" or "sponging" or .. "trying to
> guess" based on partial fragments of structured information, I can say
> 3 things:
>
> a) No, we're not doing it at the moment; we are only covering those
> who chose to publish structured semantics. Some book stuff shows up in
> Sig.ma .. e.g. http://sig.ma/search?q=frank+van+harmelen&sources=100
> bookfinder, our jerome digital library installation, but the triples
> they provide are scarce and don't contribute much.  It would take so
> little for this to improve on their side, I believe.
>
> b) No, we are not religious about this. We have talked about it
> several times; it might make sense to try to understand as much of the
> web as possible and index it. Maybe we'll do it in the future for
> selected fractions of the web to show how it looks.
>
> c) Crawling should be just one means of acquiring the semantic web. In
> the case of BestBuy or other large retailers, where prices possibly
> change every day, crawling as a means to emulate a simple call to a web
> service seems really not the smart thing to do. Will data providers
> really support this with data dumps?
>
> cheers
> Giovanni
>
>
> On Sat, Oct 17, 2009 at 3:32 PM, Juan Sequeda <juanfederico@gmail.com> wrote:
>   
>> But Sindice could at least crawl Amazon.
>> It would be great to use sig.ma to create a "meshup" with the amazon data.
>>
>>
>> Juan Sequeda, Ph.D Student
>> Dept. of Computer Sciences
>> The University of Texas at Austin
>> www.juansequeda.com
>> www.semanticwebaustin.org
>>
>>
>> On Sat, Oct 17, 2009 at 9:28 AM, Martin Hepp (UniBW)
>> <hepp@ebusiness-unibw.org> wrote:
>>     
>>> I don't think so, because this would require that Sindice crawled the
>>> whole regular web and checked the Spongers for each URL (sic!).
>>>
>>> Juan Sequeda wrote:
>>>
>>> Does Sindice (or any other semantic web search engine) crawl this?
>>> Juan Sequeda, Ph.D Student
>>> Dept. of Computer Sciences
>>> The University of Texas at Austin
>>> www.juansequeda.com
>>> www.semanticwebaustin.org
>>>
>>>
>>> On Sat, Oct 17, 2009 at 4:24 AM, Martin Hepp (UniBW) <
>>> hepp@ebusiness-unibw.org> wrote:
>>>
>>>
>>>
>>> Dear all:
>>>
>>> I just found out that the Virtuoso Sponger technology is even more
>>> powerful than I thought.
>>>
>>> Briefly: "Spongers" create rich GoodRelations (and other RDF) meta-data
>>> for existing Web pages on-the-fly. Unlike traditional
>>> screen-scraping approaches, Spongers reuse public APIs and other
>>> techniques, so the data has an unprecedented degree of structure.
>>>
>>> Now, this can be directly used in arbitrary queries... by simply using
>>> the URI of the *existing* HTML Web page in the FROM clause of a SPARQL
>>> query.
>>>
>>> Example:
>>>
>>> http://www.amazon.com/Semantic-Web-Real-World-Applications-Industry/dp/0387485309
>>>
>>> is a Web page in plain HTML offering a book. Amazon does not yet produce
>>> GoodRelations meta-data on their pages.
>>>
>>> If you go to
>>>
>>>    http://uriburner.com/sparql
>>>
>>> and paste the URI in the "Default Graph URI " field and select "Retrieve
>>> remote RDF for all missing source graphs", then a query like
>>>
>>>   "SELECT * WHERE {?s ?p ?o} LIMIT 50"
>>>
>>> returns a fully-fledged GoodRelations description for that page - as if
>>> Amazon were already supporting GoodRelations for each of its more than
>>> 4 million items!
>>>
>>> There are spongers for BestBuy, eBay, Zillow, and many other types of
>>> resources.
>>>
>>> Wow!
>>>
>>> Congrats to Kingsley and his team!
>>>
>>> Best wishes
>>>
>>> Martin Hepp
>>>
>>> --
>>> --------------------------------------------------------------
>>> martin hepp
>>> e-business & web science research group
>>> universitaet der bundeswehr muenchen
>>>
>>> e-mail:  hepp@ebusiness-unibw.org
>>> phone:   +49-(0)89-6004-4217
>>> fax:     +49-(0)89-6004-4620
>>> www:     http://www.unibw.de/ebusiness/ (group)
>>>         http://www.heppnetz.de/ (personal)
>>> skype:   mfhepp
>>> twitter: mfhepp
>>>
>>> Check out GoodRelations for E-Commerce on the Web of Linked Data!
>>> =================================================================
>>>
>>> Webcast:
>>> http://www.heppnetz.de/projects/goodrelations/webcast/
>>>
>>> Recipe for Yahoo SearchMonkey:
>>> http://www.ebusiness-unibw.org/wiki/GoodRelations_and_Yahoo_SearchMonkey
>>>
>>> Talk at the Semantic Technology Conference 2009:
>>> "Semantic Web-based E-Commerce: The GoodRelations Ontology"
>>>
>>>
>>> http://www.slideshare.net/mhepp/semantic-webbased-ecommerce-the-goodrelations-ontology-1535287
>>>
>>> Overview article on Semantic Universe:
>>>
>>>
>>> http://www.semanticuniverse.com/articles-semantic-web-based-e-commerce-webmasters-get-ready.html
>>>
>>> Project page:
>>> http://purl.org/goodrelations/
>>>
>>> Resources for developers:
>>> http://www.ebusiness-unibw.org/wiki/GoodRelations
>>>
>>> Tutorial materials:
>>> CEC'09 2009 Tutorial: The Web of Data for E-Commerce: A Hands-on
>>> Introduction to the GoodRelations Ontology, RDFa, and Yahoo! SearchMonkey
>>>
>>>
>>> http://www.ebusiness-unibw.org/wiki/Web_of_Data_for_E-Commerce_Tutorial_IEEE_CEC%2709
>>>
>>>
>>>       
>>     
>
>
>   

-- 
--------------------------------------------------------------
martin hepp
e-business & web science research group
universitaet der bundeswehr muenchen

e-mail:  hepp@ebusiness-unibw.org
phone:   +49-(0)89-6004-4217
fax:     +49-(0)89-6004-4620
www:     http://www.unibw.de/ebusiness/ (group)
         http://www.heppnetz.de/ (personal)
skype:   mfhepp 
twitter: mfhepp
Received on Sunday, 18 October 2009 07:37:48 UTC