Re: The Power of Virtuoso Sponger Technology from Giovanni Tummarello on 2009-10-18 (public-lod@w3.org from October 2009)

From: Giovanni Tummarello <giovanni.tummarello@deri.org>
Date: Sun, 18 Oct 2009 15:33:36 +0100
To: martin.hepp@ebusiness-unibw.org
Cc: Juan Sequeda <juanfederico@gmail.com>, hepp@ebusiness-unibw.org, "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <210271540910180733r726d9e1ehe07df7a148faeb33@mail.gmail.com>
I agree with this: a combination of the two, without going into unrealistic
service descriptions, is exactly the question.

It's great to be talking about this.

I'd gladly have a chat about all this at ISWC with those who are there.

Cheers
Giovanni


On Sun, Oct 18, 2009 at 8:37 AM, Martin Hepp (UniBW)
<martin.hepp@ebusiness-unibw.org> wrote:
> Guys,
> the Web of Data cannot rely on mass data crawling of the whole Web but must
> combine cached data with federated on-demand queries. Structured data
> requires much faster update cycles than typical text-based Web indices. For
> example, Google and Yahoo can rely on the fact that "http://www.cnn.com" is
> relevant for "news". That will not change within minutes. And both Google
> and Yahoo need up to several weeks to visit your page again.
>
> When it comes to structured price and availability information, your data
> may become outdated within hours, if not seconds. Think of eBay auctions,
> hotel or flight availability, etc.
>
> So it will boil down to technology that combines (1) crawling and caching
> rather stable data sets with (2) distributing queries and parts of queries
> among the right SPARQL endpoints (whatever actual DB technology they
> expose).
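
A rough sketch of point (1) and the fallback to (2), purely illustrative:
the class name FreshnessCache and the fetch_live callback are hypothetical
and not taken from any existing system. Stable data is served from the cache
while it is younger than a caller-chosen maximum age; anything older is
fetched live again (for example from a SPARQL endpoint).

```python
import time


class FreshnessCache:
    """Serve cached values while they are fresh enough, else re-fetch live.

    fetch_live is a hypothetical callback standing in for a live query;
    clock is injectable so the behaviour can be checked without waiting.
    """

    def __init__(self, fetch_live, clock=time.time):
        self._fetch_live = fetch_live
        self._clock = clock
        self._entries = {}  # key -> (value, time the value was fetched)

    def get(self, key, max_age_seconds):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] <= max_age_seconds:
            return entry[0]  # cached copy is still fresh enough
        value = self._fetch_live(key)  # stale or missing: ask the source
        self._entries[key] = (value, now)
        return value
```

A price lookup would pass a small max_age_seconds, while a book title could
tolerate days or weeks.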
>
> You can keep a text index of the whole Web, if crawling cycles in the order
> of magnitude of weeks are fine. For structured, linked data that exposes
> dynamic database content, "dumb" crawling and caching will not scale.
>
> If the DB technology is able to involve the right set of endpoints for parts
> of the query, why would you need a complete replication of all databases in
> the world inside one huge repository?
>
> That repository will be a million-node cluster anyway. Why not directly use
> the millions of nodes that provide the data and cache just the endpoint
> meta-data?
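
To make "cache just the endpoint meta-data" concrete, here is a minimal
sketch under strong assumptions: the endpoint URLs are made-up examples, the
metadata is reduced to the set of predicates each endpoint can answer, and
route_patterns is a hypothetical helper, not an existing API. It only decides
which triple patterns of a query go to which endpoint; it does not run
anything.

```python
from collections import defaultdict

# Hypothetical cached endpoint metadata: endpoint URL -> predicates it serves.
ENDPOINT_METADATA = {
    "http://example.org/shop-a/sparql": {
        "gr:hasPriceSpecification",
        "gr:availableAtOrFrom",
    },
    "http://example.org/library/sparql": {"dc:title", "dc:creator"},
}


def route_patterns(patterns, metadata):
    """Group (subject, predicate, object) patterns by the endpoints that
    advertise the predicate. Patterns no endpoint covers go under None."""
    plan = defaultdict(list)
    for pattern in patterns:
        predicate = pattern[1]
        targets = [ep for ep, preds in metadata.items() if predicate in preds]
        if not targets:
            plan[None].append(pattern)
        for endpoint in targets:
            plan[endpoint].append(pattern)
    return dict(plan)


plan = route_patterns(
    [
        ("?offer", "gr:hasPriceSpecification", "?price"),
        ("?book", "dc:title", "?title"),
        ("?person", "foaf:name", "?name"),
    ],
    ENDPOINT_METADATA,
)
```

A real planner would also need joins across endpoints and some notion of
endpoint reliability; this only shows that the cached part can stay small.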
>
> Martin
>
>
>
> Giovanni Tummarello wrote:
>
> With respect to crawling and "scraping" or "sponging" or "trying to
> guess" based on partial fragments of structured information, I can say
> three things:
>
> a) No, we're not doing it at the moment; we are only covering those
> who chose to publish structured semantics. Some book stuff shows up in
> Sig.ma, e.g. http://sig.ma/search?q=frank+van+harmelen&sources=100
> bookfinder, our jerome digital library installation, but the triples
> they provide are scarce and don't contribute much. It would take so
> little for this to improve on their side, I believe.
>
> b) No, we are not religious about this. We have talked about it
> several times; it might make sense to try to understand as much of the
> web as possible and index it. Maybe we'll do it in the future for
> selected fractions of the web to show how it looks.
>
> c) Crawling should be just one means of acquiring the semantic web. In
> the case of BestBuy or other large retailers, where prices possibly
> change every day, crawling as a means to emulate a simple call to a web
> service seems really not the smart thing to do. Will data providers
> really support this with data dumps?
>
> cheers
> Giovanni
>
>
> On Sat, Oct 17, 2009 at 3:32 PM, Juan Sequeda <juanfederico@gmail.com>
> wrote:
>
>
> But Sindice could at least crawl Amazon.
> It would be great to use sig.ma to create a "meshup" with the amazon data.
>
>
> Juan Sequeda, Ph.D Student
> Dept. of Computer Sciences
> The University of Texas at Austin
> www.juansequeda.com
> www.semanticwebaustin.org
>
>
> On Sat, Oct 17, 2009 at 9:28 AM, Martin Hepp (UniBW)
> <hepp@ebusiness-unibw.org> wrote:
>
>
> I don't think so, because this would require that Sindice crawled the
> whole regular web and checked the Spongers for each URL (sic!).
>
> Juan Sequeda wrote:
>
> Does Sindice (or any other semantic web search engine) crawl this?
>
>
> On Sat, Oct 17, 2009 at 4:24 AM, Martin Hepp (UniBW) <
> hepp@ebusiness-unibw.org> wrote:
>
>
>
> Dear all:
>
> I just found out that the Virtuoso Sponger technology is even more
> powerful than I thought.
>
> Briefly: "Spongers" create rich GoodRelations (and other RDF) meta-data
> for existing Web pages on-the-fly. Unlike traditional
> screen-scraping approaches, Spongers reuse public APIs and other
> techniques, so the data has an unprecedented degree of structure.
>
> Now, this can be directly used in arbitrary queries... by simply using
> the URI of the *existing* HTML Web page in the FROM clause of a SPARQL
> query.
>
> Example:
>
> http://www.amazon.com/Semantic-Web-Real-World-Applications-Industry/dp/0387485309
>
> is a Web page in plain HTML offering a book. Amazon does not yet produce
> GoodRelations meta-data on their pages.
>
> If you go to
>
>    http://uriburner.com/sparql
>
> and paste the URI in the "Default Graph URI" field and select "Retrieve
> remote RDF for all missing source graphs", then a query like
>
>   "SELECT * WHERE {?s ?p ?o} LIMIT 50"
>
> returns a fully-fledged GoodRelations description for that page - as if
> Amazon were already supporting GoodRelations for each of its more than
> 4 million items!
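
For anyone who wants to script the steps above rather than use the form,
here is a minimal sketch. It only builds the request URL and does not send
it. The parameter names "query" and "default-graph-uri" come from the SPARQL
Protocol; how the "Retrieve remote RDF for all missing source graphs" option
is transmitted is endpoint-specific and deliberately left out here. The two
helper names are my own.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://uriburner.com/sparql"
PAGE_URI = (
    "http://www.amazon.com/"
    "Semantic-Web-Real-World-Applications-Industry/dp/0387485309"
)
QUERY = "SELECT * WHERE {?s ?p ?o} LIMIT 50"


def build_sparql_url(endpoint, page_uri, query):
    """Return a GET URL that runs `query` with `page_uri` as default graph."""
    params = {"default-graph-uri": page_uri, "query": query}
    return endpoint + "?" + urlencode(params)


def from_clause_query(page_uri, limit=50):
    """Same query, naming the HTML page in the FROM clause instead."""
    return f"SELECT * FROM <{page_uri}> WHERE {{?s ?p ?o}} LIMIT {limit}"


url = build_sparql_url(ENDPOINT, PAGE_URI, QUERY)
print(url)
print(from_clause_query(PAGE_URI))
```

Whether the endpoint actually sponges the page for such a request depends on
its configuration, so treat the result as something to try, not a guarantee.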
>
> There are spongers for BestBuy, eBay, Zillow, and many other types of
> resources.
>
> Wow!
>
> Congrats to Kingsley and his team!
>
> Best wishes
>
> Martin Hepp
>
> --
> --------------------------------------------------------------
> martin hepp
> e-business & web science research group
> universitaet der bundeswehr muenchen
>
> e-mail:  hepp@ebusiness-unibw.org
> phone:   +49-(0)89-6004-4217
> fax:     +49-(0)89-6004-4620
> www:     http://www.unibw.de/ebusiness/ (group)
>         http://www.heppnetz.de/ (personal)
> skype:   mfhepp
> twitter: mfhepp
>
> Check out GoodRelations for E-Commerce on the Web of Linked Data!
> =================================================================
>
> Webcast:
> http://www.heppnetz.de/projects/goodrelations/webcast/
>
> Recipe for Yahoo SearchMonkey:
> http://www.ebusiness-unibw.org/wiki/GoodRelations_and_Yahoo_SearchMonkey
>
> Talk at the Semantic Technology Conference 2009:
> "Semantic Web-based E-Commerce: The GoodRelations Ontology"
>
>
> http://www.slideshare.net/mhepp/semantic-webbased-ecommerce-the-goodrelations-ontology-1535287
>
> Overview article on Semantic Universe:
>
>
> http://www.semanticuniverse.com/articles-semantic-web-based-e-commerce-webmasters-get-ready.html
>
> Project page:
> http://purl.org/goodrelations/
>
> Resources for developers:
> http://www.ebusiness-unibw.org/wiki/GoodRelations
>
> Tutorial materials:
> CEC'09 2009 Tutorial: The Web of Data for E-Commerce: A Hands-on
> Introduction to the GoodRelations Ontology, RDFa, and Yahoo! SearchMonkey
>
>
> http://www.ebusiness-unibw.org/wiki/Web_of_Data_for_E-Commerce_Tutorial_IEEE_CEC%2709
>
Received on Sunday, 18 October 2009 14:34:31 UTC