- From: Martin Hepp (UniBW) <martin.hepp@ebusiness-unibw.org>
- Date: Sun, 18 Oct 2009 09:37:14 +0200
- To: giovanni.tummarello@deri.org
- CC: Juan Sequeda <juanfederico@gmail.com>, hepp@ebusiness-unibw.org, "public-lod@w3.org" <public-lod@w3.org>
- Message-ID: <4ADAC5AA.4050906@ebusiness-unibw.org>
Guys, the Web of Data cannot rely on mass crawling of the whole Web; it must combine cached data with federated on-demand queries.

Structured data requires much faster update cycles than typical text-based Web indices. For example, Google and Yahoo can rely on the fact that "http://www.cnn.com" is relevant for "news". That will not change within minutes, and both Google and Yahoo may need up to several weeks to visit your page again.

When it comes to structured price and availability information, your data may become outdated within hours, if not seconds. Think of eBay auctions, hotel or flight availability, etc.

So it will boil down to technology that combines

(1) crawling and caching rather stable data sets with
(2) distributing queries and parts of queries among the right SPARQL endpoints (whatever actual DB technology they expose).

You can keep a text index of the whole Web if crawling cycles on the order of weeks are fine. For structured, linked data that exposes dynamic database content, "dumb" crawling and caching will not scale.

If the DB technology is able to involve the right set of endpoints for parts of the query, why would you need a complete replication of all databases in the world inside one huge repository? That repository would be a million-node cluster anyway. Why not directly use the millions of nodes that provide the data and cache just the endpoint meta-data?

Martin

Giovanni Tummarello wrote:
> With respect to crawling and "scraping" or "sponging" or .. "trying to
> guess" based on partial fragments of structured information, I can say
> 3 things:
>
> a) No, we're not doing it at the moment; we are only covering those
> who chose to put structured semantics. Some book stuff shows up in
> Sig.ma .. e.g. http://sig.ma/search?q=frank+van+harmelen&sources=100
> bookfinder, our jerome digital library installation, but the triples
> they provide are scarce and don't contribute much. It would take so
> little for this to improve on their side, I believe.
>
> b) No, we are not religious about this. We have talked about it
> several times; it might make sense to try to understand as much of the
> web as possible and index it. Maybe we'll do it in the future for
> selected fractions of the web to show how it looks.
>
> c) Crawling should be just one means of acquiring the semantic web. In
> the case of BestBuy or other large retailers, where prices possibly
> change every day, crawling as a means to emulate a simple.. call to a web
> service really does not seem the smart thing to do. Will data providers
> really support with data dumps?
>
> cheers
> Giovanni
>
> On Sat, Oct 17, 2009 at 3:32 PM, Juan Sequeda <juanfederico@gmail.com> wrote:
>
>> But Sindice could at least crawl Amazon.
>> It would be great to use sig.ma to create a "meshup" with the amazon data.
>>
>> Juan Sequeda, Ph.D Student
>> Dept. of Computer Sciences
>> The University of Texas at Austin
>> www.juansequeda.com
>> www.semanticwebaustin.org
>>
>> On Sat, Oct 17, 2009 at 9:28 AM, Martin Hepp (UniBW)
>> <hepp@ebusiness-unibw.org> wrote:
>>
>>> I don't think so, because this would require that Sindice crawled the
>>> whole regular web and checked the Spongers for each URL (sic!).
>>>
>>> Juan Sequeda wrote:
>>>
>>> Does Sindice crawl this (or any other semantic web search engines)?
>>> Juan Sequeda, Ph.D Student
>>> Dept. of Computer Sciences
>>> The University of Texas at Austin
>>> www.juansequeda.com
>>> www.semanticwebaustin.org
>>>
>>> On Sat, Oct 17, 2009 at 4:24 AM, Martin Hepp (UniBW) <
>>> hepp@ebusiness-unibw.org> wrote:
>>>
>>> Dear all:
>>>
>>> I just found out that the Virtuoso Sponger technology is even more
>>> powerful than I thought.
>>>
>>> Briefly: "Spongers" create rich GoodRelations (and other RDF) meta-data
>>> for existing Web pages on-the-fly. Unlike traditional
>>> screen-scraping approaches, Spongers reuse public APIs and other
>>> techniques, so the data has an unprecedented degree of structure.
>>>
>>> Now, this can be directly used in arbitrary queries... by simply using
>>> the URI of the *existing* HTML Web page in the FROM clause of a SPARQL
>>> query.
>>>
>>> Example:
>>>
>>> http://www.amazon.com/Semantic-Web-Real-World-Applications-Industry/dp/0387485309
>>>
>>> is a Web page in plain HTML offering a book. Amazon does not yet produce
>>> GoodRelations meta-data on their pages.
>>>
>>> If you go to
>>>
>>> http://uriburner.com/sparql
>>>
>>> and paste the URI in the "Default Graph URI" field and select "Retrieve
>>> remote RDF for all missing source graphs", then a query like
>>>
>>> "SELECT * WHERE {?s ?p ?o} LIMIT 50"
>>>
>>> returns a fully-fledged GoodRelations description for that page - as if
>>> Amazon was already supporting GoodRelations for each of its > 4 million
>>> items!
>>>
>>> There are spongers for BestBuy, eBay, Zillow, and many other types of
>>> resources.
>>>
>>> Wow!
>>>
>>> Congrats to Kingsley and his team!
>>>
>>> Best wishes
>>>
>>> Martin Hepp
>>>
>>> --
>>> --------------------------------------------------------------
>>> martin hepp
>>> e-business & web science research group
>>> universitaet der bundeswehr muenchen
>>>
>>> e-mail: hepp@ebusiness-unibw.org
>>> phone: +49-(0)89-6004-4217
>>> fax: +49-(0)89-6004-4620
>>> www: http://www.unibw.de/ebusiness/ (group)
>>> http://www.heppnetz.de/ (personal)
>>> skype: mfhepp
>>> twitter: mfhepp
>>>
>>> Check out GoodRelations for E-Commerce on the Web of Linked Data!
>>> =================================================================
>>>
>>> Webcast:
>>> http://www.heppnetz.de/projects/goodrelations/webcast/
>>>
>>> Recipe for Yahoo SearchMonkey:
>>> http://www.ebusiness-unibw.org/wiki/GoodRelations_and_Yahoo_SearchMonkey
>>>
>>> Talk at the Semantic Technology Conference 2009:
>>> "Semantic Web-based E-Commerce: The GoodRelations Ontology"
>>> http://www.slideshare.net/mhepp/semantic-webbased-ecommerce-the-goodrelations-ontology-1535287
>>>
>>> Overview article on Semantic Universe:
>>> http://www.semanticuniverse.com/articles-semantic-web-based-e-commerce-webmasters-get-ready.html
>>>
>>> Project page:
>>> http://purl.org/goodrelations/
>>>
>>> Resources for developers:
>>> http://www.ebusiness-unibw.org/wiki/GoodRelations
>>>
>>> Tutorial materials:
>>> CEC'09 2009 Tutorial: The Web of Data for E-Commerce: A Hands-on
>>> Introduction to the GoodRelations Ontology, RDFa, and Yahoo! SearchMonkey
>>> http://www.ebusiness-unibw.org/wiki/Web_of_Data_for_E-Commerce_Tutorial_IEEE_CEC%2709

--
--------------------------------------------------------------
martin hepp
e-business & web science research group
universitaet der bundeswehr muenchen

e-mail: hepp@ebusiness-unibw.org
phone: +49-(0)89-6004-4217
fax: +49-(0)89-6004-4620
www: http://www.unibw.de/ebusiness/ (group)
http://www.heppnetz.de/ (personal)
skype: mfhepp
twitter: mfhepp

Check out GoodRelations for E-Commerce on the Web of Linked Data!
=================================================================

Webcast:
http://www.heppnetz.de/projects/goodrelations/webcast/

Recipe for Yahoo SearchMonkey:
http://www.ebusiness-unibw.org/wiki/GoodRelations_and_Yahoo_SearchMonkey

Talk at the Semantic Technology Conference 2009:
"Semantic Web-based E-Commerce: The GoodRelations Ontology"
http://www.slideshare.net/mhepp/semantic-webbased-ecommerce-the-goodrelations-ontology-1535287

Overview article on Semantic Universe:
http://www.semanticuniverse.com/articles-semantic-web-based-e-commerce-webmasters-get-ready.html

Project page:
http://purl.org/goodrelations/

Resources for developers:
http://www.ebusiness-unibw.org/wiki/GoodRelations

Tutorial materials:
CEC'09 2009 Tutorial: The Web of Data for E-Commerce: A Hands-on
Introduction to the GoodRelations Ontology, RDFa, and Yahoo! SearchMonkey
http://www.ebusiness-unibw.org/wiki/Web_of_Data_for_E-Commerce_Tutorial_IEEE_CEC%2709
Received on Sunday, 18 October 2009 07:37:48 UTC