Re: ANN: BestBuy.com starts publishing full catalog as RDF/XML using GoodRelations - 27 million triples

Hi Jay, thanks for the quick response.

First, let me say that I am very enthusiastic about this :-) so thanks
for the efforts; we'll do our best to help make them
successful.

There are some issues with the current semantic sitemap:

If the goal is to list URLs that each give the details of a specific
object, then a semantic sitemap should not be used and a normal sitemap
used instead (simply following the normal specification). Semantic
Sitemaps enable a more powerful use case, where a data dump or a
SPARQL endpoint is given. With a dump, the data can be
collected even every day (as opposed to crawling the whole list, which
takes days) and split "server side".
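
To illustrate the dump-based use case, here is a minimal Python sketch
of how a consumer might read the dump locations out of a semantic
sitemap. It assumes the sc: namespace URI declared in the sitemap files
quoted further down this thread; the function name dump_locations and
the example.org URLs are hypothetical and only serve the illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the Semantic Sitemap extension, as declared in the
# sitemap files quoted later in this thread.
SC_NS = "http://sw.deri.org/2007/07/sitemapextension/scschema.xsd"


def dump_locations(sitemap_xml):
    """Return every sc:dataDumpLocation URL found in a semantic sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [
        el.text.strip()
        for el in root.iter("{%s}dataDumpLocation" % SC_NS)
        if el.text and el.text.strip()
    ]


# Hypothetical sitemap that points at a single dump file instead of
# listing one URL per product. The example.org URLs are made up.
EXAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">
  <sc:dataset>
    <sc:datasetURI>http://example.org/</sc:datasetURI>
    <sc:dataDumpLocation>http://example.org/dumps/catalog.rdf.tar.gz</sc:dataDumpLocation>
  </sc:dataset>
</urlset>"""

if __name__ == "__main__":
    for url in dump_locations(EXAMPLE):
        print(url)
```

A consumer that finds one dump URL this way can fetch a single archive
per day rather than requesting every product document in turn.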

Would you have such a dump to provide? If not, please just revert to the
normal sitemap, no further questions asked :-)

If daily updates are important, then please consider changing the
sitemaps to something very simple where you simply give the
dumps.

Ideally you could give them both as rdf(xml).tar.gz and as N-Quads
(simple indeed: just a file where each line corresponds to a "quad").
That would help us support linked data clients which want to find
documents to fetch, but a single RDF file is fine.
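
As a rough illustration of the "one quad per line" idea, the Python
sketch below writes a single statement as an N-Quads line. It assumes
that the fourth element names the document the triple came from, which
is what lets a client find the document to fetch. The helper names
nquad_term and to_nquad and the example.org URIs are hypothetical, and
only basic literal escaping is handled (no language tags or datatypes).

```python
def nquad_term(value, is_literal=False):
    """Serialize one term: IRIs in angle brackets, literals in quotes."""
    if is_literal:
        escaped = (
            value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        return '"%s"' % escaped
    return "<%s>" % value


def to_nquad(subject, predicate, obj, graph, literal_object=False):
    """Build one line: subject, predicate, object, graph, then ' .'"""
    return " ".join([
        nquad_term(subject),
        nquad_term(predicate),
        nquad_term(obj, is_literal=literal_object),
        nquad_term(graph),
        ".",
    ])


if __name__ == "__main__":
    # Made-up product URI and source document, for illustration only.
    print(to_nquad(
        "http://example.org/products/1#item",
        "http://www.w3.org/2000/01/rdf-schema#label",
        "Example product",
        "http://example.org/products/1/semanticweb.rdf",
        literal_object=True,
    ))
```

Because every line is self-contained, a large dump in this form can be
split or filtered by source document with ordinary line-based tools.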

Thanks again for your efforts.

Giovanni

P.S. Just to test anyway, we have hacked together a crawl for now. :-)


2009/9/1 Myers, Jay <Jay.Myers@bestbuy.com>:
> All,
>
>
>
> Thanks for the insight. As far as the sitemap is concerned, I used the current sitemap protocol (http://www.sitemaps.org/schemas/sitemap/0.9). Since we are publishing around 452K documents, it seemed like the correct route to use sitemap index files, as one file would certainly contain over 50,000 URIs and be over 10MB. I’m not aware of another method in which to publish this amount of data in a sitemap :-)
>
>
>
> At this point, we have no SPARQL endpoint; we are simply publishing product data out via RDF. I am hoping that this effort will be noticed by senior leadership, convincing them to sponsor a greater, more complete effort that could serve as a model for big business. Any suggestions on this would be welcome.
>
>
>
> Thanks,
>
>
>
> Jay
>
>
>
> Jay Myers
>
> Lead Web Development Engineer
>
> Online Solutions, BestBuy.com
>
> jay.myers@bestbuy.com
>
> (w) 612-291-4007
>
> (c) 612-296-5836
>
> (twitter) @jaymyers
>
> (skype) jaymmyers
>
>
>
>
>
> ________________________________
>
> From: Martin Hepp (UniBW) [mailto:martin.hepp@ebusiness-unibw.org]
> Sent: Tuesday, September 01, 2009 8:14 AM
> To: giovanni.tummarello@deri.org
> Cc: public-lod@w3.org
> Subject: Re: ANN: BestBuy.com starts publishing full catalog as RDF/XML using GoodRelations - 27 million triples
>
>
>
> Hi Giovanni:
>
> Giovanni Tummarello wrote:
>
> Hi Martin, all,
>
>
>
>  the sitemap exposed is not a Semantic Sitemap
>
>
>
> Semantic Sitemap: http://products.semweb.bestbuy.com/sitemap.xml
>
>
>
> but simply gives the location of the dumps.
>
>
>
>
>
> As far as I see, the sitemap at
>
> http://products.semweb.bestbuy.com/sitemap.xml
>
> gives the locations of the compressed semantic sitemaps:
>
>
> <?xml version="1.0" encoding="UTF-8"?>
> <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
>     <sitemap>
>         <loc>http://products.semweb.bestbuy.com/sitemap1.xml.gz</loc>
>         <lastmod>2009-07-31T18:23:17+00:00</lastmod>
>     </sitemap>
>
>
> Each one of those seems to be a proper semantic sitemap
> E.g.
>
> http://products.semweb.bestbuy.com/sitemap1.xml.gz
>
> -->
>
> <?xml version="1.0" encoding="UTF-8"?>
> <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">
>     <sc:dataset>
>         <sc:datasetLabel>Sitemap data for Best Buy Co., Inc., products. Data based on http://purl.org/goodrelations/</sc:datasetLabel>
>         <sc:datasetURI>http://products.semweb.bestbuy.com/</sc:datasetURI>
>         <sc:linkedDataPrefix slicing="subject-object">http://products.semweb.bestbuy.com/</sc:linkedDataPrefix>
>         <sc:sampleURI>http://products.semweb.bestbuy.com/products/9380001/semanticweb.rdf</sc:sampleURI>
>         <sc:dataDumpLocation>http://products.semweb.bestbuy.com/products/43900/semanticweb.rdf</sc:dataDumpLocation>
>         <sc:dataDumpLocation>http://products.semweb.bestbuy.com/products/48521/semanticweb.rdf</sc:dataDumpLocation>
>         <sc:dataDumpLocation>http://products.semweb.bestbuy.com/products/48530/semanticweb.rdf</sc:dataDumpLocation>
>         <sc:dataDumpLocation>http://products.semweb.bestbuy.com/products/54256/semanticweb.rdf</sc:dataDumpLocation>
>
>
>
>
> in theory if this information is exposed as linked data then one would
>
> like to have a semantic sitemap exposed,
>
> As said - I understand BestBuy is using the main sitemap to bundle the individual semantic sitemaps. Note that they are dealing with 450,000 documents. A single sitemap file would be pretty large.
>
>
> which includes other details
>
> e.g. a sparql endpoint some information on the datasets etc. [1]
>
>
>
>
>
> There is, to my knowledge, no SPARQL endpoint offered by BestBuy.com, but you can soon simply use the Linked Open Commerce dataspace at
>
> http://loc.openlinksw.com/sparql
>
> This will contain a current copy of the bestbuy graphs.
>
> has this been considered and decided against?
>
> As far as I know, the combination of a sitemap and 23 semantic sitemaps was a pragmatic decision. If it causes major problems, Jay Myers from BestBuy will surely be open to suggestions for improvements.
>
> should we just live with
>
> it and fit sindice to do some guesswork and process those instead? (i
>
> am not necessarily against this last solution really.. )
>
>
>
> You simply have to fetch and un-gzip the 23 semantic sitemaps at
>
> http://products.semweb.bestbuy.com/sitemap<n>.xml.gz
>
> with <n> being a number from 1 to 23.
>
> Note that
>
> http://products.semweb.bestbuy.com/sitemap5.xml.gz
>
> seems to have a syntactical problem (fix is already requested).
>
>
>
> In other words are you suggesting the use of semantic sitemaps
>
> We usually recommend using semantic sitemaps. But actually I think that a consolidated dataspace like the LOC will become more important in the future, because it creates too much overhead for each agent and application to crawl and consolidate the whole Web of Linked Data on its own.
>
>
> or
>
> should we just come to terms with this? The disadvantage is that a linked
>
> data browser that wants to use an index to find information will be
>
> able to do so less reliably (hope that our guesswork works)
>
>
>
> As said - I understand (without a thorough analysis, though) that BestBuy's usage of a single sitemap and multiple semantic sitemaps is okay.
>
>
>
> Giovanni
>
>
>
> [1] http://sw.deri.org/2007/07/sitemapextension/
>
>
>
> On Mon, Aug 31, 2009 at 8:08 PM, Martin Hepp
>
> (UniBW)<martin.hepp@ebusiness-unibw.org> wrote:
>
>
>
> Dear all:
>
>
>
> BestBuy.com has just started to serve a complete RDF/XML dump of their
>
> products and price information to the Web of Linked Data, using the
>
> GoodRelations vocabulary for e-commerce. The data dump is updated on a
>
> daily basis and contains detailed descriptions for roughly 450,000
>
> individual items. With about 60 triples per item, this totals to about
>
> 27 million RDF triples.
>
>
>
> Semantic Sitemap: http://products.semweb.bestbuy.com/sitemap.xml
>
>
>
> Examples:
>
> a) Software:
>
> http://products.semweb.bestbuy.com/products/8182593/semanticweb.rdf
>
>
>
> b) "Hardgoods":
>
> http://products.semweb.bestbuy.com/products/8794691/semanticweb.rdf
>
>
>
> c) Movies:
>
> http://products.semweb.bestbuy.com/products/7590289/semanticweb.rdf
>
>
>
> d) Games:
>
> http://products.semweb.bestbuy.com/products/9223752/semanticweb.rdf
>
>
>
> Unlike many existing large RDF transcripts, the data is very dynamic,
>
> holding the daily prices for all items.
>
> According to Wikipedia, BestBuy.com is the largest specialty retailer of
>
> consumer electronics in the United States accounting for 19% of the market.
>
>
>
> It is likely the first Fortune 500 company to start publishing offer
>
> details on the Web of Linked Data.
>
>
>
> Congratulations to Jay Myers from BestBuy.com for this excellent
>
> contribution, and a big thanks to Andreas Radinger and Alex Stolz for
>
> their support,
>
>
>
> Best wishes
>
>
>
> Martin Hepp
>
>
>
> --
>
> --------------------------------------------------------------
>
> martin hepp
>
> e-business & web science research group
>
> universitaet der bundeswehr muenchen
>
>
>
> e-mail:  mhepp@computer.org
>
> phone:   +49-(0)89-6004-4217
>
> fax:     +49-(0)89-6004-4620
>
> www:     http://www.unibw.de/ebusiness/ (group)
>
>         http://www.heppnetz.de/ (personal)
>
> skype:   mfhepp
>
> twitter: mfhepp
>
>
>
> Check out GoodRelations for E-Commerce on the Web of Linked Data!
>
> =================================================================
>
>
>
> Webcast:
>
> http://www.heppnetz.de/projects/goodrelations/webcast/
>
>
>
> Recipe for Yahoo SearchMonkey:
>
> http://tr.im/rAbN
>
>
>
> Talk at the Semantic Technology Conference 2009:
>
> "Semantic Web-based E-Commerce: The GoodRelations Ontology"
>
> http://tinyurl.com/semtech-hepp
>
>
>
> Overview article on Semantic Universe:
>
> http://tinyurl.com/goodrelations-universe
>
>
>
> Project page:
>
> http://purl.org/goodrelations/
>
>
>
> Resources for developers:
>
> http://www.ebusiness-unibw.org/wiki/GoodRelations
>
>
>
> Tutorial materials:
>
> CEC'09 2009 Tutorial: The Web of Data for E-Commerce: A Hands-on
>
> Introduction to the GoodRelations Ontology, RDFa, and Yahoo! SearchMonkey
>
> http://tr.im/grcec09
>

Received on Tuesday, 1 September 2009 23:14:05 UTC