Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Microformat data extracted from 65.4 million websites

Hi Chris,

Thanks for your e-mail. 

> we clearly say on the WebDataCommons website as well as in the announcement
> that we are extracting data from 1.4 billion web pages only. 
> 
> The Web is obviously much larger. Thus it is also obvious that we don't have
> all data in our dataset.

The issue is not that you are using a subset of the Web, but that this subset is likely an unsuitable sample of the population for many of the conclusions you derive, in particular those about the Web of data.

> I agree with you that a crawler that would especially look for data would
> use a different crawling strategy.

I (and likely many others) understood from your marketing and your slides that you were actually looking for data. The core of my comments regarding webdatacommons.org was that the approach has a fundamental problem in reaching the data, because the underlying CommonCrawl corpus filters pages by Pagerank, which is inappropriate for this purpose.

As for providing seed URLs: The problem is that many sites have data markup ONLY in the deep pages, so if those pages are not included in your data, you will not even know whether it pays off to crawl a particular site.
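
A toy calculation may make the dilution effect concrete. The following is only a minimal sketch under stated assumptions, not CommonCrawl's actual ranking code: the shop graph (one external site linking to the home page, 5 category pages, 200 item pages each), the helper names pagerank and build_shop, and the cutoff of 1% of the home page's score are all made up for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Plain power-iteration PageRank over a dict {page: [outgoing links]}."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                # A page splits its score evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


def build_shop(num_categories=5, items_per_category=200):
    """Hypothetical shop: home -> categories -> items, one external inlink."""
    links = {"external": ["home"], "home": []}
    for c in range(num_categories):
        category = f"cat{c}"
        links["home"].append(category)
        links[category] = ["home"]
        for i in range(items_per_category):
            item = f"cat{c}/item{i}"
            links[category].append(item)
            # Item pages link back to the home page and their category only.
            links[item] = ["home", category]
    return links


links = build_shop()
rank = pagerank(links)

# Assumed filter: keep pages scoring at least 1% of the home page's score.
cutoff = 0.01 * rank["home"]
kept = sorted(page for page, score in rank.items() if score >= cutoff)

print("home     :", rank["home"])
print("category :", rank["cat0"])
print("item     :", rank["cat0/item0"])
print("kept after filtering:", kept)
```

In this toy graph the only pages that survive the cutoff are the home page and the category pages; all item pages, which are the ones that would carry the markup, are dropped.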

> Thus if you don't like the CommonCrawl crawling strategy, you are highly
> invited to change the ranking algorithm in any way you like, dig deeper into
> the websites that we identified and publish the resulting data. 

I have clearly articulated that I think both CommonCrawl and WebDataCommons are in principle nice pieces of work.
The only thing I did not like is that you do not discuss the limitations of your analysis, either in the paper or on the slides. As a result, many people draw the wrong conclusions from your findings, or even invest time and money in building directly on that data, which cannot work.

> This would be a really useful service to the community in addition to
> criticizing other people's work.

Criticizing other people's work is the daily business of scientific advancement and, while perhaps unpleasant for the recipient, is indeed a useful service to the community. But I think you know that.

Martin


On Apr 18, 2012, at 12:11 AM, Chris Bizer wrote:

> Hi Martin,
> 
> we clearly say on the WebDataCommons website as well as in the announcement
> that we are extracting data from 1.4 billion web pages only. 
> 
> The Web is obviously much larger. Thus it is also obvious that we don't have
> all data in our dataset.
> 
> See http://lists.w3.org/Archives/Public/public-vocabs/2012Mar/0093.html for
> the original announcement.
> 
> Quote from the announcement:
> 
> "We hope that Web Data Commons will be useful to the community by:
> 
> + easing the access to Microdata, Microformat and RDFa data, as you do not
> need to crawl the Web yourself anymore in order to get access to a fair
> portion of the structured data that is currently available on the Web.
> 
> + laying the foundation for the more detailed analysis of the deployment of
> the different technologies.
> 
> + providing seed URLs for focused Web crawls that dig deeper into the
> websites that offer a specific type of data."
> 
> Please notice the words "fair portion", "more detailed analysis" and "seed
> URLs for focused Web crawls".
> 
> I agree with you that a crawler that would especially look for data would
> use a different crawling strategy.
> 
> The source code of the CommonCrawl crawler as well as the WebDataCommons
> extraction code is available online under open licenses.
> 
> Thus if you don't like the CommonCrawl crawling strategy, you are highly
> invited to change the ranking algorithm in any way you like, dig deeper into
> the websites that we identified and publish the resulting data. 
> 
> This would be a really useful service to the community in addition to
> criticizing other people's work.
> 
> Cheers,
> 
> Chris
> 
> 
> -----Original Message-----
> From: Martin Hepp [mailto:martin.hepp@unibw.de]
> Sent: Tuesday, 17 April 2012 15:26
> To: public-vocabs@w3.org Vocabularies; public-lod@w3.org; Chris Bizer
> Subject: Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current
> RDFa, Microdata and Microformat data extracted from 65.4 million websites
> 
> Dear Chris, all,
> 
> while reading the paper [1] I think I found a possible explanation why
> WebDataCommons.org does not fulfill the high expectations regarding the
> completeness and coverage.
> 
> It seems that CommonCrawl filters pages by Pagerank in order to determine
> the feasible subset of URIs for the crawl. While this may be okay for a
> generic Web crawl, for linguistics purposes, or for training
> machine-learning components, it is a dead end if you want to extract
> structured data, since the interesting markup typically resides in the *deep
> links* of dynamic Web applications, e.g. the product item pages in shops,
> the individual event pages in ticket systems, etc.
> 
> Those pages often have a very low Pagerank, even when they are part of very
> prestigious Web sites with a high Pagerank for the main landing page.
> 
> Example:
> 
> 1. Main page: 	http://www.wayfair.com/ 
> --> Pagerank 5 of 10
> 
> 2. Category page:	http://www.wayfair.com/Lighting-C77859.html
> --> Pagerank 3 of 10
> 
> 3. Item page:
> http://www.wayfair.com/Golden-Lighting-Cerchi-Flush-Mount-in-Chrome-1030-FM-CH-GNL1849.html
> --> Pagerank 0 of 10
> 
> Now, the RDFa on this site is in the 2 million item pages only. Filtering
> out the deep links in the original crawl means you are removing the HTML
> that contains the actual data.
> 
> In your paper [1], you downplay that limitation by saying that this
> approach yielded "snapshots of the popular part of the web." I think
> "popular" is very misleading here because Pagerank does not work very
> well for the "deep" Web: those pages typically lack external links almost
> completely, and due to their huge number per site, they earn only a
> minimal Pagerank from their main site, which provides the link or links.
> 
> So, once again, I think your approach is NOT suitable for yielding a corpus
> of usable data at Web scale, and the statistics you derive are likely very
> much skewed, because you look only at landing pages and popular overview
> pages of sites, while the real data is in HTML pages not contained in the
> basic crawl.
> 
> Please interpret your findings in the light of these limitations. I am
> saying this so strongly because I have already seen many tweets praising
> the paper as "now we have the definitive statistics on structured data on
> the Web".
> 
> 
> Best wishes
> 
> Martin
> 
> Note: For estimating the Pagerank in this example, I used the online-service
> [2], which may provide only an approximation.
> 
> 
> [1] http://events.linkeddata.org/ldow2012/papers/ldow2012-inv-paper-2.pdf
> 
> [2] http://www.prchecker.info/check_page_rank.php
> 
> --------------------------------------------------------
> martin hepp
> e-business & web science research group
> universitaet der bundeswehr muenchen
> 
> e-mail:  hepp@ebusiness-unibw.org
> phone:   +49-(0)89-6004-4217
> fax:     +49-(0)89-6004-4620
> www:     http://www.unibw.de/ebusiness/ (group)
>         http://www.heppnetz.de/ (personal)
> skype:   mfhepp 
> twitter: mfhepp
> 
> Check out GoodRelations for E-Commerce on the Web of Linked Data!
> =================================================================
> * Project Main Page: http://purl.org/goodrelations/
> 
> 

Received on Tuesday, 17 April 2012 22:46:19 UTC