Re: ANN: WebDataCommons.org - Offering 3.2 billion quads of current RDFa, Microdata and Microformat data extracted from 65.4 million websites

Hi Martin and Peter,

cc'ing Ahad and Lisa from CommonCrawl.

> Hi Chris,
>
> Thanks for your e-mail. 
>
>> we clearly say on the WebDataCommons website as well as in the 
>> announcement that we are extracting data from 1.4 billion web pages only.
>> 
>> The Web is obviously much larger. Thus it is also obvious that we 
>> don't have all data in our dataset.
>
> It's not about the fact that you are using a subset of the Web, but that
> that subset is likely an unsuited sample from the population for many of
> the conclusions you derive, in particular speaking about the data Web.

Drawing conclusions from a sample is of course always questionable, and it
would obviously be better if there were a public 10 billion or 50 billion
page crawl available that we could analyze. But up to now, such a crawl does
not exist. Thus, analyzing what we have is as good as we can currently get
based on publicly accessible corpora.

In order to have a second source of evidence, I asked Peter to derive
statistics from (a subset of?) the Yahoo!/Bing crawl, and he was kind enough
to also provide these statistics for LDOW:

http://events.linkeddata.org/ldow2012/papers/ldow2012-inv-paper-1.pdf

His sample is bigger (3.4 billion pages gathered using a different crawling
strategy), and you can clearly see from his results that the crawling
strategy strongly influences the outcome.

So far, Peter's statistics do not contain counts for specific classes.
Having them and comparing them to the WebDataCommons statistics would of
course be very interesting.

Peter: Do you see any chance that you could still generate instance-per-class
counts once you are back from WWW2012?
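
Just to make concrete what I mean by instance-per-class counts: something
along the lines of the rough Python sketch below, which simply counts
rdf:type statements per class in a gzipped N-Quads dump. The file name and
the simplistic line pattern are of course just placeholders, and at the
scale of the full corpora one would use a proper N-Quads parser on Hadoop
rather than a single-machine script.

import gzip
import re
from collections import Counter

# Very rough N-Quads line pattern: subject, predicate, object, context.
# Literals containing whitespace would need a real N-Quads parser.
NQUAD = re.compile(r'^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*\.\s*$')
RDF_TYPE = '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>'

def class_counts(path):
    # Count how many typed instances each class has, counting each
    # (subject, class, context) combination only once.
    counts = Counter()
    seen = set()
    with gzip.open(path, 'rt', encoding='utf-8', errors='replace') as f:
        for line in f:
            m = NQUAD.match(line)
            if not m:
                continue
            s, p, o, c = m.groups()
            if p == RDF_TYPE and (s, o, c) not in seen:
                seen.add((s, o, c))
                counts[o] += 1
    return counts

if __name__ == '__main__':
    # 'sample.nq.gz' is just a placeholder file name.
    for cls, n in class_counts('sample.nq.gz').most_common(20):
        print(n, cls)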

>> I agree with you that a crawler that would especially look for data 
>> would use a different crawling strategy.
>
> I (and likely many others) understood from your marketing and your slides
> that you were actually looking for data, and the core of my comments
> regarding webdatacommons.org was that the approach taken has a fundamental
> problem of reaching the data due to the inappropriate filter by PageRank
> in the underlying CommonCrawl corpus.
>
> As for providing seed URLs: The problem is that many sites will have data
> markup ONLY in the deep pages, so if they are not included in your data,
> you will not even know whether it pays off to try a particular site.

As far as I understood from an earlier email from Ahad, PageRank is not the
only factor that the CC crawler uses to decide how deep to dig into a
specific website.

Ahad and Lisa: There is currently a discussion on some Semantic Web mailing
lists about which pages are likely to be included in the CommonCrawl corpus.
See:
http://lists.w3.org/Archives/Public/public-lod/2012Apr/thread.html

In order to clear things up, would it be possible for you to give us some
more information about the CC crawling strategy and the factors that
determine how many pages are crawled per website?
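
To make the question more concrete: what we would like to understand is
essentially the kind of distribution a rough script like the Python sketch
below would produce, i.e. how many pages the crawler keeps per host and how
deep into the URL path structure it goes. The input file of crawled URLs is
just a placeholder; I am not assuming anything about how CC actually stores
its URL lists.

from collections import Counter, defaultdict
from urllib.parse import urlparse

def per_host_stats(url_file):
    # For each host: number of crawled pages and the deepest path level seen.
    pages = Counter()
    max_depth = defaultdict(int)
    with open(url_file, encoding='utf-8') as f:
        for line in f:
            url = line.strip()
            if not url:
                continue
            parsed = urlparse(url)
            host = parsed.netloc.lower()
            # Path depth = number of non-empty path segments,
            # e.g. /Lighting-C77859.html has depth 1.
            depth = len([seg for seg in parsed.path.split('/') if seg])
            pages[host] += 1
            max_depth[host] = max(max_depth[host], depth)
    return pages, max_depth

if __name__ == '__main__':
    # 'crawled_urls.txt' is a placeholder: one URL per line.
    pages, depth = per_host_stats('crawled_urls.txt')
    for host, n in pages.most_common(20):
        print(host, n, 'pages, max path depth', depth[host])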

>> Thus if you don't like the CommonCrawl crawling strategy, you are 
>> highly invited to change the ranking algorithm in any way you like, 
>> dig deeper into the websites that we identified and publish the
>> resulting data.
>
> I have clearly articulated that I think both CommonCrawl and
> WebDataCommons are in principle nice pieces of work.
>
> The only thing I did not like is that you do not discuss the limitations
> of your analysis, neither in the paper nor on the slides, which leads to
> many people drawing the wrong conclusions from your findings or even
> investing time and money into doing something directly on that data,
> which cannot work.

We will mention these limitations in future publications and presentations
of the WDC statistics.

>> This would be a really useful service to the community in addition to 
>> criticizing other people's work.
>
> Criticizing other people's work is the daily business of scientific
> advancement and, while maybe unpleasant to the recipient, indeed a useful
> service to the community. But I think you know that.

Sure, that is why I wrote "in addition to".

Cheers,

Chris


> Martin


On Apr 18, 2012, at 12:11 AM, Chris Bizer wrote:

> Hi Martin,
> 
> we clearly say on the WebDataCommons website as well as in the 
> announcement that we are extracting data from 1.4 billion web pages only.
> 
> The Web is obviously much larger. Thus it is also obvious that we 
> don't have all data in our dataset.
> 
> See http://lists.w3.org/Archives/Public/public-vocabs/2012Mar/0093.html
> for the original announcement.
> 
> Quote from the announcement:
> 
> "We hope that Web Data Commons will be useful to the community by:
> 
> + easing the access to Microdata, Microformat and RDFa data, as you do
> not need to crawl the Web yourself anymore in order to get access to a
> fair portion of the structured data that is currently available on the
> Web.
> 
> + laying the foundation for the more detailed analysis of the deployment
> of the different technologies.
> 
> + providing seed URLs for focused Web crawls that dig deeper into the
> websites that offer a specific type of data."
> 
> Please notice the words "fair portion", "more detailed analysis" and 
> "seed URLs for focused Web crawls".
> 
> I agree with you that a crawler that would especially look for data 
> would use a different crawling strategy.
> 
> The source code of the CommonCrawl crawler as well as the 
> WebDataCommons extraction code is available online under open licenses.
> 
> Thus if you don't like the CommonCrawl crawling strategy, you are 
> highly invited to change the ranking algorithm in any way you like, 
> dig deeper into the websites that we identified and publish the
> resulting data.
> 
> This would be a really useful service to the community in addition to 
> criticizing other people's work.
> 
> Cheers,
> 
> Chris
> 
> 
> -----Original Message-----
> From: Martin Hepp [mailto:martin.hepp@unibw.de]
> Sent: Tuesday, April 17, 2012 15:26
> To: public-vocabs@w3.org Vocabularies; public-lod@w3.org; Chris Bizer
> Subject: Re: ANN: WebDataCommons.org - Offering 3.2 billion quads of 
> current RDFa, Microdata and Microformat data extracted from 65.4 
> million websites
> 
> Dear Chris, all,
> 
> while reading the paper [1] I think I found a possible explanation why 
> WebDataCommons.org does not fulfill the high expectations regarding 
> the completeness and coverage.
> 
> It seems that CommonCrawl filters pages by PageRank in order to 
> determine the feasible subset of URIs for the crawl. While this may be 
> okay for a generic Web crawl, for linguistic purposes, or for 
> training machine-learning components, it is a dead end if you want to 
> extract structured data, since the interesting markup typically 
> resides in the *deep links* of dynamic Web applications, e.g. the 
> product item pages in shops, the individual event pages in ticket 
> systems, etc.
> 
> Those pages often have a very low PageRank, even when they are part of 
> very prestigious Web sites with a high PageRank for the main landing page.
> 
> Example:
> 
> 1. Main page:      http://www.wayfair.com/
> --> PageRank 5 of 10
> 
> 2. Category page:  http://www.wayfair.com/Lighting-C77859.html
> --> PageRank 3 of 10
> 
> 3. Item page:
> http://www.wayfair.com/Golden-Lighting-Cerchi-Flush-Mount-in-Chrome-1030-FM-CH-GNL1849.html
> --> PageRank 0 of 10
> 
> Now, the RDFa on this site is in the 2 million item pages only. 
> Filtering out the deep links in the original crawl means you are 
> removing the HTML that contains the actual data.
> 
> In your paper [1], you kind of downplay that limitation by saying that 
> this approach yielded "snapshots of the popular part of the web". I 
> think "popular" is very misleading here, because PageRank does not 
> work very well for the "deep" Web: those pages typically lack external 
> links almost completely, and due to their huge number per site, they 
> earn only a minimal PageRank from their main site, which provides the 
> link or links.
> 
> So, once again, I think your approach is NOT suitable for yielding a 
> corpus of usable data at Web scale, and the statistics you derive are 
> likely very much skewed, because you look only at landing pages and 
> popular overview pages of sites, while the real data is in HTML pages 
> not contained in the basic crawl.
> 
> Please interpret your findings in the light of these limitations. I 
> am saying this so strongly because I have already seen many tweets 
> hailing the paper as "now we have the definitive statistics on 
> structured data on the Web".
> 
> 
> Best wishes
> 
> Martin
> 
> Note: For estimating the PageRank in this example, I used the 
> online service [2], which may provide only an approximation.
> 
> 
> [1] 
> http://events.linkeddata.org/ldow2012/papers/ldow2012-inv-paper-2.pdf
> 
> [2] http://www.prchecker.info/check_page_rank.php
> 
> --------------------------------------------------------
> martin hepp
> e-business & web science research group
> universitaet der bundeswehr muenchen
> 
> e-mail:  hepp@ebusiness-unibw.org
> phone:   +49-(0)89-6004-4217
> fax:     +49-(0)89-6004-4620
> www:     http://www.unibw.de/ebusiness/ (group)
>         http://www.heppnetz.de/ (personal)
> skype:   mfhepp 
> twitter: mfhepp
> 
> Check out GoodRelations for E-Commerce on the Web of Linked Data!
> =================================================================
> * Project Main Page: http://purl.org/goodrelations/
> 
> 

