
Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Microformat data extracted from 65.4 million websites

From: Martin Hepp <martin.hepp@ebusiness-unibw.org>
Date: Mon, 26 Mar 2012 18:16:07 +0200
Cc: Chris Bizer <chris@bizer.de>, public-vocabs@w3.org
Message-Id: <D56F5E28-E54E-48EF-B16B-DDBE99CE9FD5@ebusiness-unibw.org>
To: Tom Morris <tfmorris@gmail.com>

My impression was that many people understand the CommonCrawl to be a "full" crawl, to the extent possible, and *not a random sample*.

Common Crawl's mission statement says it is a crawl of *the Web*, not a sample of the Web; to the extent possible, of course.

I am not claiming that the crawl *as a sample* is biased. I am just stressing that you cannot take the numbers and say anything in absolute terms, and I bet that quite a few people may be tempted to take the stats for stats of the Web, not stats from a sample whose representativeness is untested.

Of course, my list of links is strongly biased. But if I know, from a manually compiled list of unsystematically collected sites, that there are > 25 million entities of gr:Offering on the public Web, then claiming there are 25,000 in a crawl of 1.7 bn pages seems at least strange. Either the crawl contains just 0.1 % of the Web, or the RDFa vocabulary frequencies are not representative.

I do not think that most people take Common Crawl as a 0.1 % sample of the Web.
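
The ratio behind this argument can be checked with a few lines of arithmetic. The following is only a rough sketch using the two figures quoted in this thread (23,825 entities reported by the RDFa extractor, and ca. 25,000,000 on the listed sites); the function and variable names are made up for illustration, and the 25 million figure is treated as a lower bound, not a measured total.

```python
# Figures quoted in this thread; both are approximate.
entities_in_crawl = 23_825       # gr:Offering entities reported by the RDFa extractor
entities_known = 25_000_000      # lower bound from the manually compiled list of sites


def implied_coverage(found: int, known_lower_bound: int) -> float:
    """Share of the known entities that the crawl found.

    If the crawl were a representative sample, this would also be an
    upper bound on the share of the Web that the crawl covers.
    """
    return found / known_lower_bound


coverage = implied_coverage(entities_in_crawl, entities_known)
factor = entities_known / entities_in_crawl

print(f"implied coverage: at most {coverage:.4%}")
print(f"known entities exceed crawl count by a factor of about {factor:.0f}")
```

The result says nothing about which of the two explanations holds; it only quantifies the gap that has to be explained.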

So one can do many useful things with the data, and I already thanked everybody involved for it, but one should not take it as "all the structured data of the Web".


On Mar 26, 2012, at 5:44 PM, Tom Morris wrote:

> On Mon, Mar 26, 2012 at 10:11 AM, Martin Hepp
> <martin.hepp@ebusiness-unibw.org> wrote:
>> Dear Chris, all:
>> Thanks for your hard work on this, and it surely gives some input for further research.
>> I want to stress, though, that the absolute numbers found are NOT representative for the Web as a whole, likely because of the limited coverage of the CommonCrawl corpus.
>> Just a simple example: The RDFa extractor details page
>> http://s3.amazonaws.com/webdatacommons-2/stats/top_classes_for_extractor_html-rdfa.html
>> says the extractors found 23,825 Entities of gr:Offering.
>> In comparison, the few sites listed at
>> http://wiki.goodrelations-vocabulary.org/Datasets
>> alone account for ca. 25,000,000 entities of that type, so obviously 1000 times more.
>> I am not criticizing your valuable work, I only want to prevent people from drawing incorrect conclusions from the preliminary data, because the crawl does not seem to be a *representative* or anywhere near *complete* corpus. So do not take the numbers for truth without any appropriate corrections for bias in the sample.
> Just to be clear, are you claiming that that wiki listing of
> specifically targeted data sets is *more representative* of the
> totality of the web?
> I know that CommonCrawl never claimed to be a *complete* corpus and
> certainly all samples have bias, but, if anything, I'd thought that a
> targeted curated list would have *more bias* than a (semi-?)random
> automated web crawl.
> It seems strange to criticize a sample by saying that it didn't find
> all of the entities of a certain type.  One would never expect a
> sample to do that.  If it found .1% of the GR entities while sampling
> .2% of the web, then there's a 50% undercount in the biased sample,
> but saying it missed 99.9% of the entities ignores the nature of
> sampling.
> I'd certainly be interested in numbers from the entire web (or a
> completely unbiased sample), so if you've got those, feel free to
> share.
> Tom
> p.s. a hearty thanks to Chris' team for doing the work and sharing the
> data.  Some data beats no data every time.
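
Tom's undercount example in the quoted text (.1% of the GR entities found while sampling .2% of the Web) can be written out as a small calculation. This is a sketch only: the function name is hypothetical, the two percentages are the illustrative ones from his message and not measured values, and it assumes entities would be spread evenly across pages in an unbiased sample.

```python
def undercount(entity_share_found: float, web_share_sampled: float) -> float:
    """Fraction of the expected entities that are missing from the sample.

    An unbiased sample covering `web_share_sampled` of the Web would be
    expected to contain the same share of all entities; anything below
    that is an undercount.
    """
    return 1.0 - entity_share_found / web_share_sampled


# Tom's illustrative numbers: 0.1% of entities found in a 0.2% sample.
result = undercount(0.001, 0.002)
print(f"undercount relative to an unbiased sample: {result:.0%}")
```

With these inputs the undercount is 50%, which is the figure Tom gives.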

martin hepp
e-business & web science research group
universitaet der bundeswehr muenchen

e-mail:  hepp@ebusiness-unibw.org
phone:   +49-(0)89-6004-4217
fax:     +49-(0)89-6004-4620
www:     http://www.unibw.de/ebusiness/ (group)
         http://www.heppnetz.de/ (personal)
skype:   mfhepp 
twitter: mfhepp

Check out GoodRelations for E-Commerce on the Web of Linked Data!
* Project Main Page: http://purl.org/goodrelations/
Received on Monday, 26 March 2012 16:16:39 UTC
