
Re: ANN: WebDataCommons.org - Offering 3.2 billion quads of current RDFa, Microdata and Microformat data extracted from 65.4 million websites

From: Tom Morris <tfmorris@gmail.com>
Date: Mon, 26 Mar 2012 11:44:48 -0400
Message-ID: <CAE9vqEHh-7=5D7ozoOev957t9vB=qRAMC85LJJKku7O11X+1Tg@mail.gmail.com>
To: Martin Hepp <martin.hepp@ebusiness-unibw.org>
Cc: Chris Bizer <chris@bizer.de>, public-vocabs@w3.org
On Mon, Mar 26, 2012 at 10:11 AM, Martin Hepp
<martin.hepp@ebusiness-unibw.org> wrote:
> Dear Chris, all:
> Thanks for your hard work on this, and it surely gives some input for further research.
> I want to stress, though, that the absolute numbers found are NOT representative for the Web as a whole, likely because of the limited coverage of the CommonCrawl corpus.
> Just a simple example: The RDFa extractor details page
> http://s3.amazonaws.com/webdatacommons-2/stats/top_classes_for_extractor_html-rdfa.html
> says the extractors found 23,825 Entities of gr:Offering.
> In comparison, the few sites listed at
> http://wiki.goodrelations-vocabulary.org/Datasets
> alone account for ca. 25,000,000 entities of that type, so obviously 1000 times more.
> I am not criticizing your valuable work; I only want to prevent people from drawing incorrect conclusions from the preliminary data, because the crawl does not seem to be a *representative* or anywhere near *complete* corpus. So do not take the numbers at face value without appropriate corrections for bias in the sample.

Just to be clear, are you claiming that that wiki listing of
specifically targeted data sets is *more representative* of the
totality of the web?

I know that CommonCrawl never claimed to be a *complete* corpus and
certainly all samples have bias, but, if anything, I'd thought that a
targeted curated list would have *more bias* than a (semi-?)random
automated web crawl.

It seems strange to criticize a sample by saying that it didn't find
all of the entities of a certain type.  One would never expect a
sample to do that.  If it found 0.1% of the GR entities while sampling
0.2% of the web, then there's a 50% undercount in the biased sample,
but saying it missed 99.9% of the entities ignores the nature of
sampling.
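To make the arithmetic concrete, here is a minimal sketch of the argument above. The 0.1% / 0.2% figures are this email's own hypothetical example, and the 25 million entity count comes from the quoted GoodRelations wiki figure; none of these are measured crawl statistics.

```python
# Hypothetical illustration: how an undercount in a biased sample differs
# from "missing 99.9% of the entities".
total_entities = 25_000_000   # ca. gr:Offering entities on the wiki-listed sites
sample_fraction = 0.002       # assume the crawl covers 0.2% of the web
found_fraction = 0.001        # assume the crawl found 0.1% of the entities

expected = total_entities * sample_fraction   # what an unbiased sample would find
found = total_entities * found_fraction       # what was actually found
undercount = 1 - found / expected             # relative undercount due to bias

print(f"expected: {expected:.0f}, found: {found:.0f}, undercount: {undercount:.0%}")
# An unbiased 0.2% sample "misses" 99.8% of entities by construction;
# the meaningful figure is the 50% shortfall relative to expectation.
```

The point of the calculation: the right baseline for a sample is what an unbiased sample of the same size would find, not the total population.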

I'd certainly be interested in numbers from the entire web (or a
completely unbiased sample), so if you've got those, feel free to
share them.


p.s. a hearty thank you to Chris' team for doing the work and sharing the
data.  Some data beats no data every time.
Received on Monday, 26 March 2012 15:45:22 UTC
