Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Microformat data extracted from 65.4 million websites from Tom Morris on 2012-03-26 (public-vocabs@w3.org from March 2012)

From: Tom Morris <tfmorris@gmail.com>
Date: Mon, 26 Mar 2012 11:44:48 -0400
To: Martin Hepp <martin.hepp@ebusiness-unibw.org>
Cc: Chris Bizer <chris@bizer.de>, public-vocabs@w3.org
Message-ID: <CAE9vqEHh-7=5D7ozoOev957t9vB=qRAMC85LJJKku7O11X+1Tg@mail.gmail.com>

On Mon, Mar 26, 2012 at 10:11 AM, Martin Hepp
<martin.hepp@ebusiness-unibw.org> wrote:
> Dear Chris, all:
>
> Thanks for your hard work on this, and it surely gives some input for further research.
>
> I want to stress, though, that the absolute numbers found are NOT representative of the Web as a whole, likely because of the limited coverage of the CommonCrawl corpus.
>
> Just a simple example: The RDFa extractor details page
>
> http://s3.amazonaws.com/webdatacommons-2/stats/top_classes_for_extractor_html-rdfa.html
>
> says the extractors found 23,825 Entities of gr:Offering.
>
> In comparison, the few sites listed at
>
> http://wiki.goodrelations-vocabulary.org/Datasets
>
> alone account for ca. 25,000,000 entities of that type, so obviously 1000 times more.
>
> I am not criticizing your valuable work, I only want to prevent people from drawing incorrect conclusions from the preliminary data, because the crawl does not seem to be a *representative* or anywhere near *complete* corpus. So do not take the numbers for truth without any appropriate corrections for bias in the sample.
>

Just to be clear, are you claiming that that wiki listing of
specifically targeted data sets is *more representative* of the
totality of the web?

I know that CommonCrawl never claimed to be a *complete* corpus and
certainly all samples have bias, but, if anything, I'd have thought that a
targeted curated list would have *more bias* than a (semi-?)random
automated web crawl.

It seems strange to criticize a sample by saying that it didn't find
all of the entities of a certain type.  One would never expect a
sample to do that.  If it found 0.1% of the GR entities while sampling
0.2% of the web, then there's a 50% undercount in the biased sample,
but saying it missed 99.9% of the entities ignores the nature of
sampling.
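
To make that arithmetic concrete, here is a minimal sketch in Python.
The 0.1% and 0.2% figures are the hypothetical ones used above, not
measurements, and the function names are made up for illustration.
The entity counts are the two quoted earlier in this thread.

```python
def coverage_ratio(found_fraction, sampled_fraction):
    """Fraction of the entities an unbiased sample of this size
    would be expected to contain that were actually found."""
    if sampled_fraction <= 0:
        raise ValueError("sampled_fraction must be positive")
    return found_fraction / sampled_fraction


def undercount(found_fraction, sampled_fraction):
    """Shortfall relative to an unbiased sample of the same size."""
    return 1.0 - coverage_ratio(found_fraction, sampled_fraction)


# Hypothetical figures from the paragraph above (assumptions, not data).
found = 0.001    # sample contains 0.1% of all GR entities
sampled = 0.002  # sample covers 0.2% of the web

print(f"undercount vs. unbiased sample: {undercount(found, sampled):.0%}")
print(f"naive 'missed' figure:          {1.0 - found:.1%}")

# Counts quoted in this thread: entities found by the extractor vs. the
# approximate total attributed to the sites on the wiki list.
observed_fraction = 23_825 / 25_000_000
print(f"found / listed:                 {observed_fraction:.3%}")
```

The last figure only says what share of the listed entities turned up
in the crawl.  Without knowing what fraction of the web the crawl
covers, it says nothing about how biased the sample is.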

I'd certainly be interested in numbers from the entire web (or a
completely unbiased sample), so if you've got those, feel free to
share.

Tom

p.s. a hearty thanks to Chris' team for doing the work and sharing the
data.  Some data beats no data every time.

Received on Monday, 26 March 2012 15:45:22 UTC