RE: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Microformat data extracted from 65.4 million websites

Martin's points were:
a) the sampling procedure used to establish the corpus is not well defined;
b) the results seem to show some bias.

This is helpful feedback for WebDataCommons.

Some data beats no data only when you know the characteristics of the "some
data".

Jim

-----Original Message-----
From: Tom Morris [mailto:tfmorris@gmail.com] 
Sent: Monday, March 26, 2012 8:45 AM
To: Martin Hepp
Cc: Chris Bizer; public-vocabs@w3.org
Subject: Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current
RDFa, Microdata and Microformat data extracted from 65.4 million websites

On Mon, Mar 26, 2012 at 10:11 AM, Martin Hepp
<martin.hepp@ebusiness-unibw.org> wrote:
> Dear Chris, all:
>
> Thanks for your hard work on this, and it surely gives some input for
further research.
>
> I want to stress, though, that the absolute numbers found are NOT
> representative of the Web as a whole, likely because of the limited
> coverage of the CommonCrawl corpus.
>
> Just a simple example: The RDFa extractor details page
>
> http://s3.amazonaws.com/webdatacommons-2/stats/top_classes_for_extractor_html-rdfa.html
>
> says the extractors found 23,825 entities of gr:Offering.
>
> In comparison, the few sites listed at
>
> http://wiki.goodrelations-vocabulary.org/Datasets
>
> alone account for ca. 25,000,000 entities of that type, i.e. roughly
> 1,000 times more.
>
> I am not criticizing your valuable work; I only want to prevent people
> from drawing incorrect conclusions from the preliminary data, because the
> crawl does not seem to be a *representative* or anywhere near *complete*
> corpus. So do not take the numbers as truth without appropriate
> corrections for bias in the sample.
>

Just to be clear, are you claiming that that wiki listing of specifically
targeted data sets is *more representative* of the totality of the web?

I know that CommonCrawl never claimed to be a *complete* corpus, and
certainly all samples have bias, but, if anything, I'd have thought that a
targeted, curated list would have *more bias* than a (semi-?)random
automated web crawl.

It seems strange to criticize a sample by saying that it didn't find all of
the entities of a certain type.  One would never expect a sample to do that.
If it found 0.1% of the GR entities while sampling 0.2% of the web, then
there's a 50% undercount in the biased sample, but saying it missed 99.9% of
the entities ignores the nature of sampling.
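
To make that arithmetic concrete, here is a minimal sketch in Python. It
uses the figures quoted in this thread; the 0.1%/0.2% rates are the
hypothetical illustration above, not measured values:

    # Sampling arithmetic from the paragraph above. The rates are
    # hypothetical figures from this email, not measurements.
    entities_found = 23825       # gr:Offering entities in the extract
    entities_known = 25000000    # ca. figure from the GoodRelations wiki

    # Martin's comparison: known entities vs. extracted entities.
    print(entities_known / entities_found)   # ~1049, i.e. roughly 1,000x

    # Tom's point: judge the count against the sampling rate instead.
    found_rate = 0.001      # hypothetical: 0.1% of GR entities found
    sampled_rate = 0.002    # hypothetical: 0.2% of the web crawled
    print(1 - found_rate / sampled_rate)     # 0.5, i.e. a 50% undercount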

I'd certainly be interested in numbers from the entire web (or a completely
unbiased sample), so if you've got those, feel free to share.

Tom

p.s. a hearty thank-you to Chris' team for doing the work and sharing the data.
Some data beats no data every time.

Received on Monday, 26 March 2012 16:12:28 UTC