
RE: ANN: WebDataCommons.org - Offering 3.2 billion quads of current RDFa, Microdata and Microformat data extracted from 65.4 million websites

From: Jim Rhyne <jrhyne@thematix.com>
Date: Mon, 26 Mar 2012 09:11:45 -0700
To: "'Tom Morris'" <tfmorris@gmail.com>, "'Martin Hepp'" <martin.hepp@ebusiness-unibw.org>
Cc: "'Chris Bizer'" <chris@bizer.de>, <public-vocabs@w3.org>
Message-ID: <014501cd0b6b$252fd3b0$6f8f7b10$@com>

Martin's points were that:
a) the sampling procedure used to establish the corpus is not well defined;
b) the results seem to show some bias.

This is helpful feedback to WebDataCommons.

Some data beats no data only when you know the characteristics of the "some".


-----Original Message-----
From: Tom Morris [mailto:tfmorris@gmail.com] 
Sent: Monday, March 26, 2012 8:45 AM
To: Martin Hepp
Cc: Chris Bizer; public-vocabs@w3.org
Subject: Re: ANN: WebDataCommons.org - Offering 3.2 billion quads of current
RDFa, Microdata and Microformat data extracted from 65.4 million websites

On Mon, Mar 26, 2012 at 10:11 AM, Martin Hepp
<martin.hepp@ebusiness-unibw.org> wrote:
> Dear Chris, all:
> Thanks for your hard work on this; it surely gives some input for
> further research.
> I want to stress, though, that the absolute numbers found are NOT
> representative of the Web as a whole, likely because of the limited
> coverage of the CommonCrawl corpus.
> Just a simple example: the RDFa extractor details page
> http://s3.amazonaws.com/webdatacommons-2/stats/top_classes_for_extractor_html-rdfa.html
> says the extractors found 23,825 entities of gr:Offering.
> In comparison, the few sites listed at
> http://wiki.goodrelations-vocabulary.org/Datasets
> alone account for ca. 25,000,000 entities of that type, so obviously about
> 1,000 times more.
> I am not criticizing your valuable work; I only want to prevent people from
> drawing incorrect conclusions from the preliminary data, because the crawl
> does not seem to be a *representative* or anywhere near *complete* corpus.
> So do not take the numbers as truth without appropriate corrections for
> bias in the sample.

Just to be clear, are you claiming that that wiki listing of specifically
targeted data sets is *more representative* of the totality of the web?

I know that CommonCrawl never claimed to be a *complete* corpus and
certainly all samples have bias, but, if anything, I'd thought that a
targeted curated list would have *more bias* than a (semi-?)random automated
web crawl.

It seems strange to criticize a sample by saying that it didn't find all of
the entities of a certain type.  One would never expect a sample to do that.
If it found 0.1% of the GR entities while sampling 0.2% of the web, then
there's a 50% undercount in the biased sample, but saying it missed 99.9% of
the entities ignores the nature of sampling.
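
To make that arithmetic concrete, here is a minimal Python sketch. It assumes,
purely for illustration, that the crawl covers 0.2% of the web (the hypothetical
figure used above) and takes Martin's ca. 25,000,000 gr:Offering entities as the
reference total: an unbiased 0.2% sample would be expected to find about 50,000
of them, so finding 23,825 (about 0.1%) is roughly a 50% undercount relative to
that expectation, not a 99.9% failure.

# Illustrative sketch only. The 0.2% crawl coverage is a hypothetical figure
# from the example above; 25,000,000 is Martin's rough estimate for the
# listed GoodRelations sites alone.
total_gr_entities = 25_000_000   # reference total of gr:Offering entities
found_in_crawl = 23_825          # gr:Offering entities in the WebDataCommons extract
crawl_coverage = 0.002           # assumed fraction of the web that was crawled

expected_if_unbiased = total_gr_entities * crawl_coverage   # 50,000
observed_fraction = found_in_crawl / total_gr_entities      # ~0.00095, i.e. ~0.1%
undercount = 1 - found_in_crawl / expected_if_unbiased      # ~0.52

print(f"observed fraction of entities: {observed_fraction:.3%}")
print(f"expected count if unbiased: {expected_if_unbiased:,.0f}")
print(f"undercount vs. expectation: {undercount:.0%}")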

I'd certainly be interested in numbers from the entire web (or a completely
unbiased sample), so if you've got those, feel free to share.


p.s. a hearty thank-you to Chris' team for doing the work and sharing the data.
Some data beats no data every time.
Received on Monday, 26 March 2012 16:12:28 UTC
