- From: Jim Rhyne <jrhyne@thematix.com>
- Date: Mon, 26 Mar 2012 09:11:45 -0700
- To: "'Tom Morris'" <tfmorris@gmail.com>, "'Martin Hepp'" <martin.hepp@ebusiness-unibw.org>
- Cc: "'Chris Bizer'" <chris@bizer.de>, <public-vocabs@w3.org>
Martin's points were that: a) the sampling procedure used to establish the corpus is not well defined; b) the results seem to show some bias. This is helpful feedback to WebDataCommons. Some data beats no data only when you know the characteristics of the "some data".

Jim

-----Original Message-----
From: Tom Morris [mailto:tfmorris@gmail.com]
Sent: Monday, March 26, 2012 8:45 AM
To: Martin Hepp
Cc: Chris Bizer; public-vocabs@w3.org
Subject: Re: ANN: WebDataCommons.org - Offering 3.2 billion quads of current RDFa, Microdata and Microformat data extracted from 65.4 million websites

On Mon, Mar 26, 2012 at 10:11 AM, Martin Hepp <martin.hepp@ebusiness-unibw.org> wrote:
> Dear Chris, all:
>
> Thanks for your hard work on this; it surely gives some input for further research.
>
> I want to stress, though, that the absolute numbers found are NOT representative of the Web as a whole, likely because of the limited coverage of the CommonCrawl corpus.
>
> Just a simple example: the RDFa extractor details page
>
> http://s3.amazonaws.com/webdatacommons-2/stats/top_classes_for_extractor_html-rdfa.html
>
> says the extractors found 23,825 entities of gr:Offering.
>
> In comparison, the few sites listed at
>
> http://wiki.goodrelations-vocabulary.org/Datasets
>
> alone account for ca. 25,000,000 entities of that type, i.e. roughly 1,000 times more.
>
> I am not criticizing your valuable work; I only want to prevent people from drawing incorrect conclusions from the preliminary data, because the crawl does not seem to be a *representative* or anywhere near *complete* corpus. So do not take the numbers at face value without appropriate corrections for bias in the sample.

Just to be clear, are you claiming that that wiki listing of specifically targeted data sets is *more representative* of the totality of the web?
I know that CommonCrawl never claimed to be a *complete* corpus, and certainly all samples have bias, but if anything I'd have thought that a targeted, curated list would have *more bias* than a (semi-?)random automated web crawl.

It seems strange to criticize a sample by saying that it didn't find all of the entities of a certain type; one would never expect a sample to do that. If it found 0.1% of the GR entities while sampling 0.2% of the web, then there's a 50% undercount in the biased sample, but saying it missed 99.9% of the entities ignores the nature of sampling.

I'd certainly be interested in numbers from the entire web (or a completely unbiased sample), so if you've got those, feel free to share.

Tom

p.s. A hearty thanks to Chris's team for doing the work and sharing the data. Some data beats no data every time.
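Tom's back-of-the-envelope undercount argument can be sketched in a few lines of Python. The fractions below are the hypothetical ones from his example (not measured values), and `undercount` is a name introduced here for illustration:

```python
def undercount(found_fraction, sample_fraction):
    """Relative undercount of an entity type in a biased sample.

    found_fraction:  share of all entities of the type that the crawl found
    sample_fraction: share of the whole web the crawl covered

    A perfectly representative sample would have
    found_fraction == sample_fraction, giving an undercount of 0.
    """
    return 1 - found_fraction / sample_fraction

# Tom's hypothetical: the crawl found 0.1% of the GR entities while
# sampling 0.2% of the web -> a 50% undercount, not a 99.9% miss.
print(undercount(0.001, 0.002))  # -> 0.5
```

The point of the calculation is that a sample should be judged against its coverage fraction, not against the absolute entity count on the whole web.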
Received on Monday, 26 March 2012 16:12:28 UTC