- From: Tom Morris <tfmorris@gmail.com>
- Date: Mon, 26 Mar 2012 11:44:48 -0400
- To: Martin Hepp <martin.hepp@ebusiness-unibw.org>
- Cc: Chris Bizer <chris@bizer.de>, public-vocabs@w3.org
On Mon, Mar 26, 2012 at 10:11 AM, Martin Hepp <martin.hepp@ebusiness-unibw.org> wrote:
> Dear Chris, all:
>
> Thanks for your hard work on this, and it surely gives some input for further research.
>
> I want to stress, though, that the absolute numbers found are NOT representative for the Web as a whole, likely because of the limited coverage of the CommonCrawl corpus.
>
> Just a simple example: The RDFa extractor details page
>
> http://s3.amazonaws.com/webdatacommons-2/stats/top_classes_for_extractor_html-rdfa.html
>
> says the extractors found 23,825 Entities of gr:Offering.
>
> In comparison, the few sites listed at
>
> http://wiki.goodrelations-vocabulary.org/Datasets
>
> alone account for ca. 25,000,000 entities of that type, so obviously 1000 times more.
>
> I am not criticizing your valuable work, I only want to prevent people to draw incorrect conclusions from the preliminary data, because the crawl does not seem to be a *representative* or anywhere near *complete* corpus. So do not not take the numbers for truth without any appropriate corrections for bias in the sample.
>

Just to be clear, are you claiming that that wiki listing of specifically targeted data sets is *more representative* of the totality of the web? I know that CommonCrawl never claimed to be a *complete* corpus and certainly all samples have bias, but, if anything, I'd thought that a targeted curated list would have *more bias* than a (semi-?)random automated web crawl.

It seems strange to criticize a sample by saying that it didn't find all of the entities of a certain type. One would never expect a sample to do that. If it found .1% of the GR entities while sampling .2% of the web, then there's a 50% undercount in the biased sample, but saying it missed 99.9% of the entities ignores the nature of sampling.

I'd certainly be interested in numbers from the entire web (or a completely unbiased sample), so if you've got those, feel free to share.

Tom

p.s. a hearty thanks to Chris' team for doing the work and sharing the data. Some data beats no data every time.
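
For anyone who wants to check the sampling arithmetic, here is a rough sketch in Python. The .1% and .2% figures are the hypothetical ones from the message, not measurements of CommonCrawl's actual coverage, and the function name `undercount` is made up for illustration.

```python
def undercount(fraction_found: float, fraction_sampled: float) -> float:
    """Relative undercount of an entity type in a sample.

    fraction_found:   share of all entities of that type the sample contains
    fraction_sampled: share of the whole web the sample covers
    An unbiased sample would have the two shares equal, giving 0.0.
    """
    if fraction_sampled <= 0:
        raise ValueError("fraction_sampled must be positive")
    return 1.0 - fraction_found / fraction_sampled


# Hypothetical figures from the message: .1% of the entities, .2% of the web.
print(f"{undercount(0.001, 0.002):.0%}")  # 50%

# The naive reading compares against the whole web instead of the sample size.
print(f"{1.0 - 0.001:.1%}")  # 99.9%
```

The point of dividing by `fraction_sampled` is that a sample is only expected to contain its proportional share of the entities, so the gap should be measured against that share rather than against the total.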
Received on Monday, 26 March 2012 15:45:22 UTC