- From: Martin Hepp <martin.hepp@ebusiness-unibw.org>
- Date: Mon, 26 Mar 2012 18:16:07 +0200
- To: Tom Morris <tfmorris@gmail.com>
- Cc: Chris Bizer <chris@bizer.de>, public-vocabs@w3.org
Tom:

My impression was that many people understand the CommonCrawl to be a "full" crawl, to the extent possible, and *not a random sample*. Common Crawl's mission statement says it was a crawl of *the Web*, not a sample of the Web (to the extent possible, of course).

I am not claiming that the crawl *as a sample* is biased. I am just stressing that you cannot take the numbers and say anything in absolute terms, and I bet that quite a few people will be tempted to read the stats as stats of the Web, not stats from a sample whose representativeness is untested.

Of course, my list of links is strongly biased. But if I know, from a manually compiled list of unsystematically collected sites, that there are more than 25 million gr:Offering entities on the public Web, then claiming there are 25,000 in a crawl of 1.7 bn pages seems strange, to say the least. Either the crawl contains just 0.1 % of the Web, or the RDFa vocabulary frequencies are not representative. I do not think that most people take Common Crawl as a 0.1 % sample of the Web.

So one can do many useful things with the data, and I already thanked everybody involved for it, but one should not take it as "all the structured data of the Web".

Martin

On Mar 26, 2012, at 5:44 PM, Tom Morris wrote:

> On Mon, Mar 26, 2012 at 10:11 AM, Martin Hepp
> <martin.hepp@ebusiness-unibw.org> wrote:
>> Dear Chris, all:
>>
>> Thanks for your hard work on this, and it surely gives some input for further research.
>>
>> I want to stress, though, that the absolute numbers found are NOT representative of the Web as a whole, likely because of the limited coverage of the CommonCrawl corpus.
>>
>> Just a simple example: The RDFa extractor details page
>>
>> http://s3.amazonaws.com/webdatacommons-2/stats/top_classes_for_extractor_html-rdfa.html
>>
>> says the extractors found 23,825 entities of gr:Offering.
>>
>> In comparison, the few sites listed at
>>
>> http://wiki.goodrelations-vocabulary.org/Datasets
>>
>> alone account for ca. 25,000,000 entities of that type, so obviously 1000 times more.
>>
>> I am not criticizing your valuable work; I only want to prevent people from drawing incorrect conclusions from the preliminary data, because the crawl does not seem to be a *representative* or anywhere near *complete* corpus. So do not take the numbers for truth without appropriate corrections for bias in the sample.
>>
>
> Just to be clear, are you claiming that that wiki listing of
> specifically targeted data sets is *more representative* of the
> totality of the web?
>
> I know that CommonCrawl never claimed to be a *complete* corpus and
> certainly all samples have bias, but, if anything, I'd have thought that a
> targeted curated list would have *more bias* than a (semi-?)random
> automated web crawl.
>
> It seems strange to criticize a sample by saying that it didn't find
> all of the entities of a certain type. One would never expect a
> sample to do that. If it found .1% of the GR entities while sampling
> .2% of the web, then there's a 50% undercount in the biased sample,
> but saying it missed 99.9% of the entities ignores the nature of
> sampling.
>
> I'd certainly be interested in numbers from the entire web (or a
> completely unbiased sample), so if you've got those, feel free to
> share.
>
> Tom
>
> p.s. a hearty thanks to Chris' team for doing the work and sharing the
> data. Some data beats no data every time.

--------------------------------------------------------
martin hepp
e-business & web science research group
universitaet der bundeswehr muenchen

e-mail: hepp@ebusiness-unibw.org
phone: +49-(0)89-6004-4217
fax: +49-(0)89-6004-4620
www: http://www.unibw.de/ebusiness/ (group)
     http://www.heppnetz.de/ (personal)
skype: mfhepp
twitter: mfhepp

Check out GoodRelations for E-Commerce on the Web of Linked Data!
=================================================================
* Project Main Page: http://purl.org/goodrelations/
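
Editorial note: the arithmetic behind the comparison in this thread can be checked with a short sketch. The two entity counts are the figures quoted in the messages; the variable names, and the 0.2 % / 0.1 % coverage values taken from Tom's hypothetical, are illustrative assumptions and not measurements of the crawl.

```python
# Figures quoted in the thread; variable names are illustrative only.
offerings_in_crawl = 23_825      # gr:Offering entities reported by the RDFa extractor
offerings_known = 25_000_000     # approximate count from the manually compiled list

# Share of the known offerings that the crawl actually surfaced.
share_found = offerings_in_crawl / offerings_known
print(f"share of known offerings found: {share_found:.4%}")
print(f"known offerings per offering found: {offerings_known / offerings_in_crawl:.0f}")

# Tom's hypothetical: a sample covering 0.2% of the Web that finds 0.1% of the entities.
sample_coverage = 0.002
entity_coverage = 0.001
undercount = 1 - entity_coverage / sample_coverage
print(f"undercount relative to an unbiased sample: {undercount:.0%}")
```

The first ratio comes out at roughly one tenth of one percent, which is the ".1%" Tom uses. Whether that indicates bias depends on what fraction of the Web the crawl covers, which neither message establishes.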
Received on Monday, 26 March 2012 16:16:39 UTC