Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Microformat data extracted from 65.4 million websites

From: László Török <ltorokjr@gmail.com>
Date: Mon, 26 Mar 2012 18:38:55 +0200
Message-ID: <CAMQXnefhCeGtA4-2BZgibhxi+kdH-3bub_roGWAVNXMXP8t43Q@mail.gmail.com>
To: "public-vocabs@w3.org" <public-vocabs@w3.org>
On Monday, 26 March 2012, Martin Hepp wrote:
>
> I do not think that most people take Common Crawl as a 0.0001 % sample of
> the Web.
>
I personally hope for wide domain coverage more than an exhaustive crawl of
individual domains. Given Web Data Commons, everybody can now visit the
subset of URIs they are interested in and run an exhaustive crawl there,
looking for more data islands embedded in pages that were not part of
Common Crawl.

Laszlo
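
A minimal sketch of that idea, assuming the extracted data is available as
N-Quads lines whose fourth element is the URL of the page the data came
from. The function name, the regular expression, and the example.com /
example.org URLs are made up for illustration; a real N-Quads parser would
be more robust than this regex.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
GR_OFFERING = "http://purl.org/goodrelations/v1#Offering"

# Matches quads of the form: subject <predicate> <object> <page-url> .
# Quads with literal objects do not match and are skipped (assumption:
# only rdf:type statements matter for picking crawl seeds).
QUAD_RE = re.compile(r'^(\S+)\s+<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$')


def seed_urls_by_host(lines, rdf_class=GR_OFFERING):
    """Group page URLs that contain an entity of rdf_class by host name."""
    seeds = defaultdict(set)
    for line in lines:
        match = QUAD_RE.match(line.strip())
        if not match:
            continue
        _subject, predicate, obj, page_url = match.groups()
        if predicate == RDF_TYPE and obj == rdf_class:
            seeds[urlsplit(page_url).hostname].add(page_url)
    return dict(seeds)


# Hypothetical sample quads, not taken from the actual data set.
sample = [
    f"_:b1 <{RDF_TYPE}> <{GR_OFFERING}> <http://shop.example.com/item/1> .",
    f"_:b2 <{RDF_TYPE}> <{GR_OFFERING}> <http://shop.example.com/item/2> .",
    f"_:b3 <{RDF_TYPE}> <http://xmlns.com/foaf/0.1/Person> <http://blog.example.org/about> .",
    '_:b1 <http://purl.org/goodrelations/v1#name> "Widget"@en <http://shop.example.com/item/1> .',
]

for host, urls in seed_urls_by_host(sample).items():
    print(host, sorted(urls))
```

The hosts returned this way would only be starting points; the exhaustive
crawl of each host is a separate step.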

> So one can do many useful things with the data, and I already thanked
> everybody involved for it, but one should not take it as "all the
> structured data of the Web".
>
> Martin
>
>
> On Mar 26, 2012, at 5:44 PM, Tom Morris wrote:
>
> > On Mon, Mar 26, 2012 at 10:11 AM, Martin Hepp
> > <martin.hepp@ebusiness-unibw.org> wrote:
> >> Dear Chris, all:
> >>
> >> Thanks for your hard work on this, and it surely gives some input for
> >> further research.
> >>
> >> I want to stress, though, that the absolute numbers found are NOT
> >> representative for the Web as a whole, likely because of the limited
> >> coverage of the CommonCrawl corpus.
> >>
> >> Just a simple example: The RDFa extractor details page
> >>
> >>
> >> http://s3.amazonaws.com/webdatacommons-2/stats/top_classes_for_extractor_html-rdfa.html
> >>
> >> says the extractors found 23,825 Entities of gr:Offering.
> >>
> >> In comparison, the few sites listed at
> >>
> >> http://wiki.goodrelations-vocabulary.org/Datasets
> >>
> >> alone account for ca. 25,000,000 entities of that type, so obviously
> >> 1000 times more.
> >>
> >> I am not criticizing your valuable work, I only want to prevent people
> >> from drawing incorrect conclusions from the preliminary data, because
> >> the crawl does not seem to be a *representative* or anywhere near
> >> *complete* corpus. So do not take the numbers for truth without any
> >> appropriate corrections for bias in the sample.
> >>
> >
> > Just to be clear, are you claiming that that wiki listing of
> > specifically targeted data sets is *more representative* of the
> > totality of the web?
> >
> > I know that CommonCrawl never claimed to be a *complete* corpus and
> > certainly all samples have bias, but, if anything, I'd thought that a
> > targeted curated list would have *more bias* than a (semi-?)random
> > automated web crawl.
> >
> > It seems strange to criticize a sample by saying that it didn't find
> > all of the entities of a certain type.  One would never expect a
> > sample to do that.  If it found .1% of the GR entities while sampling
> > .2% of the web, then there's a 50% undercount in the biased sample,
> > but saying it missed 99.9% of the entities ignores the nature of
> > sampling.
> >
> > I'd certainly be interested in numbers from the entire web (or a
> > completely unbiased sample), so if you've got those, feel free to
> > share.
> >
> > Tom
> >
> > p.s. a hearty thanks to Chris' team for doing the work and sharing the
> > data.  Some data beats no data every time.
>
> --------------------------------------------------------
> martin hepp
> e-business & web science research group
> universitaet der bundeswehr muenchen
>
> e-mail:  hepp@ebusiness-unibw.org
> phone:   +49-(0)89-6004-4217
> fax:     +49-(0)89-6004-4620
> www:     http://www.unibw.de/ebusiness/ (group)
>         http://www.heppnetz.de/ (personal)
> skype:   mfhepp
> twitter: mfhepp
>
> Check out GoodRelations for E-Commerce on the Web of Linked Data!
> =================================================================
> * Project Main Page: http://purl.org/goodrelations/
>
>
>
>
>
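
The two numeric arguments quoted above can be checked with a few lines of
arithmetic. This is only a sketch: the function name is made up, Tom's
.1% / .2% figures are his hypothetical, and the entity counts are the ones
Martin cites, not independently verified.

```python
def undercount(found_share, sampled_share):
    """Fraction by which a class is under-represented relative to sample size.

    found_share:   share of all entities of the class that the sample found
    sampled_share: share of the whole web that the sample covers
    """
    return 1.0 - found_share / sampled_share


# Tom's hypothetical: .1% of the GR entities found while sampling .2% of the web.
tom_undercount = undercount(0.001, 0.002)

# Martin's comparison: gr:Offering entities found by the RDFa extractor
# versus the ca. 25,000,000 attributed to the sites listed on the wiki.
found_in_crawl = 23_825
listed_on_wiki = 25_000_000
ratio = listed_on_wiki / found_in_crawl
found_share = found_in_crawl / listed_on_wiki

print(f"undercount in Tom's example: {tom_undercount:.0%}")
print(f"wiki listing vs. crawl: about {ratio:.0f} times more")
print(f"crawl found about {found_share:.2%} of the listed entities")
```

The ratio alone says nothing about bias; as Tom notes, it would have to be
compared against the share of the web that the crawl covers, which is not
given in this thread.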

-- 
László Török
Received on Monday, 26 March 2012 16:45:04 GMT
