
Re: ANN: WebDataCommons.org - Offering 3.2 billion quads of current RDFa, Microdata and Microformat data extracted from 65.4 million websites

From: Martin Hepp <martin.hepp@ebusiness-unibw.org>
Date: Mon, 26 Mar 2012 20:31:23 +0200
Cc: "public-vocabs@w3.org Vocabularies" <public-vocabs@w3.org>, public-lod@w3.org
Message-Id: <4D348AD6-5142-4AE6-9E7B-AD8D2B015D84@ebusiness-unibw.org>
To: László Török <ltorokjr@gmail.com>
Hi,

a quote from the mission statement on the webdatacommons.org page:

"More and more websites have started to embed structured data describing products, people, organizations, places, events into their HTML pages. The Web Data Commons project extracts this data from several billion web pages and provides the extracted data for download. Web Data Commons thus enables you to use the data without needing to crawl the Web yourself."

I think this part of the communication is misleading.

How can I *use* the data if it covers only 0.0001 % of the Web? Most SPARQL queries against this data for specific things will then yield no results.

For instance, NYC has ca. 89,655 hotels [1]. If Common Crawl is that small a sample, it would contain at most one of those hotels to search for. I cannot see how any useful application could be built on top of a corpus that is such a small sample of the Web.
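To make the back-of-the-envelope arithmetic explicit (both figures are the rough estimates used above, not verified numbers):

```python
# Back-of-the-envelope check using the message's own rough figures:
# a 0.0001 % sample of the Web vs. ca. 89,655 NYC hotels.
sample_fraction = 0.0001 / 100   # 0.0001 % expressed as a fraction
nyc_hotels = 89_655              # figure cited from nycgo.com [1]

expected_hotels_in_sample = sample_fraction * nyc_hotels
print(f"expected NYC hotels in the sample: {expected_hotels_in_sample:.3f}")
# prints: expected NYC hotels in the sample: 0.090
```

On these assumptions the expected count is well below one, which is the point: a uniform sample that small will usually contain no hotel at all.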

Again, I am NOT criticizing the work per se; I just raised concerns that this is not "the Web" but another small, maybe representative, maybe biased sample of it.


[1] http://www.nycgo.com/articles/nyc-statistics-page

PS: I include public-lod@w3.org, because that may be a better audience than public-vocabs@w3.org


On Mar 26, 2012, at 6:38 PM, László Török wrote:

> 
> 
> On Monday, 26 March 2012, Martin Hepp wrote:
> I do not think that most people take Common Crawl as a 0.0001 % sample of the Web.
> 
> I personally hope for wide domain coverage rather than an exhaustive crawl of individual domains. Given Web Data Commons, everybody can now visit the subset of URIs they are interested in and do an exhaustive crawl of their own, looking for more data islands embedded in pages that were not part of Common Crawl.
> 
> Laszlo
> So one can do many useful things with the data, and I already thanked everybody involved for it, but one should not take it as "all the structured data of the Web".
> 
> Martin
> 
> 
> On Mar 26, 2012, at 5:44 PM, Tom Morris wrote:
> 
> > On Mon, Mar 26, 2012 at 10:11 AM, Martin Hepp
> > <martin.hepp@ebusiness-unibw.org> wrote:
> >> Dear Chris, all:
> >>
> >> Thanks for your hard work on this, and it surely gives some input for further research.
> >>
> >> I want to stress, though, that the absolute numbers found are NOT representative of the Web as a whole, likely because of the limited coverage of the CommonCrawl corpus.
> >>
> >> Just a simple example: The RDFa extractor details page
> >>
> >> http://s3.amazonaws.com/webdatacommons-2/stats/top_classes_for_extractor_html-rdfa.html
> >>
> >> says the extractors found 23,825 entities of type gr:Offering.
> >>
> >> In comparison, the few sites listed at
> >>
> >> http://wiki.goodrelations-vocabulary.org/Datasets
> >>
> >> alone account for ca. 25,000,000 entities of that type, so roughly 1,000 times more.
> >>
> >> I am not criticizing your valuable work; I only want to prevent people from drawing incorrect conclusions from the preliminary data, because the crawl does not seem to be a *representative* or anywhere near *complete* corpus. So do not take the numbers as truth without appropriate corrections for bias in the sample.
> >>
> >
> > Just to be clear, are you claiming that that wiki listing of
> > specifically targeted data sets is *more representative* of the
> > totality of the web?
> >
> > I know that CommonCrawl never claimed to be a *complete* corpus and
> > certainly all samples have bias, but, if anything, I'd thought that a
> > targeted curated list would have *more bias* than a (semi-?)random
> > automated web crawl.
> >
> > It seems strange to criticize a sample by saying that it didn't find
> > all of the entities of a certain type.  One would never expect a
> > sample to do that.  If it found .1% of the GR entities while sampling
> > .2% of the web, then there's a 50% undercount in the biased sample,
> > but saying it missed 99.9% of the entities ignores the nature of
> > sampling.
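The sampling argument above can be made numeric; the 0.1 % and 0.2 % figures are Tom's hypothetical values, not measurements:

```python
# Hypothetical figures from the paragraph above: the crawl finds 0.1 %
# of all gr:Offering entities while sampling 0.2 % of the Web.
entities_found_fraction = 0.001   # 0.1 % of all GR entities found
web_sampled_fraction = 0.002      # 0.2 % of the Web sampled

# An unbiased 0.2 % sample should find ~0.2 % of the entities, so the
# relevant comparison is found/sampled, not found/total.
coverage = entities_found_fraction / web_sampled_fraction
undercount = 1.0 - coverage
print(f"coverage = {coverage:.0%}, undercount = {undercount:.0%}")
# prints: coverage = 50%, undercount = 50%
```

A 50 % undercount relative to sample size signals bias; "missing 99.9 % of entities" merely restates that only a tiny fraction of the Web was sampled.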
> >
> > I'd certainly be interested in numbers from the entire web (or a
> > completely unbiased sample), so if you've got those, feel free to
> > share.
> >
> > Tom
> >
> > p.s. a hearty thanks to Chris' team for doing the work and sharing the
> > data.  Some data beats no data every time.
> 
> -- 
> László Török
> 

--------------------------------------------------------
martin hepp
e-business & web science research group
universitaet der bundeswehr muenchen

e-mail:  hepp@ebusiness-unibw.org
phone:   +49-(0)89-6004-4217
fax:     +49-(0)89-6004-4620
www:     http://www.unibw.de/ebusiness/ (group)
         http://www.heppnetz.de/ (personal)
skype:   mfhepp 
twitter: mfhepp

Check out GoodRelations for E-Commerce on the Web of Linked Data!
=================================================================
* Project Main Page: http://purl.org/goodrelations/
Received on Monday, 26 March 2012 18:31:52 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Tuesday, 22 May 2012 06:49:00 GMT