ANN: Corpus of 147 million quasi-relational Web tables released for public download

Hi all,

 

The Web Data Commons team is happy to announce the release of a corpus
containing 147 million quasi-relational Web tables.

 

The Web contains vast amounts of HTML tables. Most of these tables are used
for layout purposes, but a fraction of the tables is also quasi-relational,
meaning that they contain structured data describing a set of entities.  

 

A corpus of Web tables can be useful for research and applications in areas
such as data search, table augmentation, and knowledge base construction, as
well as for various NLP tasks. Of course, Web tables are not exactly Linked
Data, but we believe that it could make sense for some applications to blend
this data with data from the LOD cloud as well as other Web data such as RDFa
and Microformat data.

 

The WDC Web Tables corpus has been extracted from the 2012 version of the
Common Crawl [1], the largest Web crawl that is available to the public. The
corpus contains the subset of the 11 billion HTML tables found in the Common
Crawl that are likely quasi-relational.

 

More information about the corpus and its application domains, as well as
instructions for downloading it, can be found at

 

http://webdatacommons.org/webtables/

 

We want to thank the Common Crawl Foundation for providing their great web
crawl and thus enabling the creation of the WDC Web Tables corpus.

 

The creation of the WDC Web Tables corpus was supported by the German
Research Foundation (DFG), the EU FP7 project PlanetData and by Amazon Web
Services. We thank our sponsors for their support.

 

 

Enjoy the new corpus!

 

Best regards,

 

Petar Ristoski, Oliver Lehmberg, Heiko Paulheim, Robert Meusel, and
Christian Bizer

 

 

[1] http://commoncrawl.org/

 

--

Prof. Dr. Christian Bizer

Data and Web Science Research Group

Universität Mannheim, Germany 
chris@informatik.uni-mannheim.de

www.bizer.de

 

Received on Thursday, 6 March 2014 13:34:08 UTC