ANN: Web Data Commons - Schema.org Table Corpus published: 4.2 million tables filled with schema.org data from many websites

Hi all,

We are happy to announce the release of the WDC Schema.org Table Corpus. The
corpus consists of ~4.2 million relational tables that describe a wide range
of entities and use the schema.org vocabulary as a shared schema.

The three classes covered by the largest number of tables are Product (~2
million tables with ~232 million rows in total), Person (~922,000 tables
with ~6.6 million rows in total) and LocalBusiness (~466,000 tables with
~7.4 million rows in total). Overall, 13 of the 43 classes are covered by
more than 10,000 tables each; another 7 classes are covered by more than
1,000 tables each.

The Web Data Commons project regularly extracts schema.org annotations from
the Common Crawl, a large public web corpus, and offers the extracted data
for public download in the form of RDF dumps. The WDC Schema.org Table
Corpus was generated by grouping the extracted RDF data from the December
2020 version of the WDC schema.org data sets into relational tables. A
single table contains all entities of a specific schema.org class that have
been extracted from a specific host (website). 
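
The grouping step described above can be illustrated with a minimal Python
sketch. It is not the actual WDC pipeline: the record layout (the keys
"type", "page_url" and "properties") and the function name are assumptions
made for illustration, and the sample entities are invented.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_into_tables(entities):
    """Group entity records into one table per (schema.org class, host).

    The record keys used here are hypothetical; the real extraction
    output of the WDC project may be structured differently.
    """
    tables = defaultdict(list)
    for entity in entities:
        # The host is taken from the URL of the page the entity came from.
        host = urlparse(entity["page_url"]).netloc
        # Each entity becomes one row of the table for its class and host.
        tables[(entity["type"], host)].append(entity["properties"])
    return dict(tables)


# Invented sample data: two products from one host, one person from another.
sample = [
    {"type": "Product", "page_url": "http://shop.example.com/a",
     "properties": {"name": "Lamp"}},
    {"type": "Product", "page_url": "http://shop.example.com/b",
     "properties": {"name": "Desk"}},
    {"type": "Person", "page_url": "http://blog.example.org/me",
     "properties": {"name": "Ada"}},
]
tables = group_into_tables(sample)
```

In this sketch the two Product entities end up as two rows of a single
table keyed by ("Product", "shop.example.com"), while the Person entity
forms its own one-row table.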

The overall size of all tables in zipped form is ~50 GB. For each of the 43
schema.org classes, we offer three separate download files (in JSON format):
one containing the 100 largest tables, one containing the remaining tables
with at least 3 rows, and one containing all smaller remaining tables. The
table downloads are accompanied by easy-to-view samples and further files
containing in-depth profiling statistics about the tables.
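
The three-way split of the tables of one class can be sketched roughly as
follows. This is only an illustration of the partitioning rule stated
above, not the code used to build the download files; the function name,
its parameters and the in-memory representation (a dict mapping a host to
its list of rows) are assumptions.

```python
def split_by_size(tables, top_n=100, min_rows=3):
    """Split the tables of one class into three buckets by row count.

    `tables` maps a host name to its list of rows (hypothetical layout).
    Returns (largest, medium, small) as lists of (host, rows) pairs.
    """
    # Rank tables from most rows to fewest.
    ranked = sorted(tables.items(), key=lambda kv: len(kv[1]), reverse=True)
    largest = ranked[:top_n]
    rest = ranked[top_n:]
    # Of the remaining tables, separate those with at least min_rows rows.
    medium = [kv for kv in rest if len(kv[1]) >= min_rows]
    small = [kv for kv in rest if len(kv[1]) < min_rows]
    return largest, medium, small


# Invented example with top_n=1 so that all three buckets are non-empty.
example = {
    "a.example": [{}] * 5,
    "b.example": [{}] * 3,
    "c.example": [{}] * 1,
    "d.example": [{}] * 4,
}
largest, medium, small = split_by_size(example, top_n=1)
```

With the default values (top_n=100, min_rows=3) the buckets correspond to
the three files per class described above.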

For more information about the corpus, and to download it, please visit

http://webdatacommons.org/structureddata/schemaorgtables/

We plan to use the corpus ourselves for experimenting with supervised and
self-supervised data integration methods. Beyond that, we hope that the
corpus will also prove useful to others, as tabular data may be easier to
process than the RDF dumps that the Web Data Commons project traditionally
publishes.

So have fun with the new corpus! 

Ralph Peeters and Chris Bizer

--

Christian Bizer

Data and Web Science Group

University of Mannheim, Germany 

http://dws.informatik.uni-mannheim.de/bizer

Received on Monday, 29 March 2021 12:44:49 UTC