- From: Chris Bizer <chris@bizer.de>
- Date: Mon, 29 Mar 2021 14:44:26 +0200
- To: <public-schemaorg@w3.org>, <public-vocabs@w3.org>
- Message-ID: <045701d72499$4107b7a0$c31726e0$@bizer.de>
Hi all,

we are happy to announce the release of the WDC Schema.org Table Corpus.

The corpus consists of ~4.2 million relational tables describing a wide range of entities and using the schema.org vocabulary as a shared schema. The three classes covered by the largest number of tables are Product (~2 million tables with ~232 million rows overall), Person (~922,000 tables with ~6.6 million rows overall) and LocalBusiness (~466,000 tables with ~7.4 million rows overall). Overall, 13 of the 43 classes are covered by more than 10,000 tables each; another 7 classes are covered by more than 1,000 tables each.

The Web Data Commons project regularly extracts schema.org annotations from the Common Crawl, a large public web corpus, and offers the extracted data for public download in the form of RDF dumps. The WDC Schema.org Table Corpus was generated by grouping the extracted RDF data from the December 2020 version of the WDC schema.org data sets into relational tables. A single table contains all entities of a specific schema.org class that have been extracted from a specific host (website).

The overall size of all tables in zipped form is ~50 GB. For download, we offer three separate files (in JSON format) for each of the 43 Schema.org classes: one containing the 100 largest tables, one containing the remaining tables with at least 3 rows, and one containing any smaller remaining tables. The table downloads are accompanied by easy-to-view samples and further files containing in-depth profiling statistics about the tables.

For more information about the corpus, and to download it, please visit

http://webdatacommons.org/structureddata/schemaorgtables/

We plan to use the corpus ourselves for experimenting with supervised and self-supervised data integration methods. Beyond this, we hope that the corpus will also prove useful to other people and their tasks, as tabular data may be easier for them to process than the RDF dumps that the Web Data Commons project traditionally publishes.

So have fun with the new corpus!

Ralph Peeters and Chris Bizer

--
Christian Bizer
Data and Web Science Group
University of Mannheim, Germany
http://dws.informatik.uni-mannheim.de/bizer
Received on Monday, 29 March 2021 12:44:50 UTC