- From: Alexander Brinkmann <alexander.brinkmann@uni-mannheim.de>
- Date: Mon, 13 Jan 2025 09:36:58 +0000
- To: "public-schemaorg@w3.org" <public-schemaorg@w3.org>, "semantic-web@w3.org" <semantic-web@w3.org>
- Message-ID: <32a05f4a271443b28344d8b0565c26e1@uni-mannheim.de>
Hi all,

We are happy to announce the new release of the WebDataCommons Microdata, JSON-LD, RDFa and Microformat data corpus.

The data has been extracted from the October 2024 version of the Common Crawl, covering 2.4 billion HTML pages which originate from 37.4 million websites (pay-level domains).

In summary, we found structured data within 1.3 billion HTML pages out of the 2.4 billion pages in the crawl (51.25%). These pages originate from 16.5 million different pay-level domains out of the 37.4 million pay-level domains covered by the crawl (44.12%). Altogether, the extracted data sets consist of 74 billion RDF quads.

Approximately 11.5 million websites provide structured data using the JSON-LD syntax, 7.6 million websites use the Microdata markup format to annotate structured data within their pages, and 400 thousand websites were found to use the RDFa markup format.

Statistics about the October 2024 Release:

Basic statistics about the October 2024 Microdata, JSON-LD, RDFa, and Microformat data sets as well as the vocabularies that are used along with each markup format are found at:

https://webdatacommons.org/structureddata/2024-12/stats/stats.html

Adoption of the Different Markup Formats

The WebDataCommons project has been extracting structured data from the Common Crawl yearly since 2010. The October 2024 release signifies 13 years of monitoring the adoption of structured data on the Web. This allows us to spot trends concerning the adoption of different markup formats as well as the usage of specific classes and properties, a short overview of which is provided on the page below:

https://webdatacommons.org/structureddata/#results

The first WDC release in 2010 revealed that only 5.7% of the examined web pages contained structured data. In 2024, we found structured data within 51.25% of the examined web pages, indicating a huge growth in adoption over the last decade.

JSON-LD and Microdata are the most widely used markup formats.
Although there has been no substantial growth in the number of pay-level domains (PLDs) using Microdata, the number of PLDs using JSON-LD is steadily increasing. By 2024, JSON-LD and Microdata dominate over RDFa and Microformats. JSON-LD is the most widely adopted markup format for structured data annotation, used by 70% of websites that annotate structured data. In comparison, Microdata is used by 46% of websites, while RDFa and Microformats (hCard) are used by only 3% and 23% of websites, respectively.

The analysis of the richness of Microdata and JSON-LD annotations, measured by the average number of triples per webpage, shows an upward trend over the years. In 2010, an average of 21 Microdata triples were extracted from each webpage. By 2024, this number had increased to 38. JSON-LD annotations provide even more detailed information than Microdata annotations, with the average number of triples per webpage continuously increasing from 10 in 2015 to 57 in 2024.

Adoption of the Schema.org Vocabulary

The schema.org vocabulary remains the most popular in the context of Microdata and JSON-LD. It is used for annotating navigation elements within webpages, using classes such as WebPage, SearchAction and BreadcrumbList, as well as page content, using classes such as Product, LocalBusiness, and JobPosting.

We observe a rapidly increasing adoption of several content classes: since 2017, the number of websites providing Product annotations rose from 581K to 3.3M (570% growth), and the number annotating LocalBusiness entities increased from 231K to 1.5M (649% growth). The adoption of the JobPosting class surged from 7K to 63K websites (900% growth).

Finally, we observe that an increasing number of websites explicitly annotate entity identifiers, such as product identifiers, as well as other identifying attributes such as telephone numbers or geo coordinates for local businesses.
Schema.org provides different terms for annotating different types of product identifiers, with schema:Product/sku being the most popular among them. Over the past five years, the relative adoption of the schema:Product/sku property has increased from 21% to 60%. The property schema:LocalBusiness/telephone has also seen an increase over the last five years, from 64% to 77%. This supports our previous observation on the increasing richness of the annotations.

Download all Data (N-QUADS)

The overall size of the October 2024 RDFa, Microdata, Embedded JSON-LD and Microformat data sets is 74 billion RDF quads. For download, we split the data into 13,395 files with a total size of 1.4 TB.

http://webdatacommons.org/structureddata/2024-12/stats/how_to_get_the_data.html

Download Schema.org Subset (N-QUADS)

We have also created class-specific subsets for 50 popular schema.org classes such as product, local business, event, and job posting in order to support the focused download of specific types of data.

http://webdatacommons.org/structureddata/2024-12/stats/schema_org_subsets.html

Many thanks to:

+ The Common Crawl project for providing their great web crawl and thus enabling the WebDataCommons project.
+ The Any23 project for providing and maintaining their great library of structured data parsers.

Have fun with the new data.

Cheers,

Alexander Brinkmann and Chris Bizer
Received on Monday, 13 January 2025 09:37:08 UTC