
a Schema Markup Validator: adopting SDTT for validator.schema.org

From: Dan Brickley <danbri@google.com>
Date: Tue, 15 Dec 2020 15:46:06 +0000
Message-ID: <CAK-qy=7iijqFX400R3u_YRYU5yzzAahv4VpLdeSzb+SQQ_G-Nw@mail.gmail.com>
To: "schema.org Mailing List" <public-schemaorg@w3.org>, Tom Marsh <tmarsh@exchange.microsoft.com>, Stéphane Corlosquet <scorlosquet@gmail.com>, Yuliya Tihohod <tilid@yandex-team.ru>, "R.V. Guha" <guha@google.com>, Nicolas Torzec <torzecn@oath.com>
Schema.org folks (steering group, community group, everyone...),

https://github.com/schemaorg/schemaorg/issues/2790 tracks a proposal for a
validator.schema.org tool, to be based on Google SDTT, and to be
accompanied by opensource collaboration on data shape validation and parser
interoperability.

Today my Google colleagues are sharing Google's plans for the future of the
Google Structured Data Testing Tool (SDTT) - see
https://developers.google.com/search/blog/2020/12/structured-data-testing-tool-update.
The intent is to rework it into a vendor-neutral tool that can continue to
serve as a markup syntax checker for JSON-LD, Microdata, RDFa as used by
the communities around Schema.org. Although it could live on its own
independent domain, it would make a great addition to the Schema.org site,
and I would like to proceed in that direction in 2021, as part of Google's
long term commitment to hosting the Schema.org site and keeping it relevant
for schema.org users.

The basic idea is that the service now known as "Google Structured Data
Testing Tool" would stop making Google-product-specific data checks, but
continue - as "Schema Markup Validator" - to serve as a robust tool for
checking JSON-LD, Microdata and RDFa schema markup. No validator (or
schema.org parser) is perfect, so part of this work will involve
documenting any shortcomings in the parsers/validators, and collaboration
with opensource implementers and standards makers towards improving the
ecosystem for everyone.

In addition to syntax validation, there is also the more futuristic topic
of "shape validation". For those unfamiliar with this distinction, syntax
validation is about helping publishers get the basic structure of JSON-LD,
Microdata, RDFa correct, whereas shape validation is about looking at the
extracted structured data and comparing it to the documented needs of
various online services, to see which features or tools it might be
eligible for. SDTT currently performs its own version of "shape checking"
to identify markup that matches the shapes needed by Google features, as
listed in https://developers.google.com/search/docs/guides/search-gallery.
However, the intent is to turn this functionality *off*, so that the testing
tool becomes a simpler vendor-neutral offering focussed on correctness of
markup *syntax*.
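
To make the distinction concrete, here is a minimal Python sketch of the two
kinds of check. It is an illustration only, not how SDTT works: the function
names and the EXAMPLE_RECIPE_SHAPE requirements are hypothetical, and the
syntax check covers only a small part of what a real JSON-LD parser verifies.

```python
import json

# Hypothetical shape: properties an imagined "recipe" feature would require.
# Illustrative only; not the actual requirements of any service.
EXAMPLE_RECIPE_SHAPE = {"type": "Recipe", "required": ["name", "image"]}


def check_syntax(text):
    """Syntax-level check: does the markup parse and look like JSON-LD?"""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as err:
        return None, [f"not valid JSON: {err.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not a JSON object"]
    errors = [f"missing {key}" for key in ("@context", "@type") if key not in data]
    return data, errors


def check_shape(data, shape):
    """Shape-level check: does the extracted data match a consumer's needs?"""
    if data.get("@type") != shape["type"]:
        return [f"expected @type {shape['type']}"]
    return [
        f"missing required property: {prop}"
        for prop in shape["required"]
        if prop not in data
    ]


markup = '{"@context": "https://schema.org", "@type": "Recipe", "name": "Soup"}'
data, syntax_errors = check_syntax(markup)
shape_errors = check_shape(data, EXAMPLE_RECIPE_SHAPE) if data is not None else []
print(syntax_errors)  # []
print(shape_errors)   # ['missing required property: image']
```

The markup passes the syntax check but fails the hypothetical shape, which is
the case a syntax-only validator would deliberately not report.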

In addition to adopting a "degooglified" SDTT as a syntax-level "Schema
Markup Validator", I would also like in 2021 to continue some collaboration
around shape validation. This is the idea of using relatively new web
standards (SHACL, ShEx) to check whether structured data matches specific
data patterns or "shapes". See https://en.wikipedia.org/wiki/SHACL and
https://en.wikipedia.org/wiki/ShEx, or the free online book "Validating
RDF Data", https://book.validatingrdf.com/. Google recently opensourced
some JavaScript software <https://github.com/google/schemarama/> in this
area, which brings together other opensource tooling to create a shape
validation system using both ShEx and SHACL. While it looks superficially
like SDTT, the focus is different: there is no syntax-level validation
(which is why the plan outlined above for SDTT is useful). Over time, we
can explore ways of integrating these different kinds of validation, but we
can make some very useful, simpler steps first by giving a reworked SDTT a
home under Schema.org.

I've linked some more detailed notes on SDTT from the issue at
https://github.com/schemaorg/schemaorg/issues/2790 - or see
https://docs.google.com/document/d/1q8z_rRJepiz4Os_KcEs3NaCVEm3US5l-qYL14JmE0To/edit#
directly. Feel free to follow up here, in GitHub or the doc, ...

cheers,

Dan
Received on Tuesday, 15 December 2020 15:46:40 UTC
