Re: Chartering work has started for a Linked Data Signature Working Group @W3C from Gregg Kellogg on 2021-05-01 (semantic-web@w3.org from May 2021)

From: Gregg Kellogg <gregg@greggkellogg.net>
Date: Sat, 1 May 2021 11:02:27 -0700
To: Dan Brickley <danbri@danbri.org>
Cc: Ivan Herman <ivan@w3.org>, Ramanathan Guha <guha@google.com>, Dan Brickley <danbri@google.com>, semantic-web@w3.org
Message-Id: <030B89FC-23D3-4186-9BA4-BADC67EDD647@greggkellogg.net>
> On May 1, 2021, at 3:27 AM, Dan Brickley <danbri@danbri.org> wrote:
> 
> 
> I have concerns. If I had had more time I would have written a shorter email.
> 
> 
> 
> Starting from the top -
> 
> Is “Linked Data” in the group name serving as a synonym for RDF? 

Speaking for myself: I think that “Linked Data” became a proxy for the RDF data model in some cases. The narratives around “RDF”, “Linked Data” and “Semantic Web” have evolved. I consider “Semantic Web” to be largely aspirational, and largely abandoned by most current work. “Linked Data” should describe the eco-system of standards for describing how separate documents can either refer to each other, or talk about common resources.

RDF is the core data model / abstract syntax, and when we’re talking about a concrete or abstract dataset it is the most appropriate term.

> Are there in-scope usecases for non-RDF content? eg property graphs? RIF? Microformats? Plain XML, JSON?
> 
> Does saying “Linked Data” exclude any RDF practices deemed insufficiently “Linked”?
> 
> The charter cites 
> http://webdatacommons.org/structureddata/#toc3 in support of the vague/ambiguous claim that “The deployment of Linked Data <https://www.w3.org/standards/semanticweb/data> is increasing at a rapid pace <http://webdatacommons.org/structureddata/#toc3>”, yet the citation points to a document focussed on approaches which in various ways go against “Linked Data” orthodoxy, narrowly conceived. 
> 
> The webdatacommons report covers Microdata, RDFa, JSON-LD, and even Microformats; the latter effort has long distanced itself from RDF, Linked Data and so on. The others, as published in the public Web, are very commonly found embedded in containing documents (or even injected via Javascript into a running webplatform document object), and being used as standalone bnode-heavy descriptions rather than fragmentary pieces of hypertext RDF.

I would say that some involved have tried to distance themselves from RDF; I, for one, have not. I see them all as being concrete RDF syntaxes that may have some use in and of themselves (Microdata only in terms of the Microdata to RDF note, of course). Microformats are RDF only insofar as any structured format can be interpreted as RDF, if you try hard enough.

The heavy use of bnodes may prevent it from aligning with “Linked Data”, but it can still be RDF, and RDF-level specs should work well with it. But, that’s an ontological discussion.
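
To make the bnode point concrete, here is a minimal sketch of why bnode-heavy data is the hard case for an RDF-level canonicalization spec. It assumes a naive "sort the N-Triples lines and hash them" approach, which is only an illustration and not the algorithm under discussion for the WG; the helper name `naive_digest` and the example.org IRIs are made up.

```python
import hashlib


def naive_digest(ntriples: str) -> str:
    """Hash the sorted, non-empty lines of an N-Triples document (hypothetical helper)."""
    lines = sorted(line.strip() for line in ntriples.splitlines() if line.strip())
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


# The same graph twice, with every node named by an IRI; only statement order differs.
ground_a = """
<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> .
"""
ground_b = """
<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
"""

# The same graph shape twice, but described with blank nodes whose labels differ.
bnode_a = """
_:b0 <http://xmlns.com/foaf/0.1/name> "Alice" .
_:b0 <http://xmlns.com/foaf/0.1/knows> _:b1 .
"""
bnode_b = """
_:x <http://xmlns.com/foaf/0.1/name> "Alice" .
_:x <http://xmlns.com/foaf/0.1/knows> _:y .
"""

# Sorting is enough to make the IRI-only graphs agree...
ground_match = naive_digest(ground_a) == naive_digest(ground_b)
# ...but not the bnode graphs, because bnode labels are arbitrary and carry no meaning.
bnode_match = naive_digest(bnode_a) == naive_digest(bnode_b)

print("IRI-only graphs match:", ground_match)
print("bnode graphs match:", bnode_match)
```

The two bnode documents describe the same graph, so a usable canonicalization has to assign bnode labels deterministically before hashing; that relabeling step is what the naive approach lacks.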

> A particular problem with calling the group “Linked Data” is the expectation that the various (and contested) publishing practices associated with the Linked Data slogan will get tangled up in the technical work. 

+1

> For example, the Linked Data community emphasises public data, often but not always “Linked Open Data”, and has a strong bias towards RDF being published in a form such that all mentioned entities are described with a URI. It also has a bias toward those URIs being http(s)-dereferencable, with the resulting document containing additional RDF statements pertaining directly or indirectly to the entity the URI is considered to identify. Arcane rules regarding http redirect codes and the use of #-based identifiers for non-webplatform entities are also an important element of the post-2006 Linked Data tradition. 
> 
> By proposing to name the group “Linked Data” W3C risks embedding these contested design preferences in the technical work, while justifying the WG as impactful using the large scale adoption of practices based on json-ld, microdata, rdfa which actively make different design choices from those implicitly endorsed by this naming choice.

I tend to agree; “Linked Data Security” doesn’t really seem to have anything to do with the “linked” part, and is more a term intended for a target audience which is not entirely comfortable with the “RDF” term. But words have meaning, and IMHO we should embrace the data model.

> Specifically, Schema.org using these formats is on millions of sites (eg report led by webdatacommons), in large part by making the explicit choice to make things easier for publishers, e.g. by allowing them to write markup meaning roughly “the Country whose name is Paris” rather than following 
> Linked Data supposed best practice of simply using a well known URI for the entity, such as 
> http://dbpedia.org/resource/Paris (which would involve publishers finding out the most currently fashionable URI for every entity they mention). Signing data that mostly consists of dangling references to files on other people’s websites may be a solved mathematical problem, but it is new territory in social, policy, workflow, ecosystem and other ways. If W3C values such an endeavour it should be realistic in terms of staff resources assigned, and timelines. This is not a “quick win” project.
> 
> 
> The chartering issue is that “Linked Data” is a broad marketing euphemism for RDF that emphasises some but not all of its strengths, such as the ease of data merging across loosely coupled systems. But it is not a technical term or a W3C standard as such.

+1

> If this is effectively an RDF canonicalization WG there are other issues to discuss, such as its impact on expectations around schema evolution, linking, and security. 
> 
> Without being exhaustive, ...
> 
> Would it apply to schemas published at http: URIs or only https: URIs? 
> 
> Are we convinced that there is application-level value in having assurances over instance data without also having them for the schemas and ontologies they are underpinned by? 
> 
> Is there an expectation that schema/ontology publishing practice would need to change to accommodate these scenarios? 
> 
> Would schema-publishing organizations like Dublin Core, Schema.org, Wikidata, DBpedia, be expected to publish a JSON-LD (1.0? 1.1?) context file? What change management, versioning, etc practices would be required? Would special new schemas be needed instead?
> 
> For eg. if instance data created in 2019 uses a schema ex:Foo type last updated in 2021, but which has since 2018 contained an assertion of owl:equivalentClass to ex2:Bar, and an rdfs:subClassOf ex3:Xyz, are changes to the definitions of these supposed to be relevant to the trustability of the instance data? If so, why does 
> https://w3c.github.io/lds-wg-charter/index.html not discuss the role of schema/ontology definitions in all this? 

Signing a dataset and not the data inferred by that dataset?
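
A rough sketch of that distinction, using the ex:Foo / ex3:Xyz names from the question above. It assumes a toy digest over the asserted triples and a toy rdfs:subClassOf rule; neither is a real signature suite or a full reasoner, and the "2021" schema edit (`ex3:Other`) is invented purely for illustration.

```python
import hashlib

RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def digest(triples):
    """Toy digest: hash the sorted, space-joined triples (not a real signature suite)."""
    lines = sorted(" ".join(triple) for triple in triples)
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def infer_types(instance, schema):
    """Add the rdf:type triples implied by rdfs:subClassOf, repeating until nothing changes."""
    closure = set(instance)
    changed = True
    while changed:
        changed = False
        for subject, predicate, obj in list(closure):
            if predicate != RDF_TYPE:
                continue
            for sub_class, schema_predicate, super_class in schema:
                implied = (subject, RDF_TYPE, super_class)
                if schema_predicate == SUBCLASS_OF and sub_class == obj and implied not in closure:
                    closure.add(implied)
                    changed = True
    return closure


# Instance data as it was published and signed.
instance = {("ex:thing1", RDF_TYPE, "ex:Foo")}

# Two states of the schema the instance data depends on (the second is hypothetical).
schema_2019 = {("ex:Foo", SUBCLASS_OF, "ex3:Xyz")}
schema_2021 = {("ex:Foo", SUBCLASS_OF, "ex3:Other")}

signed_digest = digest(instance)
inferred_2019 = infer_types(instance, schema_2019)
inferred_2021 = infer_types(instance, schema_2021)

# The digest covers only the asserted triples, so a schema edit does not touch it...
print("asserted digest unchanged:", digest(instance) == signed_digest)
# ...while the entailed triples, which the digest never covered, do change.
print("inferences unchanged:", inferred_2019 == inferred_2021)
```

Under these assumptions a signature over the dataset says nothing about what can be inferred from it later; covering the inferences would mean signing, or at least pinning, the schema as well.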

> For concrete example of why 24 months looks ambitious:
> 
> The examples in 
> https://w3c-ccg.github.io/security-vocab/ 
> {
>   "@context": ["https://w3id.org/security/v1",
>                "http://json-ld.org/contexts/person.jsonld"],
>   "@type": "Person",
>   "name": "Manu Sporny",
>   "homepage": "http://manu.sporny.org/",
>   "signature": {
>     "@type": "GraphSignature2012",
>     "creator": "http://manu.sporny.org/keys/5",
>     "signatureValue": "OGQzNGVkMzVmMmQ3ODIyOWM32MzQzNmExMgoYzI4ZDY3NjI4NTIyZTk="
>   }
> }
> 
> This uses the following json-ld context:
> 
> http://json-ld.org/contexts/person.jsonld
> 
> 
> ...which currently maps the term “Person” in the instance data to foaf:Person, which is a schema we have published in the FOAF project since ~ May 2000 or so, evolving the definition in place. We used to PGP sign the RDFS RDF/XML files btw; I am not entirely against signing and RDF! Nobody used it though.
> 
> From person.jsonld above,
> 
> {
>    "@context":
>    {
>       "Person": "http://xmlns.com/foaf/0.1/Person",...
> 
> The current English definition of foaf:Person says “ The Person <http://xmlns.com/foaf/spec/#term_Person> class represents people. Something is a Person <http://xmlns.com/foaf/spec/#term_Person> if it is a person. We don't nitpic about whether they're alive, dead, real, or imaginary”. 
> Its rdf/xml (“Linked Data”) definition says, amongst other things, that it is owl:equivalentClass to schema:Person. 

T-Box vs A-Box? The graph states things whose inferred meaning might change if dependent sources (vocabularies/contexts) change. That said, the issue is different with JSON-LD, because the truth of the dataset is dependent on how it is interpreted via a context.

See some unfinished work in JSON-LD on context integrity: https://github.com/w3c/json-ld-syntax/issues/108. That bears coming back to sometime.
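
As a sketch of the context dependence (not the JSON-LD expansion algorithm, and not what the issue above proposes): the same JSON body yields different IRIs under different contexts, and pinning a digest of the context is one conceivable way to notice a swap. The `toy_expand` and `context_digest` helpers, the `name` mappings, and the second context are all assumptions for illustration; only the Person-to-foaf:Person mapping comes from the quoted person.jsonld.

```python
import hashlib
import json

# The JSON body stays byte-for-byte the same in both scenarios.
document = {"@type": "Person", "name": "Manu Sporny"}

# Context as quoted in this thread (Person -> foaf:Person); the name mapping is assumed.
context_v1 = {
    "Person": "http://xmlns.com/foaf/0.1/Person",
    "name": "http://xmlns.com/foaf/0.1/name",
}
# Hypothetical later version of the same context file, silently switched to schema.org.
context_v2 = {
    "Person": "http://schema.org/Person",
    "name": "http://schema.org/name",
}


def toy_expand(doc, context):
    """Map terms to IRIs via a flat context (a toy stand-in for JSON-LD expansion)."""
    expanded = {}
    for key, value in doc.items():
        if key == "@type":
            expanded[key] = context.get(value, value)
        else:
            expanded[context.get(key, key)] = value
    return expanded


def context_digest(context):
    """Digest of a context, serialized with sorted keys so key order does not matter."""
    canonical = json.dumps(context, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Digest recorded by the publisher at signing time.
pinned_digest = context_digest(context_v1)

expanded_v1 = toy_expand(document, context_v1)
expanded_v2 = toy_expand(document, context_v2)
context_unchanged = context_digest(context_v2) == pinned_digest

print(expanded_v1["@type"])
print(expanded_v2["@type"])
print("context unchanged:", context_unchanged)
```

A consumer that checks the fetched context against a pinned digest could at least detect that the interpretation has moved, even though the signed JSON itself did not change.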

> Do we want a spec that cares about whether the context file is served over http? That cares if the dependency on FOAF is silently switched out, or whether the FOAF Person type’s “Linked Data” stated equivalence to 
> http://schema.org/Person gets updated, e.g. to use https://schema.org and/or to converge the written definitions which set the meaning of what it is to say that something is a foaf:Person or schema:Person. 
> 
> These are all fascinating issues but I would be astonished if the work gets done on the proposed schedule. The very idea of Linked Data puts these URI-facilitated connections between RDF graphs at its core. To omit discussion of their consequences in the charter is odd. For example, when is the “authenticity and integrity” of one serialized / published graph dependent on that of another that it mentions/references/uses?
> 
> I am not against this work, but the draft charter feels really off somehow.
> 
> RDF with lots of blank nodes is known to be a bit annoying to consume, but easier to publish. The general sections of the charter make sweeping and grand claims about the utility of the proposed standards, and justify that with phrases like “authenticity and integrity of the data”  and references to the adoption of json-ld, microdata and rdfa in public web content. 
> 
> The usecases most explicitly listed are however largely from a rather different perspective - a lot of blockchainy transactional scenarios, some frankly blueskies but intriguing:
> 
> “ For example, anchoring an RDF Dataset that expresses a land deed to a Distributed Ledger (aka blockchain) can establish a proof of existence in a way that does not depend on a single point of failure, such as a local government office“
> 
> ... which echoes TimBL’s old 
> https://www.w3.org/Talks/WWW94Tim/ 
> 
> I do not want to see a repeat of the JSON-LD 1.0 vs 1.1 debacle, in which the massive success of Schema.org’s use of JSON-LD 1.0 in the public Web was used to persuade the W3C AC to launch a Working Group focussed on just those aspects of the technology (contexts) which don’t work well for web scale search, and which didn’t address the needs of the project that had been used to justify the WG. As discussed elsewhere this week, that effort resulted in W3C marking as superseded/abandoned the very technology (JSON-LD 1.0) that we at Schema.org were proud to have helped to succeed, and which we now can’t even reliably cite as a stable web standard.
> 
> If this WG is addressing needs around RDF for blockchains, or supporting software to compare, check and maybe diff RDF graphs, the charter should be clearer about this limited scope. 
> 
> The charter opens as follows:
> 
> “ There are a variety of established use cases, such as Verifiable Credentials <https://www.w3.org/TR/vc-data-model>, the publication of biological and pharmaceutical data, consumption of mission critical RDF vocabularies, and others, that depend on the ability to verify the authenticity and integrity of the data being consumed (see the use cases <https://w3c.github.io/lds-wg-charter/explainer.html#usage> for more examples).”
> 
> Currently the charter only alludes wavily to a “variety of established use cases”, and cites its specific “use cases” for “more”. The established ones also should be explicitly listed and analyzed to make sure they also motivate the proposed specific technical agenda, which is highly focussed on technicalities around bnode-labeling in RDF data.
> 
> For each of these usecases we should ask, amongst other things, whether signing the raw bits might work, and if not, how much additional surrounding information is needed - eg base URI, referenced schemas/ontologies, json-ld contexts, GRDDL transforms; and whether the reference-tracing recurses or not. And why.
> 
> Sorry for the long note. I just don’t want to see another RIF-like 5 year slog happen because a cloud of similar ideas was mistaken for a shared standards-making agenda.

Good input, you have a great perspective. While I might take issue with the “debacle” of JSON-LD 1.1, the specific concerns about “superseded/abandoned” are being addressed, and are not only a JSON-LD 1.1 issue.

Specs change over time, and it doesn’t imply that everyone needs to use newly introduced features in any spec. But, as a consumer, you have reasonable expectations about what “profile” you will accept, and perhaps publishing something about the specific JSON-LD that you will consume would be useful. I wouldn’t be surprised if there are 1.0 features that you don’t take advantage of, either.

Gregg

> Cheers,
> 
> Dan
> 
> (Sent from my personal account but with a danbri@google.com hat on)
> 
> On Tue, 6 Apr 2021 at 11:26, Ivan Herman <ivan@w3.org> wrote:
> Dear all,
> 
> the W3C has started to work on a Working Group charter for Linked Data Signatures:
> 
>     https://w3c.github.io/lds-wg-charter/index.html
> 
> The work proposed in this Working Group includes Linked Data Canonicalization, as well as algorithms and vocabularies for encoding digital proofs, such as digital signatures, and with that secure information expressed in serializations such as JSON-LD, TriG, and N-Quads.
> 
> The need for Linked Data canonicalization, digest, or signature has been known for a very long time, but it is only in recent years that research and development has resulted in mathematical algorithms and related implementations that are on the maturity level for a Web Standard. A separate explainer document:
> 
>    https://w3c.github.io/lds-wg-charter/explainer.html
> 
> provides some background, as well as a small set of use cases.
> 
> The W3C Credentials Community Group[1,2] has been instrumental in the work leading to this charter proposal, not the least due to its work on Verifiable Credentials and with recent applications and development on, e.g., vaccination passports using those technologies.
> 
> It must be emphasized, however, that this work is not bound to a specific application area or serialization. There are numerous use cases in Linked Data, like the publication of biological and pharmaceutical data, consumption of mission critical RDF vocabularies, and others, that depend on the ability to verify the authenticity and integrity of the data being consumed. This Working Group aims at covering all those, and we hope to involve the Linked Data Community at large in the elaboration of the final charter proposal.
> 
> We welcome your general expressions of interest and support. If you wish to make your comments public, please use GitHub issues:
> 
>    https://github.com/w3c/lds-wg-charter/issues
> 
> A formal W3C Advisory Committee Review for this charter is expected in about six weeks.
> 
> [1] https://www.w3.org/community/credentials/
> [2] https://w3c-ccg.github.io/
> 
> 
> ----
> Ivan Herman, W3C 
> Home: http://www.w3.org/People/Ivan/
> mobile: +33 6 52 46 00 43
> ORCID ID: https://orcid.org/0000-0003-0782-2704
>
Received on Saturday, 1 May 2021 18:03:46 UTC