Re: Chartering work has started for a Linked Data Signature Working Group @W3C from Dan Brickley on 2021-05-12 (semantic-web@w3.org from May 2021)

From: Dan Brickley <danbri@google.com>
Date: Wed, 12 May 2021 16:49:13 +0100
To: Manu Sporny <msporny@digitalbazaar.com>
Cc: Phil Archer <phil.archer@gs1.org>, Ivan Herman <ivan@w3.org>, Aidan Hogan <aidhog@gmail.com>, Markus Sabadello <markus@danubetech.com>, Pierre-Antoine Champin <pierre-antoine@w3.org>, Wendy Seltzer <wseltzer@w3.org>, semantic-web <semantic-web@w3.org>
Message-ID: <CAK-qy=4hNDk9XKXbsOCxfmog-tH5kjkdyKEn7ZmmLZtO85TvNQ@mail.gmail.com>
Ok, let's try to cut my piece of this down. Speaking to eric@w3.org I
think we're on a similar page.

On Wed, 12 May 2021 at 04:02, Manu Sporny <msporny@digitalbazaar.com> wrote:

> Out of scope:
>
> * "Truth" should be out of scope, we're just trying to canonicalize a
> bunch of quads and digitally sign them, not determine if the statements
> truth can be evaluated as that can easily devolve into subjective truth
> vs. objective truth. We were very careful to avoid that tarpit in the
> W3C Verifiable Credentials WG.
>

The tarpit concern was why I was concerned about some of the grand language
initially floating around in the charter and nearby. A bunch of that has
been improved already.

Looking at https://w3c-ccg.github.io/ld-proofs/

"This specification describes a mechanism for ensuring the authenticity and
integrity of Linked Data documents using mathematical proofs."

... puts the entire weight of *ensuring* authenticity and integrity of RDF
aka Linked Data documents on this WG's TODO list.

The draft charter is more specific than that and less ambitious in tone
now. I understand that there are specific modest technical readings of both
'authenticity' and 'integrity', but they also have much broader and vaguer
everyday readings.


Looking at the four primary deliverables:

1) RDF Dataset Canonicalization

This is worth writing down at W3C, and as a REC, sure. It does not seem to
depend upon 2), 3) or 4), which is good. The charter should make clear that
this can proceed immediately and document any risk of slowdown from the
other specs in this WG.

It may help with canonicalization to give it a name and be clear that it is
one of potentially many "canonicalizations" that could be usefully applied
to RDF data, e.g. if you believe Dublin Core that
http://purl.org/dc/elements/1.1/title is equivalent to
http://purl.org/dc/terms/title; etc - but that is a slippery slope.

We should be clear that there are other circumstances where different forms
of canonicalization may be appropriate (e.g. as preprocessing), and that
this WG deliverable should not bear the burden of covering every form of
semantic "equivalence" amongst RDF graphs.
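As a rough illustration of what "one of many canonicalizations" means, here is a toy sketch. It is not the algorithm the WG would standardize: it assumes the data has no blank nodes (which is where the real difficulty lies) and that each statement is already a single, consistently written N-Quads line. The function name is made up for this example.

```python
def canonicalize_ground_quads(nquads: str) -> str:
    """Toy canonical form: unique, sorted N-Quads lines.

    Assumption: no blank nodes, one statement per line.
    """
    # Drop blank lines and duplicates, then sort the remaining statements.
    statements = {line.strip() for line in nquads.splitlines() if line.strip()}
    return "".join(s + "\n" for s in sorted(statements))


DOC = "<http://example.org/doc>"
terms_title = f'{DOC} <http://purl.org/dc/terms/title> "Hello" .'
terms_date = f'{DOC} <http://purl.org/dc/terms/date> "2021-05-12" .'
elements_title = f'{DOC} <http://purl.org/dc/elements/1.1/title> "Hello" .'

a = canonicalize_ground_quads(terms_title + "\n" + terms_date + "\n")
b = canonicalize_ground_quads(terms_date + "\n\n" + terms_title + "\n")
c = canonicalize_ground_quads(elements_title + "\n" + terms_date + "\n")

print(a == b)  # True: statement order and blank lines do not matter
print(a == c)  # False: the two Dublin Core title properties stay distinct
```

The second comparison is the slippery slope: a purely syntactic canonical form knows nothing about vocabulary-level equivalences, and that kind of mapping would have to happen elsewhere (e.g. as preprocessing).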

2) RDF Dataset Hash

Seems only to depend upon (1). This is good.

Probably worth being explicit that the sense of hashing doesn't have a goal
like "semantic hashing" from e.g.
https://www.cs.cmu.edu/~rsalakhu/papers/sdarticle.pdf where similar inputs
get similar hashes. Interesting, potentially useful, but out of scope.
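A small sketch of the distinction, assuming the input is already in canonical form (i.e. the output of deliverable 1) and assuming SHA-256 purely for illustration; the choice of hash function and the helper names are mine, not something taken from the draft charter.

```python
import hashlib


def dataset_hash(canonical_nquads: str) -> str:
    """Hash an already-canonicalized N-Quads string (assumed input)."""
    return hashlib.sha256(canonical_nquads.encode("utf-8")).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions at which two hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Two datasets that differ by a single character in one literal.
one = '<http://example.org/doc> <http://purl.org/dc/terms/title> "Hello" .\n'
two = '<http://example.org/doc> <http://purl.org/dc/terms/title> "Hellp" .\n'

h1, h2 = dataset_hash(one), dataset_hash(two)
print(h1)
print(h2)
print(differing_bits(h1, h2), "of 256 bits differ")
```

With a cryptographic hash the two digests are unrelated even though the inputs are nearly identical, which is the opposite of what "semantic hashing" aims for.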

3) Linked Data Integrity (LDI)

Most of the chartering complexity lives here. If this only depends upon 2)
rather than 1) that feels healthy - is that a correct reading?

> * "Can I trust that the RDF ontology used to digitally sign the triples
> at the time was the same ontology that I'm using" is absolutely out of
> scope in this first iteration. It's an important question, but you can
> address many use cases with constrained and well known/stable ontologies.
>

I agree. Although knowing how to hash/sign/etc a bundle of cached pieces
would be a nice addition someday.

You can also address a bunch of use cases by even just hashing source
documents as documents (e.g. HTML+RDFa); it would be good to acknowledge
that using the RDF graph/dataset layer is not mandatory for all cases when
RDF content is to be signed. Signing the source bits can be perfectly
respectable, and the graph/dataset canonicalization spec can also have
other uses beyond signature.
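A minimal sketch of the "sign the source bits" option, with a made-up RDFa snippet and SHA-256 chosen only for illustration. Nothing here touches the RDF graph/dataset layer at all.

```python
import hashlib


def document_digest(raw: bytes) -> str:
    """Digest of the document exactly as published, byte for byte."""
    return hashlib.sha256(raw).hexdigest()


# Hypothetical HTML+RDFa fragment, used only for this example.
page = (
    b'<p vocab="http://schema.org/" typeof="Person">'
    b'<span property="name">Dan</span></p>\n'
)
# Same markup with different whitespace: same RDFa claims, different bytes.
reindented = page.replace(b"<span", b"\n  <span")

published = document_digest(page)
print(document_digest(page) == published)        # True
print(document_digest(reindented) == published)  # False
```

The trade-off is visible in the last line: byte-level hashing is simple and needs no canonicalization, but any re-serialization of the document breaks verification even when the extracted triples would be unchanged.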

This is also the deliverable most likely to attract attention from other
areas around W3C. Whether it's of the "what are those crazy semantic web
people up to?" or "there's some turf grabbing going on, because this
technology could be applied beyond core RDF usecases" flavours, it's the
place where things are going to be hardest to predict.

If you really just want the group to make RECs of the proposed inputs, I'd
suggest that the usecases prepared pre-WG get an equal amount of early
attention. Currently it looks like the deliverables have already been drafted,
and the usecases will be assembled later to justify them. This could come
off as doing things backwards.


4) similar concerns to 3); again other parts of W3C are likely to say
"that's not particularly just an RDF thing to be doing" when it comes to
the registry aspects.



> * OWL reasoning is out of scope. If folks want to kick off a WG to
> contemplate the ramifications of this work on OWL reasoners, great...
> but in a later group as that's a higher-order class of problem than the
> simpler, lower-level stuff the LDS WG charter is proposing.
>
> Dan, it feels like many of your concerns can be addressed by "Out of
> Scope" statements. It would be easier to understand what you wanted if
> you were to make some simple statements of the following form: "I'm
> concerned that X is going to derail the group; let's put X out of
> scope". It would be easier to analyze and process those sorts of
> statements.
>

Yeah,


> > If t-23236 says (of whatever entity / URI) "trueUntil": "Thursday",
> > ... "foo": "bar", or "pa12bg12f1g12c2": "FALSE", ... their ability to
> >  pollute the rest of the graph or make it unclear whether an asserter
> > of the graph has really asserted t-1, t-2, t-3, ...
>
> This is confusing to me -- isn't this a solved problem? This is why we
> created W3C Verifiable Credentials... so you can easily understand which
> entity said what, when they said it, the thing they said it about, and
> that you can draw a neat line around all of those things... all so you
> can avoid the graph pollution you refer to above. What am I missing?
>

It's about mixed expectations. If you sign the 100s of millions of triples
from Wikidata, or perhaps some subset, is Verifiable Credentials the right
technology? Perhaps.

If we were to say for example:

Deliverables 1) and 2) are straightforward with no complexities or
dependencies, they can just go ahead.

Deliverables 3) and 4) build upon these but have more ambition towards
being used in many important applications with complex requirements, ...

...and that therefore to make sure something useful is done, 3) and 4) more
explicitly prioritize Verifiable Credentials as their driving usecase.



>
> > The drafting around this WG seems to lean towards JSON-LD, where
> > there is some perceived ambivalence towards aspects of RDF (hi
> > Manu!:)
>
> Hi. :)
>
> Yes, there are some parts of RDF that we shouldn't be ambivalent
> towards, but should put out of scope so that the WG is tightly scoped so
> we can focus on the first couple of steps instead of it turning into a
> large expedition.
>
> > This is a legitimate point of view. JSON-LD is defined by its W3C
> > specifications and to some extent by the pragmatics of how it is
> > actually used, rather than the aggregate of the opinions of its
> > creators and spec editors. But it shines a light on whether this WG
> > is on board with what W3C claims RDF data structures mean, when
> > considered to be sets of statements about the world.
>
> I'm not sure anyone here could articulate what the W3C RDF specs mean
> because there is 25+ years of history here... there are many opinions
> and I don't think that discussion helps us get to a more focused charter.
>
> Putting things out of scope do... can we focus on that?
>
>
Let's try - possible text?

"Linked Data (RDF) is commonly understood to encode descriptions of real
world objects and claims about their properties and relationships.

Determining exactly how this works is explicitly and very much out of scope
of the WG.

Some RDF descriptions depend on "reference by description" conventions,
e.g. saying in markup "the Country whose name is France". Others use URIs
directly such as http://dbpedia.org/resource/France or
https://www.wikidata.org/entity/Q142. Some of these identifiers use 'http:'
URIs, some use 'https:' URIs, other URI schemes are sometimes encountered.
Some users of RDF are aware of, or rely upon, the fact that systems can
derive additional claims implied by instance data, by using content from
schemas and ontologies. Some RDF descriptions are written in self-contained
formats (e.g. N-Triples, RDF/XML, Turtle); others are written in formats
that depend on out of band material (e.g. JSON-LD contexts); in the latter
case, the RDF graph representation of content can vary even when the
instance data is untouched. RDF can also be written in forms in which
human-facing content and machine-oriented content are interwoven, but not
compelled to express the same claim (e.g. RDFa, Microdata). There is also
relatively little consensus amongst RDF applications about the conventions
best used for choosing named graph URIs associated with each triple in the
quads constituting a Dataset.

These complexities are real, and affect the environment around signed RDF
content, but are not the immediate priority of this WG. The approach taken by
this WG is that its minimalistic deliverables should provide a foundation
of technology components and tools which can over time incrementally
address more challenging usecases. While WG members are encouraged to
consider the broader ecosystem in their designs (e.g. including hooks for
future extensions), the chartered work addresses real usecases, even if
other more challenging applications will need additional specifications,
conventions or best practice guidance."
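To make the point about out-of-band material concrete, here is a deliberately tiny sketch. It is not a JSON-LD processor and does not follow the JSON-LD algorithms; `toy_triples` and both contexts are invented for this example, and the "context" is reduced to a plain term-to-IRI mapping.

```python
def toy_triples(doc: dict, context: dict) -> set:
    """Map each non-keyword key to an IRI via the context (toy model only)."""
    subject = doc["@id"]
    return {
        (subject, context[key], value)
        for key, value in doc.items()
        if not key.startswith("@")
    }


# The instance data stays byte-for-byte the same...
instance = {"@id": "http://example.org/doc", "title": "Hello"}

# ...while the context it points at changes out of band.
context_v1 = {"title": "http://purl.org/dc/elements/1.1/title"}
context_v2 = {"title": "http://purl.org/dc/terms/title"}

before = toy_triples(instance, context_v1)
after = toy_triples(instance, context_v2)
print(before == after)  # False: different graph from untouched instance data
```

Any signature computed over the graph/dataset form would therefore depend on which version of the context was in effect at signing time.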

I'm not going to reply to every point below, although the graphs vs
datasets aspect is worth revisiting later.

Maybe it helps to pack a bunch of out-of-scope into a paragraph describing
the larger surrounding baggage, rather than as a bulleted list?

> "RDF Graphs" -- those are not what this group is focusing on, they
> create all sorts of provenance issues with the signed information...
> this is why we pushed hard for RDF Datasets back in the day... we're
> focusing on canonicalizing and generating proofs (e.g., digital
> signatures) for RDF Datasets.
>

That is something I am not getting so much from the charter, or from
talking to Ivan, Eric et al.


>
> Dan, at this point I have no idea if the above is helping or muddying.
> What I'd like from you is some sort of simple list of things that you
> think could derail the work (or take a ton of time). We could then
> easily mark each as in scope or out of scope (and then document that in
> the charter or the explainer).
>

I tried!

Short version: make sure (1) and (2) can happen with minimal coupling to
(3) and (4), and tone down any grandiosity in the language so that "this is
a step towards" is the tone, rather than "this will ensure...".

Dan



> -- manu
>
> --
> Manu Sporny (skype: msporny, twitter: manusporny)
> Founder/CEO - Digital Bazaar, Inc.
> blog: Veres One Decentralized Identifier Blockchain Launches
> https://tinyurl.com/veres-one-launches
>
Received on Wednesday, 12 May 2021 15:51:08 UTC