Re: Thoughts on the LDS WG chartering discussion from Eric Prud'hommeaux on 2021-06-11 (semantic-web@w3.org from June 2021)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Fri, 11 Jun 2021 15:30:18 +0200
To: Dan Brickley <danbri@danbri.org>
Cc: David Booth <david@dbooth.org>, semantic-web@w3.org
Message-ID: <20210611133018.GC8464@w3.org>

On Fri, Jun 11, 2021 at 10:08:56AM +0100, Dan Brickley wrote:
> (Sorry, this is long.)
> 
> On Fri, 11 Jun 2021 at 00:19, David Booth <david@dbooth.org> wrote:
> 
> > On 6/10/21 11:08 AM, Ivan Herman wrote:
> > >> On 10 Jun 2021, at 16:13, David Booth <david@dbooth.org> wrote:
> > >> I still feel like I am somehow missing a fundamental assumption that
> > >> others are making and I have not yet been able to identify.
> >
> 
> Know the feeling.
> 
> […]
> 
> > The other thing that I still fundamentally do not yet grasp about the
> > proposed charter is this: Why is it restricted to RDF source documents?
> >    Clearly the canonicalization algorithm is about RDF, so that much I
> > understand.  But for the digital signature vocabulary, why wouldn't it
> > also be useful to be able to sign, say, a PDF document?  Why should the
> > RDF signing vocabulary be limited to talking about RDF documents?  Or am
> > I misunderstanding the intent here?   Perhaps if there were a simple,
> > complete example, it would help.  Again, I feel like I am missing some
> > of the assumed context.
> >
> 
> 
> This is a very reasonable and pertinent question.
> 
> The vast vast majority of content on the web is not best considered as
> being purely RDF, even if you can often project out an RDF view of it. And
> the bits that are RDF-based rarely end up in anything most folks would
> consider an “RDF store”. But the distinction is a strange one, as explored
> below.
> 
> There is a *lot* of non-RDF data out there. I can hardly believe this needs
> emphasis in 2021 but here we are.
> 
> This content (or rather, the billions of web users touched by it) deserves
> ways of being assured by modern W3C-standard Signature.
> 
> We should not gloss over the scale of this. We are talking about petabytes
> of data here. Literally the entire contents of the web, for starters.
> 
> There are lots of other data-related formats out there in the web - CSS,
> CSV, SQL dumps, the email mbox format, iCalendar, vCARD, non-RDF property
> graphs,   MARC, YAML, XML itself of course, Protobufs, Apache Arrow,
> microformats, HTML5, HTML-anything, SGML, Midi files, MP3, WAV,
> Prolog/Datalog, OWL, N3, RIF, … package formats like ZIP, JAR, … image
> formats like PNG, JPG, GIF, … the SVG case is interesting (“this is
> definitely our logo”) but the others have embedded metadata too, EXIF not
> being RDF whilst XMP being RDF. Maybe OWL/RIF/N3 could use their “compiled
> down to triples” view, but should we extend that argument to everyone else?
> Rdf-star? What about SPARQL queries? Windows .ini files, …? PDFs, Flash,
> video file formats? In an age of misinformation facilitated by the web, it
> is not obvious that W3C’s next Data Signature WG should cover only “Linked”
> RDF data, and ignore media formats. What about robots.txt files? CBOR? The
> .sna format for snapshots of ZX Spectrum games? VMWare images? .iso disk
> images?
> 
> I could go on but
> https://en.m.wikipedia.org/wiki/List_of_file_formats exists. We didn’t even
> touch coding languages (JS, Java, JVM, WASM, GLSL, COBOL),  or notebook
> formats. Or .rtf files.
> 
> Even if you are solely concerned with some more restricted notion of
> “data”, still there is a lot out there, eg take a look around using
> https://datasetsearch.research.google.com/ to see what is showing up from
> research, science, govt etc.
> 
> Should protein databank files be RDFized before they fall in scope of this
> new WG's mission?
> https://en.wikipedia.org/wiki/Protein_Data_Bank_(file_format) - and if so,
> why?

RDF Signatures are for signing RDF structures. Without such a
mechanism, you have to sign the syntax of an RDF document, which means
you have to keep it around, serve it preferentially whenever anyone
asks for a particular graph. That's a biggish ask of a quad store. It
would also involve inventing some protocol to say "please dig up the
original serialization" and probably some other convention. In the
end, it would be brittle and most folks would consider it a crappy hack.
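
To make the brittleness concrete, here is a minimal sketch in Python. The
two documents and their IRIs are made up, and the "canonical form" is a toy
(unique, sorted N-Triples lines) that only works for graphs with no blank
nodes and identically spelled terms. It is not the canonicalization
algorithm the WG would define.

```python
import hashlib

# Two N-Triples serializations of the same two-triple graph (hypothetical data).
doc_a = (
    '<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .\n'
    '<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> .\n'
)
doc_b = (
    '<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> .\n'
    '<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .\n'
)


def syntax_hash(doc: str) -> str:
    """Hash the serialization bytes exactly as received."""
    return hashlib.md5(doc.encode("utf-8")).hexdigest()


def graph_hash(doc: str) -> str:
    """Hash a toy canonical form: the unique, sorted N-Triples lines.

    Assumption: no blank nodes, and every term is spelled the same way
    in each serialization. Real graphs need proper canonicalization.
    """
    lines = sorted({line.strip() for line in doc.splitlines() if line.strip()})
    return hashlib.md5("\n".join(lines).encode("utf-8")).hexdigest()


# Same graph, different bytes: the syntax hashes disagree,
# while the hashes of the toy canonical form agree.
syntax_differs = syntax_hash(doc_a) != syntax_hash(doc_b)
graph_matches = graph_hash(doc_a) == graph_hash(doc_b)
```

A store that only keeps the triples can reproduce the second hash but not
the first, which is the whole point.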

You can invent all sorts of conventions to use in those signatures;
VCs, payer-signed checks, doctor-signed clinical data, etc. One handy
one is just an assertion that the MD5 sum of some doc is X. So there's
your other formats use case. You could even have a camera or MRI sign
the images it took by sticking the signature in Exif.
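
A rough sketch of what such a hash assertion could look like, in Python.
The predicate IRI is invented for illustration (nobody has defined one in
this thread), and MD5 is used only because it is the example above; any
digest from hashlib would slot in the same way.

```python
import hashlib

# Hypothetical predicate IRI; no actual vocabulary term is implied.
MD5_PREDICATE = "http://example.org/vocab#md5"


def hash_assertion(doc_iri: str, content: bytes) -> str:
    """Return one N-Triples line asserting the MD5 sum of a document."""
    digest = hashlib.md5(content).hexdigest()
    return f'<{doc_iri}> <{MD5_PREDICATE}> "{digest}" .'


# The document itself can be any byte sequence: a PDF, an image, a CSV.
triple = hash_assertion("http://example.org/report.pdf", b"hello")
```

Signing a graph that contains that one triple vouches for the bytes of the
document without the signature machinery knowing anything about its format.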

But wait, you say, protein data sits in databases; how can I reliably
reproduce the same MD5? You can't. RDF Signatures, XML Signatures and
PGP can all sign some representation of a foreign structure, but
without structure-specific canonicalization, you can't confidently
reproduce that structure from a database. You can triplify it and
canonicalize/sign that; you can native-canonicalize it and sign its
MD5. You could, if you wanted to make it self-describing, add another
assertion identifying the C14N method used to do so.
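
Here is a minimal sketch of the native-canonicalize-and-hash option in
Python. Everything named here is an assumption for illustration: the record
fields, the two predicate IRIs, the method IRI, and the choice of "JSON with
sorted keys and no optional whitespace" as a stand-in for a real
structure-specific canonical form.

```python
import hashlib
import json

# Hypothetical IRIs, for illustration only.
MD5_PREDICATE = "http://example.org/vocab#md5"
C14N_PREDICATE = "http://example.org/vocab#c14nMethod"
C14N_METHOD = "http://example.org/c14n/sorted-compact-json"


def canonical_bytes(record: dict) -> bytes:
    """Toy canonical form: JSON with sorted keys and no optional whitespace."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def describe(record_iri: str, record: dict) -> list:
    """Return N-Triples lines giving the hash and the C14N method used."""
    digest = hashlib.md5(canonical_bytes(record)).hexdigest()
    return [
        f'<{record_iri}> <{MD5_PREDICATE}> "{digest}" .',
        f'<{record_iri}> <{C14N_PREDICATE}> <{C14N_METHOD}> .',
    ]


# The same record as exported once and as later rebuilt from a database,
# with the fields coming back in a different order.
row_from_export = {"id": "1ABC", "chain": "A", "residues": 129}
row_from_database = {"residues": 129, "id": "1ABC", "chain": "A"}

iri = "http://example.org/protein/1ABC"
same = describe(iri, row_from_export) == describe(iri, row_from_database)
```

The second triple is the self-describing part: a verifier reads which
method produced the hash before trying to reproduce it.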

You could instead do that with the existing XML DSig stack, but then
you'd throw away erasure and all the other stuff we like about RDF. If
any of us had to pick a language in which to record descriptions of
resources, we'd likely pick RDF.


> Potential huge scope established, what does W3C do in this area?

All we need is a way to sign assertions and probably a common
predicate for document hashes (e.g. MD5). Interested communities can
build on that depending on their canonicalization needs and
engineering choices between embedding (with lots of '\'s) or
referencing the hashed document.
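
To illustrate the embed-vs-reference choice, a small Python sketch. The
predicate IRIs and the sample CSV are invented; the escaping covers the four
characters that an N-Triples quoted literal cannot contain raw (backslash,
double quote, line feed, carriage return).

```python
import hashlib

# Hypothetical predicate IRIs, for illustration only.
CONTENT_PREDICATE = "http://example.org/vocab#content"
MD5_PREDICATE = "http://example.org/vocab#md5"


def escape_ntriples_literal(text: str) -> str:
    """Escape the characters N-Triples requires escaping in a quoted literal."""
    return (
        text.replace("\\", "\\\\")  # backslash first, so later escapes survive
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def embed(doc_iri: str, text: str) -> str:
    """Carry the whole document inside the signed graph as one literal."""
    return f'<{doc_iri}> <{CONTENT_PREDICATE}> "{escape_ntriples_literal(text)}" .'


def reference(doc_iri: str, text: str) -> str:
    """Carry only a hash; the document itself lives elsewhere."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return f'<{doc_iri}> <{MD5_PREDICATE}> "{digest}" .'


csv_doc = 'name,quote\nAlice,"hi"\n'
embedded = embed("http://example.org/data.csv", csv_doc)
referenced = reference("http://example.org/data.csv", csv_doc)
```

The embedded form grows with the document and picks up the backslashes; the
referenced form stays one short triple but needs the document to be
retrievable byte-for-byte.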


> Remember that RDF was introduced as a metadata system - its founding
> purpose was exactly to describe the sprawling chaos of the above file
> format diversity, and not to replace it all with triples.
> 
> I understand that XML Signature can/could detached sign any format, but
> that it may also be showing its age as a standard, having been created 21+
> years ago. Maybe it is time for it to be superseded by something more
> modern from W3C, with the option to drop-in format-specific
> canonicalization steps, such as bnode labelling etc for RDF? Why not create

Ugh, I guess I didn't have to write all that stuff above selling
format-specific C14N. Will learn to read first.

> *that* WG rather than this one, given W3C’s limited resources?

This feels like an XML vs. RDF debate and as such, has no end. I think
it suffices to say that a lot of people want to work in RDF. You could
say "what about a heterogeneous system", but then you end up inventing
an RDF representation of XML signatures, maybe using something
unpleasant like a triplification of the DOM, or maybe something that
works with XML DSig but needs extra definition for any payload you
shove into DSig. This again would be an unpleasant hack.


> https://www.w3.org/TR/xmldsig-core2/#sec-Introduction tells us,
> “ This document specifies XML syntax and processing rules for creating and
> representing digital signatures. XML Signatures can be applied to any digital
> content (data object) <https://www.w3.org/TR/xmldsig-core2/#def-DataObject>
> ,”
> 
> The last actual W3C REC says the same,
> https://www.w3.org/TR/xmldsig-core1/#sec-Introduction
> 
> So that’s a W3C recommended technology that W3C currently says is up to the
> job. It is old, its flaws are well known, it isn’t clear if it has been
> abandoned or just in maintenance mode, but it remains a recommended
> standard for now.
> 
> Please indulge a thought experiment.
> 
> For all the non-RDF formats I touched on above, should they (a) use XML
> detached signature (b) go through the lengthy and painful process of trying
> to create a rich RDFS/OWL-facilitated model of their content as an RDF
> graph, so they can sign using Linked Data Signature, or my suggestion
> below, (c) - stick it all in one triple in an easily round-trippable way.
> Let’s explore (c).
> 
> Apache HTTPd server logs could trivially be mapped into RDF’s data model
> too, as could anything. Would this be in scope of the new WG per its
> charter? (I am ignoring input docs for now, as Ivan has advised)
> 
>  Let’s define such a mapping from bytes to a 1 triple graph, call it the
> “Retro Graph Mapping” (RGM). It retrospectively maps any byte sequence into
> an RDF graph. It is similar in spirit to the idea of RDF graph literals,
> perhaps.
> 
> For any sequence of bytes ‘bs’, create a corresponding space-separated
> hex-encoded sequence of lowercase pairs of unicode characters. RGM graphs
> have one triple which varies only in its literal value content and
> datatype. For example:
> 
> <file:/dev/🦖/RGMv1> rdf:value “hex sequence here”^^wikidata:Q5153426 .

I object in the strongest possible terms to your attempt to emojify
all data. <emoticon elided/>


> Formal spec to follow but basic idea is a triple-ization in which subject
> and predicate URIs are fixed, no language tag, everything is in a single
> value, rdf: and wikidata: prefixes and 🦖 are used here for simple example;
> datatype is an optional format identifier that could be discarded. A
> similar approach could be used to generate multi-graph datasets. A
> discardable filename preference could be packed in to the file: URI too, if
> desired. For text-oriented content it might be worth considering a more
> readable representation than hex codes, but RGM is not designed for humans
> to read. RGM is not especially useful for data access, SPARQL etc., but it
> will make *any* kind of data signable by the work about to be chartered by
> W3C.
> 
> RGM can reflect any data format into a single-triple, very
> efficiently-sorted, bnode-free rdf graph. This is both stupid and
> powerful.
> 
> Is the coordinated, lead-the-web-to-its-full-potential W3C view here that
> simultaneously  the following are both true?:
> 
> 1.) XML Signature is good enough for all of the world’s files and data
> except for the RDF-graph cases.
> 2.) XML Signature is so inappropriate and/or outdated that it is barely
> mentioned for the case of signing RDF in the proposed charter and explainer.

We mustn't forget about JWS, DER-encoded ASN.1, and probably a litany
of format-specific forms of wrappers for cryptographic signatures.
Making developers change formats leads to lots of hacks. The simplest
thing to do is to define your own format-specific wrapper, à la RDF
signatures.


> If the truth is that XML Signature is a pain point for W3C in 2021 then the
> fact that it is about to spin up a WG that can do some of the same things
> deserves more attention than the zero mentions granted to the topic by the
> draft charter.
> 
> The draft Signed Linked Data WG explainer says *“roughly, the same approach
> as for XML [xmldsig-core1
> <https://w3c.github.io/lds-wg-charter/explainer.html#bib-xmldsig-core1>].”
> and yet the actual suggested charter does not mention XML let alone W3C’s
> huge piece of work in this area, XML Signature. Despite the fact that for
> any piece of data in the web, W3C offers both XML Signature and also (via
> RGM’s retro-graph mapping into a triple), Linked Data Signature as
> potentially relevant technologies.*

At some point, I suggested referencing XML DSig in the charter but
Ivan said it leads to lots of confusion so I dropped that PR. I think
people in the WG should understand XML DSig, but I'm ambivalent about
whether drawing DSig analogies helps or hinders the charter reviewer.


> *We know that all web content can be turned into a normalized triple
> via RGM as I sketch above. Or it could be signed with XML Signature.*
> 
> *For cases like CSV, YAML, SVG, is there *anything* to be gained in the
> Signature world from doing a more careful and fine-grained mapping into
> RDF, beyond just avoiding having to use 20-year old XML-flavoured signature
> technology? Is RGM too stupid to use, leaving those formats behind?*

RGM isn't very far from embedding a non-RDF document in an RDF
Signature, which you can totally do. The engineering choices around
that will probably focus on whether it's easier/lighter-weight to
reference it elsewhere or whether you gotta load the whole thing into
the sig.
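
For reference, the byte-to-hex step of RGM as described above (lowercase
hex pairs, space-separated) is only a few lines. This is a sketch of my
reading of that description, not a formal spec: the function names are
mine, and the triple is written with the subject, rdf:value predicate and
optional wikidata:Q5153426 datatype from the example, leaving the prefixes
undeclared as the example does.

```python
def rgm_encode(data: bytes) -> str:
    """Map bytes to space-separated, lowercase, two-character hex pairs."""
    return " ".join(f"{byte:02x}" for byte in data)


def rgm_decode(hex_sequence: str) -> bytes:
    """Invert rgm_encode: parse each whitespace-separated pair back to a byte."""
    return bytes(int(pair, 16) for pair in hex_sequence.split())


def rgm_triple(data: bytes, with_datatype: bool = True) -> str:
    """Build the single RGM triple; the datatype is optional and discardable."""
    datatype = "^^wikidata:Q5153426" if with_datatype else ""
    return f'<file:/dev/🦖/RGMv1> rdf:value "{rgm_encode(data)}"{datatype} .'


encoded = rgm_encode(b"Hi!")  # "48 69 21"
round_trip = rgm_decode(encoded)  # b"Hi!"
```

The hex literal contains no characters that need escaping, which is the one
practical advantage it has over embedding the text directly.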


> *Why should RDF content get modernized web-standard signature tech first?
> Why not make something modern for the content of the entire world-wide web
> and then plug in the bnode-labelling preprocessor for the RDF special case?*
> 
> *The fact that W3C proposes to make new REC-track work on RDF Signature,
> while simultaneously leaving its ancient XML Signature Recommendation
> roaming the earth like an undead dinosaur ought to ring alarm bells here.
> What are the prospects of this new RDF work being carefully maintained by
> W3C in 20 years? It feels like this essentially general purpose piece of
> new work is being put through as a Linked Data thing because when evaluated
> by the wider set of stakeholders it will attract more skepticism than
> enthusiasm.*

I don't see a conflict here. If I needed signatures in some XML stack,
I'd reach for DSig. It doesn't really *need* maintenance; it just
works.


> *It is always easier to create new things than to curate old messes, and it
> is always easier to scope things tightly than to risk a design by committee
> that nearly-kinda meets everyone’s goals. The idea of entangling this new
> set of work items with XML Signature ought to be slightly terrifying, but
> cross-domain standards coordination is W3C’s core duty and strength.*
> 
> *And yet, any/all web content can be trivially brought into scope of the
> new WG via RGM. Which puts us substantially in the same territory as that
> currently occupied by the existing XML Sig W3C REC.*
> 
> *I know it is annoying to introduce new terminology but I do so here in
> pursuit of consensus. Since any data can (via RGM or hard work
> ontologizing) be “linked data” sufficiently to be signable via the proposed
> new standards, we can ask ourselves whether custodians of data in non-RDF
> formats would gain anything by doing so. If they would, the WG scope should
> be admitted to be data-signing, not linked-data-signing. If not, I’d like

I whole-heartedly support that. I don't demand it because I can live
with the LD marketing, but if this WG is doing more than RDF
data-signing, I and others need to know that.


> to understand why. Of course I understand the general benefits of using RDF
> more wholeheartedly, but for the case of signing specifically the picture
> does not yet feel clear.*

Right now, my needs are pretty simple; receive a clinical record in
RDF, signed by the sender; record the signature and add the RDF to a
database. Already, the sender's life is harder if they have to have an
XML stack in addition to the RDF stack. My life is harder 'cause I
have to keep the original document, as well as sticking the queryable
triples in a data set. Everyone who wants to consume verified data
also has to have two stacks, as well as know to ask for the original
data rather than by a query like:
  GRAPH <doc> {?s ?p ?o}

Additionally, the consumer has to read the same format that the sender
submitted; no conneg.

It would only get more painful as use cases progressed, e.g. signing
diffs or regions of docs.

I don't think anyone told the authors of RFC7515 that they should use
XML and I don't think they'd have been dissuaded. (They probably would
have just shopped around for some other standards body.) Likewise, the
simplicity and reasonable elegance of RDF Signatures make them way more
attractive to use than XML DSig or JWS. I expect that will be true for
many folks. I think that's grounds for standards work.


> *Dan*
> 
> 
> 
> 
> > Thanks,
> > David Booth
> >
> >
Received on Friday, 11 June 2021 13:31:41 UTC