Re: Thoughts on the LDS WG chartering discussion

(Sorry, this is long.)

On Fri, 11 Jun 2021 at 00:19, David Booth <david@dbooth.org> wrote:

> On 6/10/21 11:08 AM, Ivan Herman wrote:
> >> On 10 Jun 2021, at 16:13, David Booth <david@dbooth.org> wrote:
> >> I still feel like I am somehow missing a fundamental assumption that
> >> others are making and I have not yet been able to identify.
>

Know the feeling.

[…]

> The other thing that I still fundamentally do not yet grasp about the
> proposed charter is this: Why is it restricted to RDF source documents?
>    Clearly the canonicalization algorithm is about RDF, so that much I
> understand.  But for the digital signature vocabulary, why wouldn't it
> also be useful to be able to sign, say, a PDF document?  Why should the
> RDF signing vocabulary be limited to talking about RDF documents?  Or am
> I misunderstanding the intent here?   Perhaps if there were a simple,
> complete example, it would help.  Again, I feel like I am missing some
> of the assumed context.
>


This is a very reasonable and pertinent question.

The vast vast majority of content on the web is not best considered as
being purely RDF, even if you can often project out an RDF view of it. And
the bits that are RDF-based rarely end up in anything most folks would
consider an “RDF store”. But the distinction is a strange one, as explored
below.

There is a *lot* of non-RDF data out there. I can hardly believe this needs
emphasis in 2021 but here we are.

This content (or rather, the billions of web users touched by it) deserves
a way of being assured by a modern W3C-standard signature technology.

We should not gloss over the scale of this. We are talking about petabytes
of data here. Literally the entire contents of the web, for starters.

There are lots of other data-related formats out there on the web: CSS,
CSV, SQL dumps, the email mbox format, iCalendar, vCard, non-RDF property
graphs, MARC, YAML, XML itself of course, Protobufs, Apache Arrow,
microformats, HTML5, HTML-anything, SGML, MIDI files, MP3, WAV,
Prolog/Datalog, OWL, N3, RIF, … package formats like ZIP, JAR, … image
formats like PNG, JPG, GIF, … the SVG case is interesting (“this is
definitely our logo”) but the others have embedded metadata too, EXIF not
being RDF whilst XMP is RDF. Maybe OWL/RIF/N3 could use their “compiled
down to triples” view, but should we extend that argument to everyone else?
RDF-star? What about SPARQL queries? Windows .ini files, …? PDFs, Flash,
video file formats? In an age of misinformation facilitated by the web, it
is not obvious that W3C’s next Data Signature WG should cover only “Linked”
RDF data, and ignore media formats. What about robots.txt files? CBOR? The
.sna format for snapshots of ZX Spectrum games? VMware images? .iso disk
images?

I could go on but
https://en.m.wikipedia.org/wiki/List_of_file_formats exists. We didn’t even
touch coding languages (JS, Java, JVM, WASM, GLSL, COBOL), or notebook
formats. Or .rtf files.

Even if you are solely concerned with some more restricted notion of
“data”, still there is a lot out there, eg take a look around using
https://datasetsearch.research.google.com/ to see what is showing up from
research, science, govt etc.

Should protein databank files be RDFized before they fall in scope of this
new WG’s mission?
https://en.wikipedia.org/wiki/Protein_Data_Bank_(file_format) - and if so,
why?





With the potentially huge scope established, what does W3C do in this area?

Remember that RDF was introduced as a metadata system - its founding
purpose was exactly to describe the sprawling chaos of the above file
format diversity, and not to replace it all with triples.

I understand that XML Signature can/could sign any format via a detached
signature, but that it may also be showing its age as a standard, having
been created 21+
years ago. Maybe it is time for it to be superseded by something more
modern from W3C, with the option to drop-in format-specific
canonicalization steps, such as bnode labelling etc for RDF? Why not create
*that* WG rather than this one, given W3C’s limited resources?
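
To make the "drop-in canonicalization" idea concrete, here is a minimal
sketch in Python. Everything in it is an assumption made for illustration:
the function names, the media-type registry, and especially the use of an
HMAC as a stand-in for a real public-key signature (chosen only so the
sketch runs with the standard library). The line-sorting step is a toy and
does not do bnode labelling; it only shows where such a step would plug in.

```python
import hashlib
import hmac
from typing import Callable, Dict

# A canonicalizer turns raw bytes into the exact bytes that get signed.
Canonicalizer = Callable[[bytes], bytes]


def identity(data: bytes) -> bytes:
    """Default: sign the bytes as they are (PDF, PNG, CSV, anything)."""
    return data


def sorted_lines(data: bytes) -> bytes:
    """Toy stand-in for RDF canonicalization: sort non-empty lines.

    A real implementation would also need canonical bnode labelling;
    this sketch deliberately omits it.
    """
    lines = [ln.strip() for ln in data.splitlines() if ln.strip()]
    return b"\n".join(sorted(lines)) + b"\n"


# Hypothetical registry: format-specific steps are looked up by media type.
CANONICALIZERS: Dict[str, Canonicalizer] = {
    "application/octet-stream": identity,
    "application/n-triples": sorted_lines,
}


def sign(data: bytes, media_type: str, key: bytes) -> str:
    """Canonicalize with the format's plug-in, then sign the result."""
    canon = CANONICALIZERS.get(media_type, identity)
    # HMAC-SHA256 is a placeholder for a real signature algorithm.
    return hmac.new(key, canon(data), hashlib.sha256).hexdigest()


def verify(data: bytes, media_type: str, key: bytes, signature: str) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign(data, media_type, key), signature)
```

In this sketch, two N-Triples files that differ only in line order get the
same signature, while any other format is simply signed byte for byte.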



https://www.w3.org/TR/xmldsig-core2/#sec-Introduction tells us,
“This document specifies XML syntax and processing rules for creating and
representing digital signatures. XML Signatures can be applied to any digital
content (data object) <https://www.w3.org/TR/xmldsig-core2/#def-DataObject>,”

The last actual W3C REC says the same,
https://www.w3.org/TR/xmldsig-core1/#sec-Introduction

So that’s a W3C recommended technology that W3C currently says is up to the
job. It is old, its flaws are well known, it isn’t clear if it has been
abandoned or is just in maintenance mode, but it remains a recommended
standard for now.

Please indulge a thought experiment.

For all the non-RDF formats I touched on above, should they (a) use XML
detached signature, (b) go through the lengthy and painful process of trying
to create a rich RDFS/OWL-facilitated model of their content as an RDF
graph, so they can sign using Linked Data Signature, or (c), my suggestion
below, stick it all in one triple in an easily round-trippable way?
Let’s explore (c).

Apache HTTPd server logs could trivially be mapped into RDF’s data model
too, as could anything. Would this be in scope of the new WG per its
charter? (I am ignoring input docs for now, as Ivan has advised)

Let’s define such a mapping from bytes to a one-triple graph, call it the
“Retro Graph Mapping” (RGM). It retrospectively maps any byte sequence into
an RDF graph. It is similar in spirit to the idea of RDF graph literals,
perhaps.

For any sequence of bytes ‘bs’, create a corresponding sequence of
space-separated, lowercase, two-character hex codes, one per byte. RGM graphs
have one triple, which varies only in its literal value content and
datatype. For example:

<file:/dev/🦖/RGMv1> rdf:value "hex sequence here"^^wikidata:Q5153426 .

Formal spec to follow, but the basic idea is a triple-ization in which the
subject and predicate URIs are fixed, there is no language tag, and
everything is in a single value. The rdf: and wikidata: prefixes and 🦖 are
used here for a simple example; the datatype is an optional format
identifier that could be discarded. A
similar approach could be used to generate multi-graph datasets. A
discardable filename preference could be packed in to the file: URI too, if
desired. For text-oriented content it might be worth considering a more
readable representation than hex codes, but RGM is not designed for humans
to read. RGM is not especially useful for data access, SPARQL etc., but it
will make *any* kind of data signable by the work about to be chartered by
W3C.
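
A minimal Python sketch of the mapping described above, under stated
assumptions: the function names are hypothetical, the output is written as a
single N-Triples line with rdf:value expanded to its full IRI, and the
datatype is passed as a full IRI chosen by the caller (how the wikidata:
prefix expands is not specified here, so the sketch does not guess it).

```python
from typing import Optional

# Fixed subject taken from the example triple; fixed predicate is rdf:value.
RGM_SUBJECT = "<file:/dev/🦖/RGMv1>"
RDF_VALUE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#value>"


def rgm_encode(bs: bytes, datatype_iri: Optional[str] = None) -> str:
    """Map any byte sequence to a one-triple graph (one N-Triples line)."""
    # bytes.hex(sep) needs Python 3.8+; it yields lowercase hex pairs.
    hex_pairs = bs.hex(" ")
    literal = f'"{hex_pairs}"'
    if datatype_iri is not None:
        # Optional, discardable format identifier.
        literal += f"^^<{datatype_iri}>"
    return f"{RGM_SUBJECT} {RDF_VALUE} {literal} ."


def rgm_decode(triple: str) -> bytes:
    """Recover the original bytes from an RGM triple."""
    start = triple.index('"') + 1
    end = triple.index('"', start)
    # bytes.fromhex ignores the spaces between the hex pairs.
    return bytes.fromhex(triple[start:end])
```

The round trip is lossless: rgm_decode(rgm_encode(bs)) returns bs for any
bytes value, which is the "easily round-trippable" property the mapping
relies on.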

RGM can reflect any data format into a single-triple, very
efficiently-sorted, bnode-free RDF graph. This is both stupid and
powerful.

Is the coordinated, lead-the-web-to-its-full-potential W3C view here that
the following are both simultaneously true?

1.) XML Signature is good enough for all of the world’s files and data
except for the RDF-graph cases.
2.) XML Signature is so inappropriate and/or outdated that it is barely
mentioned for the case of signing RDF in the proposed charter and explainer.

If the truth is that XML Signature is a pain point for W3C in 2021 then the
fact that it is about to spin up a WG that can do some of the same things
deserves more attention than the zero mentions granted to the topic by the
draft charter.

The draft Signed Linked Data WG explainer says *“roughly, the same approach
as for XML [xmldsig-core1
<https://w3c.github.io/lds-wg-charter/explainer.html#bib-xmldsig-core1>].”
And yet the actual suggested charter does not mention XML, let alone W3C’s
huge piece of work in this area, XML Signature. This is despite the fact
that for any piece of data on the web, W3C offers both XML Signature and
also (via RGM’s retro-graph mapping into a triple) Linked Data Signature as
potentially relevant technologies.*

*We know that all web content can be turned into a normalized triple
via RGM, as I sketch above. Or it could be signed with XML Signature.*

*For cases like CSV, YAML, SVG, is there *anything* to be gained in the
Signature world from doing a more careful and fine-grained mapping into
RDF, beyond just avoiding having to use 20-year old XML-flavoured signature
technology? Is RGM too stupid to use, leaving those formats behind?*

*Why should RDF content get modernized web-standard signature tech first?
Why not make something modern for the content of the entire world-wide web
and then plug in the bnode-labelling preprocessor for the RDF special case?*

*The fact that W3C proposes to make new REC-track work on RDF Signature,
while simultaneously leaving its ancient XML Signature Recommendation
roaming the earth like an undead dinosaur ought to ring alarm bells here.
What are the prospects of this new RDF work being carefully maintained by
W3C in 20 years? It feels like this essentially general-purpose piece of
new work is being put through as a Linked Data thing because, if evaluated
by the wider set of stakeholders, it would attract more skepticism than
enthusiasm.*

*It is always easier to create new things than to curate old messes, and it
is always easier to scope things tightly than to risk a design by committee
that nearly-kinda meets everyone’s goals. The idea of entangling this new
set of work items with XML Signature ought to be slightly terrifying, but
cross-domain standards coordination is W3C’s core duty and strength.*

*And yet, any/all web content can be trivially brought into scope of the
new WG via RGM. Which puts us substantially in the same territory as that
currently occupied by the existing XML Sig W3C REC.*

*I know it is annoying to introduce new terminology but I do so here in
pursuit of consensus. Since any data can (via RGM or hard work
ontologizing) be “linked data” sufficiently to be signable via the proposed
new standards, we can ask ourselves whether custodians of data in non-RDF
formats would gain anything by doing so. If they would, the WG scope should
be admitted to be data-signing, not linked-data-signing. If not, I’d like
to understand why. Of course I understand the general benefits of using RDF
more wholeheartedly, but for the case of signing specifically the picture
does not yet feel clear.*

*Dan*




> Thanks,
> David Booth
>
>

Received on Friday, 11 June 2021 09:09:48 UTC