Re: Thoughts on the LDS WG chartering discussion

It looks like some formatting of this email may not be showing up properly,
but it seems readable at

https://lists.w3.org/Archives/Public/semantic-web/2021Jun/0102.html

Apologies for the noise!

Dan

On Fri, 11 Jun 2021 at 10:08, Dan Brickley <danbri@danbri.org> wrote:

> (Sorry, this is long.)
>
> On Fri, 11 Jun 2021 at 00:19, David Booth <david@dbooth.org> wrote:
>
>> On 6/10/21 11:08 AM, Ivan Herman wrote:
>> >> On 10 Jun 2021, at 16:13, David Booth <david@dbooth.org> wrote:
>> >> I still feel like I am somehow missing a fundamental assumption that
>> >> others are making and I have not yet been able to identify.
>>
>
> Know the feeling.
>
> […]
>
>> The other thing that I still fundamentally do not yet grasp about the
>> proposed charter is this: Why is it restricted to RDF source documents?
>> Clearly the canonicalization algorithm is about RDF, so that much I
>> understand.  But for the digital signature vocabulary, why wouldn't it
>> also be useful to be able to sign, say, a PDF document?  Why should the
>> RDF signing vocabulary be limited to talking about RDF documents?  Or am
>> I misunderstanding the intent here?  Perhaps if there were a simple,
>> complete example, it would help.  Again, I feel like I am missing some
>> of the assumed context.
>>
>
>
> This is a very reasonable and pertinent question.
>
> The vast vast majority of content on the web is not best considered as
> being purely RDF, even if you can often project out an RDF view of it. And
> the bits that are RDF-based rarely end up in anything most folks would
> consider an “RDF store”. But the distinction is a strange one, as explored
> below.
>
> There is a *lot* of non-RDF data out there. I can hardly believe this
> needs emphasis in 2021 but here we are.
>
> This content (or rather, the billions of web users touched by it) deserves
> a way of being assured by a modern W3C-standard signature mechanism.
>
> We should not gloss over the scale of this. We are talking about petabytes
> of data here. Literally the entire contents of the web, for starters.
>
> There are lots of other data-related formats out there in the web - CSS,
> CSV, SQL dumps, the email mbox format, iCalendar, vCard, non-RDF property
> graphs, MARC, YAML, XML itself of course, Protobufs, Apache Arrow,
> microformats, HTML5, HTML-anything, SGML, MIDI files, MP3, WAV,
> Prolog/Datalog, OWL, N3, RIF, … package formats like ZIP, JAR, … image
> formats like PNG, JPG, GIF, … the SVG case is interesting (“this is
> definitely our logo”) but the others have embedded metadata too, EXIF not
> being RDF whilst XMP is RDF. Maybe OWL/RIF/N3 could use their “compiled
> down to triples” view, but should we extend that argument to everyone else?
> RDF-star? What about SPARQL queries? Windows .ini files, …? PDFs, Flash,
> video file formats? In an age of misinformation facilitated by the web, it
> is not obvious that W3C’s next Data Signature WG should cover only “Linked”
> RDF data, and ignore media formats. What about robots.txt files? CBOR? The
> .sna format for snapshots of ZX Spectrum games? VMware images? .iso disk
> images?
>
> I could go on but
> https://en.m.wikipedia.org/wiki/List_of_file_formats exists. We didn’t
> even touch coding languages (JS, Java, JVM, WASM, GLSL, COBOL), or
> notebook formats. Or .rtf files.
>
> Even if you are solely concerned with some more restricted notion of
> “data”, still there is a lot out there, eg take a look around using
> https://datasetsearch.research.google.com/ to see what is showing up from
> research, science, govt etc.
>
> Should protein databank files be RDFized before they fall in scope of this
> new WG’s mission?
> https://en.wikipedia.org/wiki/Protein_Data_Bank_(file_format) - and if
> so, why?
>
>
>
>
>
> With a potentially huge scope established, what does W3C do in this area?
>
> Remember that RDF was introduced as a metadata system - its founding
> purpose was exactly to describe the sprawling chaos of the above file
> format diversity, and not to replace it all with triples.
>
> I understand that XML Signature can (or could) produce a detached
> signature for any format, but that it may also be showing its age as a
> standard, having been created 21+ years ago. Maybe it is time for it to be
> superseded by something more modern from W3C, with the option to drop in
> format-specific canonicalization steps, such as bnode labelling for RDF?
> Why not create *that* WG rather than this one, given W3C’s limited
> resources?
>
>
>
> https://www.w3.org/TR/xmldsig-core2/#sec-Introduction tells us,
> “This document specifies XML syntax and processing rules for creating
> and representing digital signatures. XML Signatures can be applied to any digital
> content (data object)
> <https://www.w3.org/TR/xmldsig-core2/#def-DataObject>,”
>
> The last actual W3C REC says the same,
> https://www.w3.org/TR/xmldsig-core1/#sec-Introduction
>
> So that’s a W3C recommended technology that W3C currently says is up to
> the job. It is old, its flaws are well known, it isn’t clear if it has been
> abandoned or is just in maintenance mode, but it remains a recommended
> standard for now.
>
> Please indulge a thought experiment.
>
> For all the non-RDF formats I touched on above, should they (a) use XML
> detached signature, (b) go through the lengthy and painful process of trying
> to create a rich RDFS/OWL-facilitated model of their content as an RDF
> graph, so they can sign using Linked Data Signature, or (c), my suggestion
> below, stick it all in one triple in an easily round-trippable way?
> Let’s explore (c).
>
> Apache HTTPd server logs could trivially be mapped into RDF’s data model
> too, as could anything. Would this be in scope of the new WG per its
> charter? (I am ignoring input docs for now, as Ivan has advised)
>
> Let’s define such a mapping from bytes to a 1 triple graph, call it the
> “Retro Graph Mapping” (RGM). It retrospectively maps any byte sequence into
> an RDF graph. It is similar in spirit to the idea of RDF graph literals,
> perhaps.
>
> For any sequence of bytes ‘bs’, create a corresponding sequence of
> space-separated, lowercase, hex-encoded pairs of Unicode characters (one
> pair per byte). RGM graphs have one triple which varies only in its literal
> value content and datatype. For example:
>
> <file:/dev/🦖/RGMv1> rdf:value "hex sequence here"^^wikidata:Q5153426 .
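
A minimal Python sketch of the mapping described above, under stated assumptions. The function names (`rgm_hex`, `rgm_triple`, `rgm_bytes`), the use of full IRIs in place of the rdf: and wikidata: prefixes, and the N-Triples-style output line are illustrative choices only, since no formal RGM spec exists yet. The subject IRI and the datatype are copied from the example triple.

```python
RDF_VALUE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#value"
WIKIDATA = "http://www.wikidata.org/entity/"
RGM_SUBJECT = "file:/dev/🦖/RGMv1"  # fixed subject taken from the example triple


def rgm_hex(bs: bytes) -> str:
    """Space-separated, lowercase, two-character hex pairs, one pair per byte."""
    return " ".join(f"{b:02x}" for b in bs)


def rgm_triple(bs: bytes, datatype: str = WIKIDATA + "Q5153426") -> str:
    """Serialize the single RGM triple as one N-Triples-style line (assumed form)."""
    return f'<{RGM_SUBJECT}> <{RDF_VALUE}> "{rgm_hex(bs)}"^^<{datatype}> .'


def rgm_bytes(triple: str) -> bytes:
    """Recover the original bytes from a line produced by rgm_triple."""
    # The literal holds only hex digits and spaces, so there are no escapes
    # to handle; bytes.fromhex ignores the spaces between pairs.
    literal = triple.split('"')[1]
    return bytes.fromhex(literal)
```

For instance, `rgm_hex(b"hi\n")` gives `"68 69 0a"`, and `rgm_bytes(rgm_triple(bs))` returns `bs` unchanged, which is the round-tripping property the mapping relies on.
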
>
> Formal spec to follow but the basic idea is a triple-ization in which the
> subject and predicate URIs are fixed, there is no language tag, and
> everything is in a single value. The rdf: and wikidata: prefixes and 🦖 are
> used here for a simple example; the datatype is an optional format
> identifier that could be discarded. A similar approach could be used to
> generate multi-graph datasets. A discardable filename preference could be
> packed in to the file: URI too, if desired. For text-oriented content it
> might be worth considering a more readable representation than hex codes,
> but RGM is not designed for humans to read. RGM is not especially useful
> for data access, SPARQL etc., but it will make *any* kind of data signable
> by the work about to be chartered by W3C.
>
> RGM can reflect any data format into a single-triple, very
> efficiently-sorted, bnode-free RDF graph. This is both stupid and
> powerful.
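
To illustrate why a single-triple, bnode-free graph is convenient here: there are no blank nodes to relabel and no triples to sort, so the serialized line can be hashed directly. The sketch below is an assumption-laden illustration, not anything the draft charter specifies. SHA-256, the N-Triples-style serialization, and the names `rgm_line` and `rgm_digest` are all hypothetical choices, and the optional datatype is discarded. An actual signature would additionally involve keys and a signature suite, which are out of scope for this sketch.

```python
import hashlib


def rgm_line(bs: bytes) -> str:
    """One N-Triples-style line for the RGM graph (datatype discarded)."""
    hex_pairs = " ".join(f"{b:02x}" for b in bs)
    return (
        "<file:/dev/🦖/RGMv1> "
        "<http://www.w3.org/1999/02/22-rdf-syntax-ns#value> "
        f'"{hex_pairs}" .'
    )


def rgm_digest(bs: bytes) -> str:
    """SHA-256 hex digest of the serialized triple (assumed hash choice)."""
    # With one ground triple, the serialization is already deterministic,
    # so no bnode labelling or sorting step is needed before hashing.
    return hashlib.sha256(rgm_line(bs).encode("utf-8")).hexdigest()
```

The same input bytes always yield the same digest, and any change to the bytes changes the literal and therefore the digest.
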
>
> Is the coordinated, lead-the-web-to-its-full-potential W3C view here that
> the following are both simultaneously true?
>
> 1.) XML Signature is good enough for all of the world’s files and data
> except for the RDF-graph cases.
> 2.) XML Signature is so inappropriate and/or outdated that it is barely
> mentioned for the case of signing RDF in the proposed charter and explainer.
>
> If the truth is that XML Signature is a pain point for W3C in 2021 then
> the fact that it is about to spin up a WG that can do some of the same
> things deserves more attention than the zero mentions granted to the topic
> by the draft charter.
>
> The draft Signed Linked Data WG explainer says “roughly, the same
> approach as for XML [xmldsig-core1
> <https://w3c.github.io/lds-wg-charter/explainer.html#bib-xmldsig-core1>]”,
> and yet the actual suggested charter does not mention XML, let alone W3C’s
> huge piece of work in this area, XML Signature. This is despite the fact
> that for any piece of data in the web, W3C offers both XML Signature and
> (via RGM’s retro-graph mapping into a triple) Linked Data Signature as
> potentially relevant technologies.
>
> We know that all web content can be turned into a normalized triple
> via RGM as I sketch above. Or it could be signed with XML Signature.
>
> For cases like CSV, YAML, SVG, is there *anything* to be gained in the
> Signature world from doing a more careful and fine-grained mapping into
> RDF, beyond just avoiding having to use 20-year-old XML-flavoured signature
> technology? Is RGM too stupid to use, leaving those formats behind?
>
> Why should RDF content get modernized web-standard signature tech first?
> Why not make something modern for the content of the entire world-wide web
> and then plug in the bnode-labelling preprocessor for the RDF special case?
>
> The fact that W3C proposes to make new REC-track work on RDF Signature,
> while simultaneously leaving its ancient XML Signature Recommendation
> roaming the earth like an undead dinosaur, ought to ring alarm bells here.
> What are the prospects of this new RDF work being carefully maintained by
> W3C in 20 years? It feels like this essentially general-purpose piece of
> new work is being put through as a Linked Data thing because, if evaluated
> by the wider set of stakeholders, it would attract more skepticism than
> enthusiasm.
>
> It is always easier to create new things than to curate old messes, and
> it is always easier to scope things tightly than to risk a design by
> committee that nearly-kinda meets everyone’s goals. The idea of entangling
> this new set of work items with XML Signature ought to be slightly
> terrifying, but cross-domain standards coordination is W3C’s core duty and
> strength.
>
> And yet, any/all web content can be trivially brought into scope of the
> new WG via RGM. Which puts us substantially in the same territory as that
> currently occupied by the existing XML Sig W3C REC.
>
> I know it is annoying to introduce new terminology but I do so here in
> pursuit of consensus. Since any data can (via RGM or hard work
> ontologizing) be “linked data” sufficiently to be signable via the proposed
> new standards, we can ask ourselves whether custodians of data in non-RDF
> formats would gain anything by doing so. If they would, the WG scope should
> be admitted to be data-signing, not linked-data-signing. If not, I’d like
> to understand why. Of course I understand the general benefits of using RDF
> more wholeheartedly, but for the case of signing specifically the picture
> does not yet feel clear.
>
> Dan
>
>
>
>
>> Thanks,
>> David Booth
>>
>>

Received on Friday, 11 June 2021 09:17:24 UTC