- From: Dan Brickley <danbri@danbri.org>
- Date: Fri, 11 Jun 2021 10:08:56 +0100
- To: David Booth <david@dbooth.org>
- Cc: semantic-web@w3.org
- Message-ID: <CAFfrAFqiFh5pnWN2Ai_h+xjxjctC-v4-8agq0gxyqmUBxnkMyg@mail.gmail.com>
(Sorry, this is long.)

On Fri, 11 Jun 2021 at 00:19, David Booth <david@dbooth.org> wrote:

> On 6/10/21 11:08 AM, Ivan Herman wrote:
> >> On 10 Jun 2021, at 16:13, David Booth <david@dbooth.org>
> >> I still feel like I am somehow missing a fundamental assumption that
> >> others are making and I have not yet been able to identify.

Know the feeling.

[…]

> The other thing that I still fundamentally do not yet grasp about the
> proposed charter is this: Why is it restricted to RDF source documents?
> Clearly the canonicalization algorithm is about RDF, so that much I
> understand. But for the digital signature vocabulary, why wouldn't it
> also be useful to be able to sign, say, a PDF document? Why should the
> RDF signing vocabulary be limited to talking about RDF documents? Or am
> I misunderstanding the intent here? Perhaps if there were a simple,
> complete example, it would help. Again, I feel like I am missing some
> of the assumed context.

This is a very reasonable and pertinent question. The vast vast majority of content on the web is not best considered as being purely RDF, even if you can often project out an RDF view of it. And the bits that are RDF-based rarely end up in anything most folks would consider an “RDF store”. But the distinction is a strange one, as explored below.

There is a *lot* of non-RDF data out there. I can hardly believe this needs emphasis in 2021 but here we are. This content (or rather, the billions of web users touched by it) deserves ways of being assured by modern W3C-standard Signature. We should not gloss over the scale of this. We are talking about petabytes of data here. Literally the entire contents of the web, for starters.

There are lots of other data-related formats out there on the web - CSS, CSV, SQL dumps, the email mbox format, iCalendar, vCard, non-RDF property graphs, MARC, YAML, XML itself of course, Protobufs, Apache Arrow, microformats, HTML5, HTML-anything, SGML, MIDI files, MP3, WAV, Prolog/Datalog, OWL, N3, RIF, … package formats like ZIP, JAR, … image formats like PNG, JPG, GIF, … the SVG case is interesting (“this is definitely our logo”) but the others have embedded metadata too, EXIF not being RDF whilst XMP is RDF. Maybe OWL/RIF/N3 could use their “compiled down to triples” view, but should we extend that argument to everyone else? RDF-star? What about SPARQL queries? Windows .ini files, …? PDFs, Flash, video file formats?

In an age of misinformation facilitated by the web, it is not obvious that W3C’s next Data Signature WG should cover only “Linked” RDF data, and ignore media formats. What about robots.txt files? CBOR? The .sna format for snapshots of ZX Spectrum games? VMware images? .iso disk images? I could go on but https://en.m.wikipedia.org/wiki/List_of_file_formats exists. We didn’t even touch coding languages (JS, Java, JVM, WASM, GLSL, COBOL), or notebook formats. Or .rtf files.

Even if you are solely concerned with some more restricted notion of “data”, there is still a lot out there, e.g. take a look around using https://datasetsearch.research.google.com/ to see what is showing up from research, science, govt etc. Should protein databank files be RDFized before they fall in scope of this new WG’s mission? https://en.wikipedia.org/wiki/Protein_Data_Bank_(file_format) - and if so, why?

With that potentially huge scope established, what does W3C do in this area? Remember that RDF was introduced as a metadata system - its founding purpose was exactly to describe the sprawling chaos of the above file format diversity, and not to replace it all with triples.

I understand that XML Signature can/could produce a detached signature for any format, but that it may also be showing its age as a standard, having been created 21+ years ago. Maybe it is time for it to be superseded by something more modern from W3C, with the option to drop in format-specific canonicalization steps, such as bnode labelling etc. for RDF? Why not create *that* WG rather than this one, given W3C’s limited resources?

https://www.w3.org/TR/xmldsig-core2/#sec-Introduction tells us, “This document specifies XML syntax and processing rules for creating and representing digital signatures. XML Signatures can be applied to any digital content (data object) <https://www.w3.org/TR/xmldsig-core2/#def-DataObject>, …” The last actual W3C REC says the same: https://www.w3.org/TR/xmldsig-core1/#sec-Introduction

So that’s a W3C recommended technology that W3C currently says is up to the job. It is old, its flaws are well known, and it isn’t clear whether it has been abandoned or is just in maintenance mode, but it remains a recommended standard for now.

Please indulge a thought experiment. For all the non-RDF formats I touched on above, should they:

(a) use XML detached signature,
(b) go through the lengthy and painful process of trying to create a rich RDFS/OWL-facilitated model of their content as an RDF graph, so they can sign using Linked Data Signature, or
(c) my suggestion below: stick it all in one triple in an easily round-trippable way?

Let’s explore (c). Apache HTTPd server logs could trivially be mapped into RDF’s data model too, as could anything. Would this be in scope of the new WG per its charter? (I am ignoring input docs for now, as Ivan has advised.)

Let’s define such a mapping from bytes to a one-triple graph, and call it the “Retro Graph Mapping” (RGM). It retrospectively maps any byte sequence into an RDF graph. It is similar in spirit to the idea of RDF graph literals, perhaps.

For any sequence of bytes ‘bs’, create a corresponding hex-encoded string: a space-separated sequence of lowercase pairs of Unicode characters, one pair per byte. RGM graphs have one triple which varies only in its literal value content and datatype. For example:

    <file:/dev/🦖/RGMv1> rdf:value "hex sequence here"^^wikidata:Q5153426 .

Formal spec to follow, but the basic idea is a triple-ization in which the subject and predicate URIs are fixed, there is no language tag, and everything is in a single value. The rdf: and wikidata: prefixes and 🦖 are used here for a simple example; the datatype is an optional format identifier that could be discarded. A similar approach could be used to generate multi-graph datasets. A discardable filename preference could be packed into the file: URI too, if desired. For text-oriented content it might be worth considering a more readable representation than hex codes, but RGM is not designed for humans to read.

RGM is not especially useful for data access, SPARQL etc., but it will make *any* kind of data signable by the work about to be chartered by W3C. RGM can reflect any data format into a single-triple, very efficiently-sorted, bnode-free RDF graph. This is both stupid and powerful.

Is the coordinated, lead-the-web-to-its-full-potential W3C view here that the following are both true simultaneously?

1.) XML Signature is good enough for all of the world’s files and data except for the RDF-graph cases.
2.) XML Signature is so inappropriate and/or outdated that it is barely mentioned for the case of signing RDF in the proposed charter and explainer.

If the truth is that XML Signature is a pain point for W3C in 2021, then the fact that it is about to spin up a WG that can do some of the same things deserves more attention than the zero mentions granted to the topic by the draft charter.

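To make the RGM idea concrete, here is a minimal sketch of the byte-to-triple mapping described above and its inverse. It is only an illustration, not a spec: the function names `rgm_encode` and `rgm_decode` are hypothetical, the output is written as an N-Triples-style line with full IRIs, and the expansion of the wikidata: prefix to http://www.wikidata.org/entity/ is an assumption.

```python
# Illustrative sketch of the "Retro Graph Mapping" (RGM); all names here are
# hypothetical and the IRI expansions below are assumptions, not a spec.

RGM_SUBJECT = "file:/dev/🦖/RGMv1"  # fixed subject from the example triple
RDF_VALUE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#value"  # rdf:value
# Assumed expansion of wikidata:Q5153426 (the optional format identifier).
DEFAULT_DATATYPE = "http://www.wikidata.org/entity/Q5153426"


def rgm_encode(bs: bytes, datatype: str = DEFAULT_DATATYPE) -> str:
    """Map any byte sequence to a single N-Triples-style line."""
    # One lowercase two-character hex pair per byte, space-separated.
    hex_pairs = " ".join(f"{b:02x}" for b in bs)
    return f'<{RGM_SUBJECT}> <{RDF_VALUE}> "{hex_pairs}"^^<{datatype}> .'


def rgm_decode(triple: str) -> bytes:
    """Recover the original bytes from a line produced by rgm_encode."""
    # The literal only contains hex digits and spaces, so the first two
    # double quotes in the line delimit it unambiguously.
    start = triple.index('"') + 1
    end = triple.index('"', start)
    return bytes.fromhex(triple[start:end])  # fromhex skips the spaces


if __name__ == "__main__":
    line = rgm_encode(b"hello")
    print(line)
    print(rgm_decode(line))
```

Because the literal contains only hex digits and spaces, no escaping is needed and the round trip is lossless, which is the whole point of the thought experiment: signing this one triple amounts to signing the original bytes.
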
The draft Signed Linked Data WG explainer says “roughly, the same approach as for XML [xmldsig-core1 <https://w3c.github.io/lds-wg-charter/explainer.html#bib-xmldsig-core1>].” And yet the actual suggested charter does not mention XML, let alone W3C’s huge piece of work in this area, XML Signature. This is despite the fact that for any piece of data on the web, W3C offers both XML Signature and also (via RGM’s retro-graph mapping into a triple) Linked Data Signature as potentially relevant technologies.

We know that all web content can be turned into a normalized triple via RGM, as I sketch above. Or it could be signed with XML Signature.

For cases like CSV, YAML, SVG, is there *anything* to be gained in the Signature world from doing a more careful and fine-grained mapping into RDF, beyond just avoiding having to use 20-year-old XML-flavoured signature technology? Is RGM too stupid to use, leaving those formats behind?

Why should RDF content get modernized web-standard signature tech first? Why not make something modern for the content of the entire world-wide web and then plug in the bnode-labelling preprocessor for the RDF special case?

The fact that W3C proposes to make new REC-track work on RDF Signature, while simultaneously leaving its ancient XML Signature Recommendation roaming the earth like an undead dinosaur, ought to ring alarm bells here. What are the prospects of this new RDF work being carefully maintained by W3C in 20 years? It feels like this essentially general-purpose piece of new work is being put through as a Linked Data thing because, when evaluated by the wider set of stakeholders, it will attract more skepticism than enthusiasm.

It is always easier to create new things than to curate old messes, and it is always easier to scope things tightly than to risk a design by committee that nearly-kinda meets everyone’s goals. The idea of entangling this new set of work items with XML Signature ought to be slightly terrifying, but cross-domain standards coordination is W3C’s core duty and strength.

And yet, any/all web content can be trivially brought into scope of the new WG via RGM. Which puts us substantially in the same territory as that currently occupied by the existing XML Sig W3C REC.

I know it is annoying to introduce new terminology but I do so here in pursuit of consensus. Since any data can (via RGM or hard work ontologizing) be “linked data” sufficiently to be signable via the proposed new standards, we can ask ourselves whether custodians of data in non-RDF formats would gain anything by doing so. If they would, the WG scope should be admitted to be data-signing, not linked-data-signing. If not, I’d like to understand why. Of course I understand the general benefits of using RDF more wholeheartedly, but for the case of signing specifically the picture does not yet feel clear.

Dan

> Thanks,
> David Booth
Received on Friday, 11 June 2021 09:09:48 UTC