Re: Event Updated: RDF Canonicalization and Hash Working Group

> On Mar 3, 2023, at 3:00 AM, Pierre-Antoine Champin <pierre-antoine@w3.org> wrote:
> 
> 
> 
> On 02/03/2023 02:13, Gregg Kellogg wrote:
>>> On Mar 1, 2023, at 12:29 AM, Ivan Herman <ivan@w3.org> wrote:
>>> 
>>> Hi Gregg,
>>> 
>>> all this worries me. I do not want to find the RCH work suspended unnecessarily on this issue, which is not really ours. As you say, the details of the nquads canonicalization do not affect the algorithm, and the goal of this WG is to standardize that one and nothing else.
>> 
>> I don’t see any technical barriers to getting the canonicalization work done, once the administrative issues are cleared. I suspect RDF-star (and thus N-Quads) can do an FPWD soon, although moving to CR, PR, and REC could be delayed due to the complexities of standardizing quoted triples and their semantics. The direction of “text direction” could also slow things down; both of those should be “at risk”. In any case, I think that canonicalizing datasets including quoted triples should probably require a transformation to a reified form, and that may be true for language-tagged strings having a text direction as well.
>> 
>>> The reason this is a problem is that, if we are not careful, we may find ourselves hooked on the CR phase unnecessarily. Any change on the nquads canonicalization will have to spread to various RDF frameworks/libraries out there, and that will take some time. On the other hand, the implementation of URDNA usually relies on such general frameworks. I do not want to find ourselves in a state whereby the URDNA implementer would have to re-implement the nquads serialization along the lines of the new RDF specification to pass all the tests.
> Well, I don't know about others, but my URDNA implementation does implement its own n-quads serializer, because
> 
> - the "regular" serializer I have does not have an option to generate canonical n-quads, and
> - even if it did, I would not trust it, as canonical n-triples/n-quads is currently under-specified (as Gregg points out below);
> - anyway, writing such a serializer is very simple...
> 
>> 
>> Existing N-Triples canonicalization is under-specified, and I believe that implementations likely already vary in their representations of control characters in strings; we just don’t know about it. Any tightening of this will likely affect existing implementations.
>> 
>>> What I would propose this WG could do is to look at all the tests which could be affected by any of those proposed changes and either remove them or make them optional tests. These tests would not affect our main goal of the CR testing, namely to prove that the URDNA algorithm, and its textual specification, is correct and interoperable (which is the real goal of the CR phase); as a consequence, these tests should not stand in the way of passing to Proposed Rec when the time comes.
>> 
>> My investigation shows just test060 as being affected, as it’s the only one which tries to test the character ranges. Dave’s recent update stresses this further, and may show up other variations. That aside, there’s the basic security consideration of having strings that include unescaped control characters: if presented to a user, these could be misleading, as is the case now.
> Are you referring to something like this: https://www.securityweek.com/trojan-source-attack-abuses-unicode-inject-vulnerabilities-code/ ?
> 
Essentially, yes; thanks for the reference.
> I would argue that this is not really an issue for canonical N-Quads, which is not a programming language, and not really meant to be read by humans (whenever I need to "read" RDF, I convert it to Turtle or Trig before…).
> 
That's a reasonable distinction. Still requires something in security considerations, and maybe would be more impactful in Turtle/TriG.

>>> I realize that the RDF changes may and will affect the URDNA deployment as well, and that also worries me. But our first obligation is to correctly finalize the standard URDNA specification...
>> 
>> One possible outcome would be to leave the existing N-Triples canonicalization unchanged (other than the ambiguity of xsd:string) and simply extend to N-Quads as would be expected. The security considerations of using unescaped strings in these representations could simply be noted. I think that is a less-perfect form, but deployment and dependency considerations may be more important. My guess is that the presence of unescaped control characters in existing data subject to canonicalization is pretty minimal, and the changes in escaping likely have little practical impact other than in tests.
> Depends on what characters you require to escape... If that's any non-ascii character, I would argue that this could have huge impact (consider someone's family name in a VC, in countries using accented or non-latin characters…).
> 
I’ve suggested that U+0000 through U+001F, along with U+0022 ("), U+005C (\), and U+007F (DEL), MUST be escaped, using ECHAR if defined, otherwise UCHAR. No other characters may be escaped.

With the addition of prohibiting an explicit xsd:string on plain literals, I think this creates an unambiguous encoding with a minimal impact on existing uses.
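
To make the suggestion concrete, here is a minimal sketch of that escaping rule in Python. It is only an illustration, not text from any specification: the names `escape_lexical` and `serialize_literal` are hypothetical, the use of upper-case hex digits in UCHAR is an assumption on my part, and language tags are passed through without any case normalization.

```python
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"

# Characters in the suggested set that have an ECHAR form in N-Triples/N-Quads.
ECHAR = {
    "\b": "\\b",
    "\t": "\\t",
    "\n": "\\n",
    "\f": "\\f",
    "\r": "\\r",
    '"': '\\"',
    "\\": "\\\\",
}


def escape_lexical(value: str) -> str:
    """Escape only U+0000-U+001F, U+0022, U+005C and U+007F (hypothetical helper)."""
    out = []
    for ch in value:
        cp = ord(ch)
        if ch in ECHAR:
            # Prefer the short ECHAR form where one is defined.
            out.append(ECHAR[ch])
        elif cp <= 0x1F or cp == 0x7F:
            # Otherwise fall back to UCHAR; upper-case hex is an assumption.
            out.append(f"\\u{cp:04X}")
        else:
            # Everything else, including non-ASCII text, is emitted as-is.
            out.append(ch)
    return "".join(out)


def serialize_literal(value: str, datatype: str = None, language: str = None) -> str:
    """Serialize a literal, never writing an explicit xsd:string datatype."""
    body = f'"{escape_lexical(value)}"'
    if language:
        return f"{body}@{language}"
    if datatype and datatype != XSD_STRING:
        return f"{body}^^<{datatype}>"
    return body
```

With this sketch, a family name with accented or non-Latin characters is left untouched, which is the point of restricting the escaped set, and a literal typed as xsd:string serializes the same way as a plain one.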

Gregg
>   pa
> 
>> 
>> Gregg
>> 
>>> Ivan 
>>> 
>>> ----
>>> Ivan Herman, W3C 
>>> Home: http://www.w3.org/People/Ivan/
>>> mobile: +33 6 52 46 00 43
>>> ORCID ID: https://orcid.org/0000-0003-0782-2704
>>> On 1 Mar 2023 at 00:13 +0100, Gregg Kellogg <gregg@greggkellogg.net>, wrote:
>>>> We might want to spend a little time on N-Quads Canonicalization. There’s an open pull request in the N-Quads repo [1] that’s facing some short-term hurdles due to questions of procedure and RDF-star WG charter. There is also an issue on reconsidering how characters are escaped [2], which would affect existing test results, particularly if simple literals are always serialized with the xsd:string datatype. While they don’t affect the canonicalization algorithm, they may affect the hashes produced for quads containing literals. The RDF-star WG will need some feedback from the RCH WG, and whatever is needed to clear any charter issues. (Note, it is an open erratum against N-Quads [3][4], which is in scope, but the extent of changes to be considered needs clarification).
>>>> 
>>>> Gregg Kellogg
>>>> gregg@greggkellogg.net
>>>> 
>>>> [1] https://github.com/w3c/rdf-n-quads/pull/17
>>>> [2] https://github.com/w3c/rdf-n-quads/issues/16
>>>> [3] https://www.w3.org/2001/sw/wiki/RDF1.1_Errata#erratum_32
>>>> [4] https://www.w3.org/2001/sw/wiki/RDF1.1_Errata#erratum_33
>>>> 
>>>>> On Feb 21, 2023, at 2:21 AM, Phil Archer (W3C Calendar) <noreply+calendar@w3.org> wrote:
>>>>> 
>>>>> View this event in your browser <https://www.w3.org/events/meetings/15ef939f-4654-4541-959d-51ba50b4d022/20230301T100000> 
>>>>> 
>>>>> RDF Canonicalization and Hash Working Group Upcoming Confirmed
>>>>> 01 March 2023, 10:00 -10:55 America/New_York
>>>>> 
>>>>> Event is recurring every other week on Wednesday, starting from 2023-02-01, until 2024-07-17
>>>>> 
>>>>> RDF Dataset Canonicalization and Hash Working Group <https://www.w3.org/groups/wg/rch/calendar>
>>>>> Bi-weekly meeting of the RCH group, back after a short break.
>>>>> 
>>>>>  
>>>>> Agenda
>>>>> Scribe list (most recent first): Gregg, pchampin, DLongley, Ahmad, PhilA, AndyS, Manu
>>>>> New introductions
>>>>> Round the room including any comments from VCWG F2F in Miami
>>>>> Canon Issue 4 <https://github.com/w3c/rdf-canon/issues/4> (What is the output)
>>>>> Hash Issue 2 <https://github.com/w3c/rch-rdh/issues/2>
>>>>> Issue bashing <https://github.com/w3c/rdf-canon/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-asc>
>>>>>  
>>>>> 
>>>>>  
>>>>> Participants
>>>>> Organizers
>>>>> Phil Archer
>>>>> Markus Sabadello
>>>>> Groups
>>>>> RDF Dataset Canonicalization and Hash Working Group <https://www.w3.org/groups/wg/rch> (View Calendar <https://www.w3.org/groups/wg/rch/calendar>)
>>>> 
>> 

Received on Friday, 3 March 2023 20:23:28 UTC