Re: rdf:JSON from Gregg Kellogg on 2023-11-03 (public-rdf-star-wg@w3.org from November 2023)

From: Gregg Kellogg <gregg@greggkellogg.net>
Date: Fri, 3 Nov 2023 10:09:58 -0700
To: "Peter F. Patel-Schneider" <pfpschneider@gmail.com>
Cc: Antoine Zimmermann <antoine.zimmermann@emse.fr>, RDF-star Working Group <public-rdf-star-wg@w3.org>
Message-Id: <3AC3B768-E3BA-4D62-A63C-BA5709E2C85F@greggkellogg.net>
> On Nov 3, 2023, at 8:33 AM, Peter F. Patel-Schneider <pfpschneider@gmail.com> wrote:
> 
> On 11/3/23 05:11, Antoine Zimmermann wrote:
>> Peter,
>> The way you put it is strange: you first say that all JSON should be allowed, then immediately show that it would be a problem.
>> So, can you explain why it *should* be so?
>> --AZ
> The short version is that disallowing parts of JSON is worse than allowing all of JSON but that allowing all of JSON has difficulties that need to be addressed.  For a longer version see below.
> 
> 
> 
> Disclaimer:  The following is all my understanding of JSON, Javascript, Unicode, and the various ECMA and RDF documents on them.  I've spent a lot of time trying to make sense of all of them but I don't consider myself a full expert.
> 
> 
> The basic idea underlying JSON is quite simple.  JSON allows computer-language-independent transfer of values, which in turn are either strings, numbers, arrays, or objects.
> 
> The problems arise when one actually tries to use JSON.  What are JSON strings? What are JSON numbers?  What are JSON objects?  JSON documentation by and large leads one to believe that the answers are all idealistic - JSON strings are Unicode strings, JSON numbers are not limited in range or precision, JSON arrays are sequences of values, and JSON objects are bags (or, more likely, sets) of string-value pairs.
> 
> But the initial uses of JSON were in Javascript - JSON is the JavaScript Object Notation after all.  So the historical answers are different - JSON strings are supposed to be Javascript strings, JSON numbers are supposed to be Javascript numbers, JSON objects are supposed to be Javascript objects.
> 
> Looking further into Javascript ends up with the following.  JSON strings are supposed to be finite sequences of UTF-16 code units.  JSON numbers are supposed to be IEEE floating point double with the lexical-to-value mapping as in Javascript.  JSON objects are supposed to be finite maps from JSON strings to JSON values.
> 
> 
> This state of affairs has been codified by RFC 8785 JSON Canonicalization Scheme (JCS) https://www.rfc-editor.org/rfc/rfc8785, which itself depends on The I-JSON Message Format https://www.rfc-editor.org/rfc/rfc7493.  I-JSON is a syntactic restriction of JSON that forbids duplicate names in objects, prohibits both Unicode surrogate code points and Unicode noncharacters, and suggests that numbers be restricted to those that map nicely onto IEEE floating point double.
> 
> So I-JSON prohibits
> 
> "\uDEAD"
> 
> and
> 
> {"a": 1, "a": 2}
> 
> and suggests not using
> 
> 3.141592653589793238462643383279
> 
> and
> 
> 1E1000
> 
> but allows
> 
> "\uD800\uDEAD"
> 
> Why is the last allowed?  Because Javascript strings are UTF-16 code units and JSON string escapes only allow 16-bit escapes so many Unicode characters have to be escaped as pairs of Unicode surrogate characters.
> 
> So RFC 8785 provides the basis of an RDF datatype for I-JSON.  In essence one takes I-JSON syntax, processes it as it would be processed in Javascript, and outputs it as it would be printed in Javascript.  OK so far, except that there are some peculiarities that expose the fact that Javascript uses UTF-16 internally.  These are (only) (very) annoying.

The notion of an RDF string is based on code points, rather than a specific encoding. Also, note that JSON is now restricted to UTF-8, even though ECMAScript uses UTF-16. We also have other limitations on what can be in a string. We might say that the lexical space of an rdf:JSON literal is an RDF string conforming to the grammar defined in RFC 8259 with lexical values not expressible as RDF strings being undefined.
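
To illustrate the difference between code points and UTF-16 code units, here is a minimal sketch using Python's standard json module as a stand-in for a real-world parser. The helper names utf16_code_units and is_utf8_encodable are hypothetical, and UTF-8 encodability is used only as a rough proxy for "contains no lone surrogate"; it is not the full set of restrictions on RDF strings.

```python
import json

def utf16_code_units(text: str) -> int:
    """Count UTF-16 code units; 'surrogatepass' lets lone surrogates through."""
    return len(text.encode("utf-16-be", "surrogatepass")) // 2

def is_utf8_encodable(text: str) -> bool:
    """Rough proxy (an assumption): a string with a lone surrogate has no UTF-8 form."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

# A surrogate pair escape denotes one code point outside the BMP.
paired = json.loads(r'"\uD800\uDEAD"')
# A lone surrogate escape is accepted by this parser, but I-JSON prohibits it.
lone = json.loads(r'"\uDEAD"')

print(len(paired), utf16_code_units(paired), is_utf8_encodable(paired))
print(len(lone), utf16_code_units(lone), is_utf8_encodable(lone))
```

The paired escape is one code point but two UTF-16 code units, while the lone surrogate parses without complaint yet cannot be written out as UTF-8.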

> But there are two problems.  First, it is unclear what actually counts as I-JSON.  Should all JSON numbers be allowed?  Or only JSON numbers that nicely map into IEEE floating point double?  Second, what about the prohibited parts of JSON?

Based on what I said above, all numbers that can be expressed using the JSON grammar would be valid lexical values, but not all such numbers can be mapped to the value space. Is it reasonable to have lexical forms which do not map to the value space? In practice, this is the case for any real-world JSON parser now. PR 66 [1] attempts to define a minimal value space [2] based on abstract JSON values, rather than specific UTF-16 strings or IEEE floating point numbers.
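
As a rough illustration of lexical forms that do not map cleanly onto doubles, the following sketch parses the two numbers from the quoted message with Python's standard json module, once as IEEE doubles and once as arbitrary-precision decimals. The function names are hypothetical, and Decimal is only an assumed stand-in for an "abstract" number; it is not what PR 66 specifies.

```python
import json
import math
from decimal import Decimal

PI_TEXT = "3.141592653589793238462643383279"
HUGE_TEXT = "1E1000"

def parse_as_double(text: str):
    """Parse every JSON number as an IEEE double, as most parsers do."""
    return json.loads(text, parse_int=float)

def parse_as_decimal(text: str):
    """Parse every JSON number as an exact decimal (an assumed abstract value)."""
    return json.loads(text, parse_float=Decimal, parse_int=Decimal)

# As a double, the extra digits are silently rounded away and the
# large exponent overflows to infinity.
print(parse_as_double(PI_TEXT), parse_as_double(HUGE_TEXT))

# As a decimal, both lexical forms keep their exact value.
print(parse_as_decimal(PI_TEXT), parse_as_decimal(HUGE_TEXT))
```

Both lexical forms are valid under the JSON grammar; whether they have a value depends entirely on which number space the parser maps them into.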

> It is not so hard to do the same thing that RFC 8785 does except for all of JSON.  One could just say "do as Javascript does" or one could extend RFC 8785 by saying that for repeated names in object the last one is taken, JSON strings are sequences of UTF-16 code units, and all JSON numbers are allowed.
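
The "last one is taken" behavior for repeated names, and the I-JSON alternative of rejecting them, can be sketched as follows. This assumes Python's standard json module as the parser; parse_last_wins, parse_strict, and reject_duplicates are hypothetical names used only for illustration.

```python
import json

def parse_last_wins(text: str):
    """Default behavior of this parser: a repeated name keeps its last value."""
    return json.loads(text)

def reject_duplicates(pairs):
    """Hook that refuses objects with repeated names, as I-JSON requires."""
    names = [name for name, _ in pairs]
    if len(names) != len(set(names)):
        raise ValueError("duplicate name in JSON object")
    return dict(pairs)

def parse_strict(text: str):
    """Parse, but raise ValueError on any object with a repeated name."""
    return json.loads(text, object_pairs_hook=reject_duplicates)

print(parse_last_wins('{"a": 1, "a": 2}'))
```

Calling parse_strict on the same text raises ValueError instead of silently dropping the first member.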

Just as XSD datatypes may be used for literals that do not conform to the requirements of a particular datatype (say “1.0e1”^^xsd:integer), some JSON values can be expressed that are not valid. If the INFRA standard is relied upon for the value space, that would put it into UTF-16. But, note that the JSON-LD WG defined the value space as being the JCS serialization of the lexical form.

> There is also the question of what the value space for rdf:JSON is.   JSON-LD uses strings but it seems to me that the value space for rdf:JSON should be the data that a JSON text encodes, i.e., a recursively defined datatype such as

PR 66 [1] lays out a value space based on abstract notions of arrays, maps, numbers, booleans, and strings whereas JSON-LD used the JCS serialization as the value space, so there is value in limiting the value space to be the JCS serialization.


> Definition:  The value space for rdf:JSON is recursively defined as finite sequences of Unicode UTF-16 code units, IEEE floating point numbers excluding infinities and not-a-numbers, finite sequences of elements of the value space for rdf:JSON, or finite mappings from finite sequences of Unicode UTF-16 code units to elements of the value space for rdf:JSON.
> 
> or
> 
> Definition:  The value space for rdf:JSON is recursively defined as finite sequences of Unicode code points, IEEE floating point numbers excluding infinities and not-a-numbers, finite sequences of elements of the value space for rdf:JSON, or finite mappings from Unicode code points to elements of the value space for rdf:JSON.
> 
> or
> 
> Definition:  The value space for rdf:JSON is recursively defined as finite sequences of Unicode code points, IEEE floating point numbers, finite sequences of elements of the value space for rdf:JSON, or sets of pairs of Unicode code points and elements of the value space for rdf:JSON.

Having the value space defined as sequences of code points does not lend itself to actually using these values in code, similar to how a DOM can be used for HTML or XML. Without sticking to abstract notions, your first definition is closest to what is actually implemented by JSON parsers.

Note that a given rdf:JSON literal expresses a single value, which may be recursive. A sequence of values would be expressed as an array.
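
A recursive value space of this kind can be sketched as a membership test over parsed values. This is only an illustration under stated assumptions: Python strings, lists, and dicts stand in for the abstract strings, arrays, and maps; including null is an assumption here; and in_value_space is a hypothetical name, not something defined in PR 66.

```python
import json
import math

def is_finite_number(value) -> bool:
    """True for numbers that fit in a finite IEEE double."""
    try:
        return math.isfinite(float(value))
    except OverflowError:
        # Integers too large for a double land here.
        return False

def in_value_space(value) -> bool:
    """Recursively check a parsed value against the sketched value space."""
    if value is None or isinstance(value, (bool, str)):
        return True
    if isinstance(value, (int, float)):
        return is_finite_number(value)
    if isinstance(value, list):
        return all(in_value_space(item) for item in value)
    if isinstance(value, dict):
        return all(
            isinstance(name, str) and in_value_space(member)
            for name, member in value.items()
        )
    return False

# One literal, one (possibly nested) value.
print(in_value_space(json.loads('{"a": [1, 2.5, "x", true, null]}')))
# An overflowing number parses to infinity, which is excluded.
print(in_value_space(json.loads("[1E1000]")))
```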

This value space does not aid in SPARQL comparison operators, other than equal. For a notion of greater than or less than an order must be established. To be compatible with exceptions from JSON-LD 1.1, the order probably needs to be based on comparing strings based on the JCS representation of these values. It’s not clear that _any_ use of the value space in RDF requires anything other than the JCS representation.
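
Comparison by canonical string can be sketched as below. Note that this is not a conforming JCS implementation: it uses Python's json.dumps as a rough approximation, which sorts names by code point rather than by UTF-16 code unit and serializes non-integer numbers differently from ECMAScript. The function names are hypothetical.

```python
import json

def canonical_text(value) -> str:
    """Approximate a canonical form: sorted names, no insignificant whitespace."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)

def literal_key(lexical: str) -> str:
    """Map a lexical form to the string used for comparison."""
    return canonical_text(json.loads(lexical))

def json_equal(left: str, right: str) -> bool:
    return literal_key(left) == literal_key(right)

def json_less_than(left: str, right: str) -> bool:
    # Plain string comparison of the canonical forms.
    return literal_key(left) < literal_key(right)

print(literal_key('{ "b": 2, "a": 1 }'))
print(json_equal('{"b": 2, "a": 1}', '{ "a":1, "b":2 }'))
print(json_less_than("[1]", "[2]"))
```

Two lexical forms that differ only in whitespace or member order compare as equal, and ordering falls out of ordinary string comparison.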

> peter
> 

Gregg

[1] https://github.com/w3c/rdf-concepts/pull/66
[2] https://pr-preview.s3.amazonaws.com/w3c/rdf-concepts/pull/66.html#section-json
Received on Friday, 3 November 2023 17:10:17 UTC