Re: RDF-ISSUE-129 Re: json-ld-api: change proposal for handling of xs:integer

Just a couple short comments inline, for now.

On 05/13/2013 01:20 PM, Markus Lanthaler wrote:
> On Monday, May 13, 2013 11:25 AM, Gregg Kellogg wrote:
>> On May 13, 2013, at 4:36 AM, Sandro Hawke <sandro@w3.org> wrote:
>>
>>> [this is really two related issues -- one about xs:integer, the
>>> other about xs:double, in JSON-LD.]
>>> On 05/12/2013 09:45 PM, Manu Sporny wrote:
>>>> On 05/10/2013 06:31 PM, Sandro Hawke wrote:
>>>>> I believe we can fix this by saying that, in situations where
>>>>> there might be a loss, one MUST NOT convert to a number.
>>>> We didn't do this because the range for a JSON number isn't defined
>>>> anywhere.
> Right. JSON-LD the data format doesn't have this issue as it has an unlimited
> value space. So it's really just problematic for systems converting those
> strings (even the things without quotes are strings on the wire) to numbers.
>
>
>>>>> It's true we don't know exactly when there might be a loss, but
>>>>> after talking with Markus, I'm pretty confident that using the
>>>>> range of 32-bit integers will work well.
>>>> ... except that most systems support 64-bit numbers, and we'd be
>>>> hobbling those systems. :/
> And the problem is still there for 16-bit or 8-bit systems. That might not
> matter much in practice, but in a couple of years the 32-bit limit won't
> matter anymore - just as the 16-bit and 8-bit limits don't matter much today.
>
>
>>> Yes, but I'm not sure the demand is *that* great for efficient
>>> handling of integers outside the range of 32 bits.  We're hobbling
>>> their handling of numbers in the range of +/- (2^31...2^53), for the
>>> most part.
>>> But yes, there is a tradeoff of efficiency against correctness.
>>>
>>> I can't help wondering how the JSON standards community thinks about
>>> this.  It seems like a huge problem when transmitting JSON to not know
>>> if bits will be dropped from your numbers because the receiving system
>>> is using a different-from-expected representation of numbers.
> Typically large numbers are represented as strings. Twitter ran into that
> problem when their tweet IDs crossed 53 bits a couple of years ago. They now
> serialize each ID as both a number and a string (id and id_str), see:
>
> https://dev.twitter.com/docs/twitter-ids-json-and-snowflake
>
> In JSON-LD we have a way to add a type to such a string-number, so that
> shouldn't be a big problem.
>
>
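
For concreteness, that might look like this (hypothetical vocabulary);
the 64-bit ID travels as a typed string, so no JSON parser ever has to
represent it as a native number:

   {
     "@context": {
       "xsd": "http://www.w3.org/2001/XMLSchema#",
       "id": "http://example.org/vocab#id"
     },
     "id": { "@value": "420938523576119296", "@type": "xsd:integer" }
   }
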
>> The point of being able to use native numbers in JSON is that this is
>> much more convenient for JSON developers to use than strings, which
>> might still need to be evaluated. But it is impossible to do this for
>> every possible integer. I think that restricting this to 32 bits is a
>> reasonable restriction, given the limitations of important JSON
>> parsers, but requiring the use of BigInteger-like libraries should be
>> considered.
> We need to distinguish between the data format (the thing on the wire) and
> processors. On the wire the range and precision is unlimited. Processors
> converting that to some native type of course have limitations but as Gregg
> said that limit can be stretched quite far these days... even though it
> makes implementations much more complicated as off-the-shelf JSON parsers
> don't do this (yet). PHP, for example, allows parsing large numbers into
> strings so that nothing is lost (except the info that it was a number and
> not a string).
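
(Roughly: given the JSON text { "id": 420938523576119296 } on the wire,
such a parser hands the application { "id": "420938523576119296" } --
the value survives, but the fact that it was a number does not.)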

Losing the fact that it was a number and not a string == corrupted data.

>
>>>> We might want to put in guidance that moves the decision to the
>>>> processor (it can detect when a conversion would result in data
>>>> loss).
>>>> Perhaps it should be up to the implementation to determine when data
>>>> could be lost.
> That would be my preferred solution.
>
>
>>> The problem is:
>>>
>>> step 1:  64-bit server pulls data out of its quadstore and
>>> serializes it as JSON-LD
>>> step 2:  Server sends that JSON-LD to client
>>> step 3:  32-bit client uses that data.
>>>
>>> If the server is using native json numbers, and some number is in the
>>> 2^31...2^53 range, then the client will silently parse out the wrong
>>> number.    That's a pretty bad failure mode.    I'm not sure whether
>>> people will react by:
>>>   - not using native json numbers for that range (as I'm suggesting)
>>>   - insisting that clients handle json numbers the same as the server
>>>     does (somehow)
>>>   - not using native json numbers at all
>>>   - not using json-ld at all
>>>
>>> I suspect if we give no guidance, then we'll find ourselves at the
>>> latter options.
> I don't agree with that reasoning. JSON does exactly the same and I haven't
> heard of people stopping using it because of that. Yeah, in some cases it
> might be better to serialize numbers as strings, but in contrast to JSON,
> JSON-LD allows adding a datatype - so it won't be an opaque string as in
> JSON.
>
>
>> Prefer the second option, but could live with the first.
>>
>>>>> I'd also add:
>>>>>
>>>>> "1"^^xs:int              // not native since it's 'int' not
>>>>> 'integer' "01"^^xs:integer     // not native since it's not in
>>>>> canonical form
>>>> +1
> So we are just converting numbers in *canonical* lexical form? Would be fine
> with that.
>

If we want perfect round-tripping, yes, we have to convert only those
numbers which happen to be in canonical form.
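
Spelled out in the notation used elsewhere in this thread (a sketch,
not spec text):

   "1"^^xsd:integer   --- fromRDF() --->  1
   "01"^^xsd:integer  --- fromRDF() --->  {"@value": "01", "@type": "xsd:integer"}

i.e. only values in the canonical lexical form become native numbers;
everything else stays expanded.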

>>>>> These rules will make xs:integer data round-tripping through
>>>>> JSON-LD perfectly lossless, I believe, on systems that can handle
>>>>> at least 32-bit integers.
>>>> Yeah, but I'm still concerned about the downsides of limiting the
>>>> number to 32-bits, especially since most of the world will be using
>>>> 64-bit machines from now on.
> Me too... and in a couple of years the same will be true of 64-bit.
>
>
>>> Another option is to say JSON-LD processors MUST retain at least 53
>>> bits of precision on numbers (my second option above), but Markus
>>> tells me PHP compiled for 32-bit hardware, and some C JSON parsers,
>>> won't do that.
> -1, that will make it impossible to implement conformant JSON-LD processors
> on certain platforms.
>
>
>> Likely, languages with these limitations have some kind of BigInteger
>> implementation; if so, we could consider using the 64-bit space.
>>
>>>> I do agree that we might be able to change the text to ensure that
>>>> precision loss isn't an issue, and I do agree with you that it's
>>>> definitely worth trying to prevent data loss.
>>>>
>>>> Tracking the issue here:
>>>>
>>>> http://lists.w3.org/Archives/Public/public-rdf-wg/2013May/0136.html
>>>>
>>>>> On a related topic, there's still the problem of xs:double.  I
>>>>> don't have a good solution there.   I think the only way to prevent
>>>>> datatype corruption there is to say don't use native numbers when
>>>>> the value happens to be an integer.
>>>> I don't quite understand, can you elaborate a bit more? Do you mean,
>>>> this would be an issue?
>>>>
>>>> "234.0"^^xsd:double --- fromRDF() ---> JsonNumber(234)
>>> Yes.
> "234.0"^^xsd:double --- fromRDF() ---> JsonNumber(234) --> toRDF
> "234"^^xsd:integer
>
>
>>>   Option 0: leave as-is.   RDF data cannot be faithfully transmitted
>>> through JSON-LD if 'use native numbers' is turned on.
> That's what the flag is for. I'm wondering how other RDF libraries handle
> that!? For example, what happens if you call Jena's getInt() with an
> integer > 32 bits? Will it throw an exception?
> http://jena.apache.org/documentation/javadoc/jena/com/hp/hpl/jena/rdf/model/Literal.html
>
>
>>>   Option 1: in converting RDF to JSON-LD, processors MUST NOT use
>>> native json numbers for xs:double literals whose values happen to be
>>> integers.  Leave them in expanded form.
> That would be a very weird and surprising behavior for most users.
>
>
>>>   Option 2: in converting between RDF and JSON-LD, processors SHOULD
>>> handle the JSON content as a *string*, not an object.  When they
>>> serialize a double, they SHOULD make sure the representation includes
>>> a decimal point.  When they parse, they should map numbers with a
>>> decimal point back to xs:double.   Also, when they parse, they should
>>> notice numbers that are too big for the local integer representation
>>> and keep them in string form.
> Isn't that exactly what useNativeTypes = false does?
>
>
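(Right -- for concreteness, with useNativeTypes=false the double from
the example above stays fully expanded, something like

   "234.0"^^xsd:double --- fromRDF() --->
       { "@value": "234.0", "@type": "http://www.w3.org/2001/XMLSchema#double" }

so the lexical form and the datatype survive a round trip.)
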
>>> FWIW, I hate all of these options.   I can't even decide which I hate
>>> the least.   Seriously hoping someone has a better idea....
>>
>> The point of having the useNativeTypes flag is to address these issues;
>> hobbling all implementations to guarantee no data loss goes against the
>> whole point of using a JSON representation in the first place. The
>> format is optimized for applications.
> I think we should keep in mind that we are primarily designing a data
> format. The data has none of these issues as numbers can be of arbitrary
> size and precision. The problem manifests itself when those numbers are
> converted to some native representation. You have the same problem
> anywhere: plain-old JSON, XML, etc. I think we should just add a note or
> something highlighting the problem and explaining the various approaches to
> avoid it.
>
>
>> Any JSON-LD processor can faithfully transform from other RDF formats
>> by turning off the useNativeTypes option; the only thing to consider is
>> whether this guidance needs to be made more prominent and whether we
>> should consider changing the default for that option.
> +1.. don't care much about the default value.
>
>
>> Option 0 preserves the intent of the format the best, but developers
>> should be aware that, for the sake of convenience and utility, it
>> comes with the possibility of round-tripping errors.
> +1, that's how JSON has been successfully used for years.
>
>
>> Option 1 is much more inconvenient for developers, as their code now
>> needs to branch on whether the value is a string or a hash, rather than
>> just counting on its being a number.
> -1, very unintuitive behavior
>
>
>> Option 2 places more of a burden on processor developers. In Ruby, I'd
>> need to always use custom datatypes for numbers to carry around the
>> original lexical representation, but this could be easily lost through
>> intermediate operations. I'd also need a custom JSON parser and
>> serializer to ensure that the serialized form is represented properly;
>> not worth it, IMO.
> Just use useNativeTypes = false if you want that behavior. Requiring
> implementers to write their own JSON parsers is not an option in my opinion.

Sorry, the flag doesn't really help in the scenario I provided above,
where someone is serving JSON-LD to an unknown client.  I would argue
this is the expected, majority use case.  (I guess the other one that
might be common is reading RDF and converting it to JSON-LD for internal
use, as a kind of API to the data?)  And in this use case, if the server
sets useNativeTypes=true, then for some data values the client will get
the wrong RDF value and/or the wrong RDF datatype.

In other words, with useNativeTypes turned on, JSON-LD is not a faithful
RDF syntax.
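
For instance, assuming a client whose JSON parser stores every number as
an IEEE-754 double (as JavaScript's JSON.parse does):

   "9007199254740993"^^xsd:integer     (= 2^53 + 1)
       --- fromRDF(useNativeTypes=true), serialize, parse --->
   9007199254740992

The client silently ends up with a different value, and nothing ever
signals an error.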

Given that -- which you all seem invested in -- maybe we should go all 
the way and convert all RDF numeric literals to native JSON numbers.  It 
makes the lossy-but-convenient conversion even more convenient and lossy 
in a less-surprising way.    Rather than weirdly having *some* doubles 
turned into integers in the rdf->json->rdf round trip, we'd just have (I 
propose) EVERY numeric literal turned into an xs:double.   Certainly 
that's what pretty much every JavaScript coder would want/expect with 
useNativeTypes=true.
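
Under that proposal the mapping would at least be uniform, something
like:

   "42"^^xsd:integer --- fromRDF() ---> 42 --- toRDF() ---> "4.2E1"^^xsd:double

Lossy, but predictably so.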

I'd also suggest we say that people SHOULD NOT publish JSON-LD with
json-native numbers in it, unless they're fine with them being
understood in a platform-dependent way.

       -- Sandro

>
>
> --
> Markus Lanthaler
> @markuslanthaler
>
>
>
