Re: [JSON] Constraining JSON serialization discussion from Manu Sporny on 2011-03-25 (public-rdf-wg@w3.org from March 2011)

From: Manu Sporny <msporny@digitalbazaar.com>
Date: Fri, 25 Mar 2011 11:38:32 -0400
To: public-rdf-wg@w3.org
Message-ID: <4D8CB6F8.5080606@digitalbazaar.com>
On 03/24/2011 11:19 PM, Peter Frederick Patel-Schneider wrote:
>> Constraint #1: The grammar that this WG MUST use is defined in RFC4627
> 
> OK.  Your opinion so far, but I think that you are right.
> Let's get a WG decision on this ASAP!

I created a new issue for this:

http://www.w3.org/2011/rdf-wg/track/issues/16

>> Constraint #2: The JSON data model is not defined across all programming
>> languages, and does not need to be in order to be useful for the work in
>> this WG.
> 
> It is not OK to me that the WG does not have a good written-down notion
> of what some piece of JSON means.  

"some piece of JSON means" is vague, you are going to get a vague
response for asking questions about "meaning". Similarly, asking
"What is JSON?" is going to get the same confused responses.

> This doesn't have to be anything
> fancy, by the way, but I remain astonished that there is not some
> generally-agreed-on language-independent notion of what JSON is supposed
> to map to.

This is good enough for me:

"""
JSON is built on two structures:

* A collection of name/value pairs. In various languages, this is
  realized as an object, record, struct, dictionary, hash table, keyed
  list, or associative array.
* An ordered list of values. In most languages, this is realized as an
  array, vector, list, or sequence.
"""

Why is it not good enough for you?
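
To make that concrete, here is a rough sketch of how those two
structures surface in one implementation. It uses Python's standard json
module; the sample document and the variable names are made up for the
example, and other languages will pick different native types.

```python
import json

# A made-up document that uses both JSON structures.
text = '{"name": "Peter", "tags": ["rdf", "json"]}'
doc = json.loads(text)

# The collection of name/value pairs becomes a dict...
print(type(doc).__name__)          # dict
# ...and the ordered list of values becomes a list.
print(type(doc["tags"]).__name__)  # list
print(doc["tags"][0])              # rdf
```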

>> According to RFC4627 it is valid, however many of the programming
>> languages use associative arrays to store their values, which require
>> unique keys. We MUST NOT depend on this functionality, it won't work
>> across all of the popular JSON implementations.
> 
> Is the WG supposed to be doing something that works across all of the
> popular JSON implementations? 

Yes.

> What are they? 

The most popular implementations for all of these languages - Java, C,
C++, C#, PHP, Python, Visual Basic, Objective-C, Perl, JavaScript and Ruby.

Those languages come from the TIOBE index:

http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html

We're not going to have time to check every implementation in the time
that we have. This is why a test suite is vital. I do not think we need
near-perfect coverage to be successful.

> Should the WG
> investigate to see what data model these implementations impose on JSON? 

The people who have worked with JSON serializations for a few years
already know the answer to this for JavaScript, Python, C++ and a
handful of other languages. JavaScript is the most important here.

If we have the time, yes, but most people already know what data model
is imposed on the JSON serialization. I have a feeling that we could
divide and conquer on this... but I also think that after the 5th or 6th
implementation, the returns diminish quickly. I think we already know
the answer to this question, but it may not hurt to put up a
Sandro-chart on the wiki.

Also: I like how, in my mind, Sandro is now credited with the invention
of 2-dimensional charts.

>>> Is this actually allowed in JSON?  If so, where is it stated?
>>
>> It is allowed per RFC4627, Section 2.5: Strings
> 
> Sure, spaces are allowed in strings.  
> But do the popular JSON implementations treat spaces in names
> as a shorthand?

No, which is why I think that this is a bad idea.

> This would be a very big surprise to me!  
> What other surprises are there?

For you, probably a few more. For me, not that many. For someone else,
who knows!?

I do see where you're going with this - how do we know what will and
will not be a surprise for developers? Like all programming language
design, we're going to have to make a few educated guesses and get wide
feedback on the spec from developers that use JSON serializations regularly.

We shouldn't let this prevent us from thinking that we can tackle this
problem, however.

> How can anyone tell what should be considered to be a surprise?

Personal context plays a very big part. That's why we're working on this
as a group with a diverse range of interests. That's why W3C has a
review process for these specs.

>>> The reviver, replacer, and space optional arguments appear to be able
>>> to greatly affect the situation.  Are these also part of the JSON 
>>> specification?
>>
>> They are a part of ECMA, and we won't have to ever mention them in the
>> spec we're creating. If we find that we do, we've screwed up.
>>
>>> Does the WG have to take them into account?
>>
>> No.
> 
> But they are part of a popular JSON implementation, so shouldn't the WG
> take them into account?

They are implementation details that largely do not have an effect on
how you program with de-serialized JSON data in JavaScript. Many people
use this data without knowing what the reviver/replacer stuff does. To
put it another way, we should only pay attention to the stuff that
affects how programmers work with this serialization.

I do not think the reviver/replacer stuff fits into this category of
"things we should worry about" yet.

>>> Could the WG exploit them?
>>
>> In general - No. We should not do anything "fancy" or "exploit"ive with
>> JSON. There is a very high likelyhood that this will rathole the
>> conversation.
> 
> But what is fancy and should be avoided?  I was somewhat suprised that
> colons in names might be considered to be fancy!

Colons in names are something that we need to come to consensus on as
a group. I think they should be allowed; Nathan disagrees. I don't know
if anyone else really holds a strong position one way or the other.
Clearly this is a discussion we need to have.

>> The same way that JSON folks talk about round-tripping today. In it's
>> simplest form, you serialize something to JSON when you receive a GET
>> and send it out. If you get the same serialization back via a POST - the
>> two parties have accomplished a round-trip. 

I'm assuming that your original question of "is there a notion of
round-tripping in JSON?" was answered and that you've moved on to
another question... more below.

> How often is this (very strong) notion of successful round-tripping
> achieved?  This requires outputing the same whitespace and the same
> order of object members.

Not very often. The more common case is that the keys are shuffled
(because objects are typically implemented as hashtables), or that
whitespace is compressed. So one serialization might do this:

{ "name" : "Peter", "foo": "bar" }

another may do this:

{ "foo": "bar", "name" : "Peter" }

and another may do this:

{"foo":"bar","name":"Peter"}

Three different serializations, but the content hasn't changed - so the
round-trip still succeeds. There are a few exceptions to this (mostly
because there are corner cases and bugs in particular implementations).
For instance, the PHP json_encode() function used to backslash-escape
characters that do not need to be escaped, such as forward slashes. This:

{ "homepage": "http://example.org/peter" }

is serialized in some versions of PHP like this:

{ "homepage": "http:\/\/example.org\/peter" }

However, those are implementation details and are easily handled with a
few words about it in the specification.
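
As a sketch of that weaker notion of a round-trip, the comparison can be
done on the parsed data rather than on the characters. This uses Python's
standard json module; canonical() is a hypothetical helper I made up for
illustration, not something any spec defines.

```python
import json

# The three serializations from above, plus the PHP-style escaped slashes.
a = '{ "name" : "Peter", "foo": "bar" }'
b = '{ "foo": "bar", "name" : "Peter" }'
c = '{"foo":"bar","name":"Peter"}'
plain = '{ "homepage": "http://example.org/peter" }'
escaped = r'{ "homepage": "http:\/\/example.org\/peter" }'

# Different character sequences...
print(a == b, b == c)  # False False
# ...but the same data once parsed.
print(json.loads(a) == json.loads(b) == json.loads(c))  # True
print(json.loads(plain) == json.loads(escaped))         # True


def canonical(text):
    """Hypothetical helper: re-serialize with sorted keys, no extra whitespace."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))


print(canonical(a))  # {"foo":"bar","name":"Peter"}
```

Sorting the keys is just one way to get a stable character sequence; it
is an assumption of this sketch, not a proposal for the spec.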

>> Conformant implementations
>> are not supposed to change values between serializing and deserializing
>> (doubles, integers, booleans, etc.). 
> 
> Where did doubles and integers come into the mix?  I thought that JSON
> only had a single generic numeric, and didn't even have boolean as a
> syntactic category.

I want to make sure that we're not conflating the "JSON serialization
(RFC4627)" with "how the JSON serialization is interpreted in various
programming languages". I'm talking about the latter; you seem to
think that I'm talking about the former.

I can't answer your question until you are more clear about what context
you would like your question to be answered in - "JSON serialization
(RFC4627)" or "an implementation of a JSON parser, and a round-trip, in
programming language X".

The JSON serialization defines numbers like this:

http://tools.ietf.org/html/rfc4627#section-2.4

These are all valid numbers in the JSON serialization:

5, -5, 5.5, 5e5, 5.4321E-555555555555555555555555555555555555

They can be represented in a number of different ways in various
programming languages and machine architectures.

The JSON serialization (RFC4627) does outline 'false' and 'true' as
valid values, see:

http://tools.ietf.org/html/rfc4627#section-2.1
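
To illustrate the "interpreted in a programming language" side, here is a
sketch of what one parser (Python's standard json module) does with those
literals. The array wrapper is my addition, since RFC4627 expects an
object or array at the top level. The types you get are a property of
this implementation, not of the serialization.

```python
import json

# The number literals from above, plus the two boolean literals.
text = ('[5, -5, 5.5, 5e5, '
        '5.4321E-555555555555555555555555555555555555, true, false]')
values = json.loads(text)

# Print the native type and value each literal was mapped to.
for value in values:
    print(type(value).__name__, repr(value))
```

With CPython I would expect int for the first two, float for the next
three (the last one too small to survive as anything but zero), and bool
for the final two. A different language or library may well choose
differently.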

>> This is a snag point in some cases
>> of serializing RDF values to JSON that we'll have to be careful with.
>> However, we can have that discussion without knowing what the object
>> model is, or by placing reasonable constraints on the object model if
>> necessary.
> 
> OK, but what are these "reasonable" constraints?

That question is too vague to answer because it lacks context. I could
say: That is for this WG and the reviewers to decide. Or I could say: A
reasonable constraint is one that doesn't cause regular JavaScript
developers to recoil in horror. Neither of those answers your question.

>> Yes, but doing this would be incredibly short-sighted of us, for all of
>> the reasons outlined. Not to mention that no JSON implementation would
>> do the right thing in this scenario, so even if we were clever - it
>> wouldn't work in all of the most common languages with JSON parsers.
> 
> Again, what is our target here?

If I had to pick one, it would be JavaScript developers that are
familiar with the JSON serialization format.

If I had to pick more than one, it would be Web developers that are
familiar with the JSON serialization format across a number of popular
programming languages as listed in the TIOBE index.

>>> To understand JSON this way is extraordinarily difficult and
>>> expensive, requiring deep knowledge of the innards of EMCAScript.
>>
>> You are asking questions that require a deep knowledge of the innards of
>> ECMAScript, or a few months of JavaScript programming experience.
> 
> No, or at least I hope not.  I thought that I was asking what the
> meaning of JSON was.  

You are asking a "What is the meaning of..." question. Here are a few
others:

"What is the meaning of RDF?"
"What is the meaning of OWL?"
"What is the meaning of XML?"

There are at least two approaches to answering those questions. One is
where you assume that the person asking isn't interested in the details,
so you gloss over them. The other is where you assume that the person
asking them is interested in the details, so you give them the firehose.

You got the firehose because the question you asked was vague and,
given your position in the context of this WG, required answers
containing all of the details.

>> At this point in your responses you get increasingly wary of JSON
>> because you're attempting to learn JavaScript simultaneously. 
> 
> Only because that's were the documents lead to.  I don't care (in this
> context, at least) about JavaScript.

Ah, but you do care about JavaScript because you want to learn why some
of this stuff is "surprising". It's unfortunate that that's where the
documents lead to... but that doesn't mean that this stuff doesn't make
sense and doesn't, in general, work.

>> You are
>> looking at a programming language specification in order to understand
>> /something/. I don't quite know what you're attempting to understand,
> 
> JSON, no more than that!

Too vague. Exactly what about JSON do you want to know? More about the
serialization grammar? How it is used by a particular implementation in
a particular programming language? Why the colon syntax is viewed as
problematic by some?

>> and start
>> asking more concrete questions. 
> 
> ... but this is where following my nose got me.

I understand, and that's unfortunate because I think it has done more
harm than good. I've seen this happen before with a number of other
people who expect the JSON specs to work like other Internet Standards
do. It looks like a duck, it quacks like a duck, but it isn't a duck.

>> We could spend a month discussing why
>> parts of the spec are written the way that they are, or how certain
>> scenarios don't make sense if you hold a certain world view. I have a
>> feeling that most of that is not going to help this group come to grips
>> with the serialization aspect. You may learn a great deal of things that
>> are not applicable to what we're attempting to accomplish here.
> 
> Well, here is where we part company.  I feel that I need to know what
> JSON syntactic structures are supposed to mean.  I asked a few
> questions, was pointed to the EMCA spec, and then ended up in the depths
> of EMCAScript parsing.  I'm certainly willing not to have to look at
> this document, but what else is there?

This:

"""
JSON is built on two structures:

* A collection of name/value pairs. In various languages, this is
  realized as an object, record, struct, dictionary, hash table, keyed
  list, or associative array.
* An ordered list of values. In most languages, this is realized as an
  array, vector, list, or sequence.
"""

... but I'm just shooting in the dark now. I don't know what you mean by
"I need to know what JSON syntactic structures are supposed to mean". Do
you mean philosophically? Do you mean how do programmers work with these
data structures in language X? Do you mean what is the lexical space and
the value space? "mean" could mean many things. That you continue to ask
the question makes me think that you are still searching for some answer
that hasn't been covered yet. Try rephrasing your question and you may
get a different answer.

>>> Well, I, for one, find it hard to work on standardizing against
>>> anything when I don't know the target.
>>
>> The target serialization format is RFC4627. ECMA absolutely SHOULD NOT
>> be the target, but should inform our direction. 
> 
> So then I should understand it!

It depends on what you're trying to understand. What is "it"?

>> ECMA is just ONE example
>> of how JSON works with a programming language's data model. There are
>> additional ones for Python, Ruby, C++, C, Haskell, Java, PHP, Perl, Lua,
>> Clojure and many other languages and data models.
> 
> Is there any commonality between them?  If so, what is it?  If not, then
> what can the WG do?

Yes, there is a commonality between them: "objects" and "arrays". I've
placed the text that should result in a eureka moment for you earlier in
this e-mail. If it doesn't, then I don't know what you're asking.

>> The person that responded to you is not correct. JSON objects often do
>> re-serialize as different sequences of characters. That, however, does
>> not mean that the data that they represent changes - often it does not.
>
> Often?  This sounds very scary to me.  I would be much happer with
> "almost always" and even happier with "always except ...".

I said often because there are buggy implementations. In non-buggy
implementations, it's "almost always". I only say "almost always"
because I don't claim to know everything about every JSON
serializer/deserializer implementation in every language that has a
serializer/deserializer. I would never say "always" in the response to
your question.

>>> - colons in names
>>
>> Allowed per RFC4627.
> 
> But claimed to be problematic because of dot notation.

Those responses were strictly in reference to RFC4627. You have switched
to talking about how the JSON serialization is interpreted and stored in
JavaScript's language model. That's fine, but I want to make sure that
you know that we've switched to a different topic.

Dot notation is specific to languages that expose the deserialized JSON
data as objects. With JavaScript, you can do this:

var j = JSON.parse("{\"num\": 5}");
// The next line will give you the number 5
j.num

With Python and simplejson, you must do this instead:

import simplejson as json
j = json.loads('{"num": 5}')

# You can't do the same thing that you do in JavaScript
j.num
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'dict' object has no attribute 'num'

# However, this works
j["num"]
5

So, colons-in-names matter if you want to use dot-notation in
JavaScript, but they do not matter for Python and simplejson because you
can't use dot-notation.
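
A small sketch of the same point with a colon in the name, again in
Python with the standard json module. The "foaf:name" key and the
example.org URI are made up for illustration. The bracket form also
works in JavaScript; it is only the dot form that needs the name to be a
valid identifier.

```python
import json

# A made-up document with a prefixed name and a URI as a name.
text = '{"foaf:name": "Peter", "http://example.org/vocab#age": 5}'
j = json.loads(text)

# Subscript access does not care which characters the name contains.
print(j["foaf:name"])                     # Peter
print(j["http://example.org/vocab#age"])  # 5

# Attribute-style access is only possible for valid identifiers.
print("num".isidentifier())        # True
print("foaf:name".isidentifier())  # False
```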

>>> - multiple values for properties
>>
>> Allowed per RFC4627, but all popular implementations don't support this
>> feature.
> 
> OK, problematic.  What else is problematic?

That is a vague question. It depends on what you're attempting to
accomplish.
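
On the multiple-values point specifically, here is a sketch of what
happens in one implementation (Python's standard json module) when a
name is repeated. The document is made up, and other parsers may behave
differently, which is exactly why we shouldn't depend on it.

```python
import json

# Accepted by the RFC4627 grammar: the name "a" appears twice.
text = '{"a": 1, "a": 2}'

# The default decoder builds a dict, so only one value survives.
print(json.loads(text))  # {'a': 2}

# Asking for the raw pairs shows that both were present in the text.
print(json.loads(text, object_pairs_hook=list))  # [('a', 1), ('a', 2)]
```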

>>> - spaces in names indicating multiple properties
>>
>> Allowed per RFC4627.
> 
> RFC doesn't say anything about this that I can find.

Spaces in names are allowed per RFC4627. Using spaces in names to
indicate multiple properties is up to the implementation. This is
non-standard, and we should not do it.

>>> - URIs as names
>>
>> Allowed per RFC4627.
> 
> OK, but also claimed to be problematic.

Again, it depends on your audience. It's not black and white.

> What I think is really needed for the WG to proceed much further is at
> least initial drafts for:
> 1/ effective syntax for JSON (RFC4627 gives the reference syntax, but
>    what are the problematic parts of this syntax)

We need to figure out which communities we're attempting to help first
(Richard's 4 use case groupings) before we talk about the syntax. The
syntax that we agree on is highly dependent on which groups we're
attempting to address. In many of these cases, we may avoid certain
discussions entirely based on which groups we decide to address.

Otherwise we end up down a rathole talking about every possibility under
the sun.

> 2/ some notion of the meaning of JSON 

You are not going to get an answer to this question unless you put some
constraints on what you'd like to find out.

-- manu

-- 
Manu Sporny (skype: msporny, twitter: manusporny)
President/CEO - Digital Bazaar, Inc.
blog: Payment Standards and Competition
http://digitalbazaar.com/2011/02/28/payment-standards/
Received on Friday, 25 March 2011 15:39:06 UTC