Re: Trailing content in JSON-LD from Gregg Kellogg on 2015-08-23 (public-linked-json@w3.org from August 2015)

From: Gregg Kellogg <gregg@greggkellogg.net>
Date: Sun, 23 Aug 2015 10:34:06 -0700
To: Andy Seaborne <andy@seaborne.org>
Cc: Linked JSON <public-linked-json@w3.org>
Message-Id: <AFFCC140-40E2-4247-BD68-AD99F1CEEC21@greggkellogg.net>
> On Aug 23, 2015, at 3:15 AM, Andy Seaborne <andy@seaborne.org> wrote:
> 
> I'm having trouble pinning down what the spec status is of this input (this is for an issue in jsonld-java).
> 
> Does the trailing content mean it is illegal JSON-LD or not or is it outside the spec altogether in some cases?
> 
> ----------------------
> {
>  "@id" : "http://example/s",
>  "http://example/p" : "str"
> }
> xxxxxxxxx
> ----------------------
> 
> The question is whether the whole input is the "JSON Document" or whether the trailing junk is considered to be outside the JSON Document.
> 
> In the first case, it is a parse error, and any output is undefined.
> In the second case, there would be triples and no parse error.
> 
> I currently think that the spec says this is illegal JSON-LD but the argument is convoluted and relies on the input coming from HTTP.  If it were some other source (a file with a non jsonld extension [tut, tut]), it is unstated.
> 
> The spec chase:
> 
> Section 8 =>
> 
> """
> A JSON-LD document MUST be a valid JSON document as described in [RFC4627].
> 
> A JSON-LD document MUST be a single node object or an array whose elements are each node objects at the top level.
> """

Adding "exclusive of enclosing whitespace".

> RFC4627 is the media type registration for JSON.
> 
> The definition link for "JSON-LD document" is descriptive:
> """
> A JSON-LD document serializes a generalized RDF Dataset [RDF11-CONCEPTS], which is a collection of graphs that comprises exactly one default graph and zero or more named graphs.
> """
> 
> so it does not say, to my reading, that the "JSON-LD document" includes or excludes the content after the "}".

I believe a JSON-LD document is the entire document, and in my case, anyway, the entire document is passed to the Ruby JSON parser, where I expect to see an Array or Object. Non-whitespace trailing characters would be a syntax error.
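
As a rough sketch of that strict behavior (the helper name parse_strict is made up for illustration, and this is not the actual reader code): Ruby's standard json library already rejects non-whitespace trailing characters, so only the Array-or-Object check needs adding.

```ruby
require 'json'

# Hypothetical helper: treat the whole input as a single JSON document.
def parse_strict(input)
  # JSON.parse raises JSON::ParserError on non-whitespace trailing content;
  # leading and trailing whitespace are accepted.
  doc = JSON.parse(input.to_s)
  unless doc.is_a?(Hash) || doc.is_a?(Array)
    raise ArgumentError, "expected a JSON object or array at the top level"
  end
  doc
end

# The input from the original question, without and with the trailing junk.
good = %({\n "@id" : "http://example/s",\n "http://example/p" : "str"\n}\n)
bad  = good + "xxxxxxxxx\n"

parse_strict(good)     # returns the parsed Hash
begin
  parse_strict(bad)
rescue JSON::ParserError => e
  warn "rejected: #{e.message}"
end
```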

> RFC4627 talks about a "JSON text" when defining the media type.
> Because that is the whole of the HTTP body, I think it means that "JSON text" includes everything. Then "MUST be a single node object" applies => it's a parse error.

+1

> Proposed spec fix 1:
> If it said that
> """
> A JSON-LD document MUST be a valid JSON *text* as described in [RFC4627].
> """
> 
> then it would be clearer but still only applies if the media type can be invoked and sometimes it can't (e.g. a stream of chars from a non-HTTP stream).

Yes, I think this is clear.

> A sentence in the grammar explicitly stating that no trailing content is permitted, making it a syntax issue rather than a context issue, would cover all cases.

With the provision that leading and trailing whitespace is permitted.

However, as a practical matter, JSON may be included in an HTML script tag, which could conceivably be in CDATA. Sometimes other non-JSON content (such as a // comment) is also found. Because these are seen in the wild, my reader removes everything preceding the first "{" or "[" and everything trailing the last "}" or "]" to look for a valid JSON document. The specific substitution pattern I use is the following:

input.to_s.sub(%r(\A[^{\[]*)m, '').sub(%r([^}\]]*\Z)m, '')

While this is technically invalid IMO, practically speaking not eating such garbage will break real-world usage (perhaps mostly in schema.org examples). I could see generating an error if this is seen when validating, but otherwise I’m inclined to eat such garbage in my implementation.
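
To make the trade-off concrete, here is a minimal sketch that wraps the same substitution in a helper. The name parse_lenient and the validate: flag are assumptions for illustration only, not part of any released reader.

```ruby
require 'json'

# Hypothetical helper built around the substitution pattern quoted above.
def parse_lenient(input, validate: false)
  text = input.to_s
  # Drop everything before the first "{" or "[" and after the last "}" or "]".
  stripped = text.sub(%r(\A[^{\[]*)m, '').sub(%r([^}\]]*\Z)m, '')
  # In validating mode, anything removed beyond plain whitespace is an error.
  if validate && stripped != text.strip
    raise ArgumentError, "non-whitespace content outside the JSON document"
  end
  JSON.parse(stripped)
end

# A leading // comment and trailing junk are both eaten by default.
doc = parse_lenient(%(// a comment\n{"@id": "http://example/s"}\nxxxxxxxxx\n))
```

With validate: true the same input raises instead of being silently cleaned up. Note that junk which itself contains a brace or bracket is only partly removed by the pattern, so such input would still fail in JSON.parse.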

Gregg

>  Andy
>
Received on Sunday, 23 August 2015 17:34:36 UTC