
Re: HTML Entities and Escaping in JSON-LD Literals

From: Gregg Kellogg <gregg@greggkellogg.net>
Date: Fri, 19 Jun 2015 09:42:55 -0700
Cc: "schema.org Mailing List" <public-schemaorg@w3.org>, W3C Web Schemas Task Force <public-vocabs@w3.org>, Manu Sporny <msporny@digitalbazaar.com>
Message-Id: <AE96E032-C9F4-4212-B2EA-74A99773F3D3@greggkellogg.net>
To: mfhepp@gmail.com
> On Jun 19, 2015, at 3:01 AM, mfhepp@gmail.com wrote:
> 
> Dear all:
> 
> I think we need to clarify in the documentation of schema.org whether HTML entities and numeric HTML encodings of a Unicode character in literals, namely text, should/can be kept as they are or need to be unescaped inside JSON-LD values. I assume the answer might be different for
> 
> a) stand-alone JSON-LD documents and 
> b) when JSON-LD is embedded inside HTML via <script> elements.
> 
> In particular, I would like to know whether they must, should, and can be left in their HTML-encoded forms.
> 
> Literals provided by backend databases will often be encoded for HTML environments and, e.g., contain HTML entity encodings like &amp; for the ampersand character or numeric HTML encodings of a Unicode character, like &#160; for a non-breaking space.
> 
> Developers will often face the task of reusing a template variable that contains such escaped characters in JSON-LD code in <script> elements.

The issue of escaping text in a script tag is not specific to JSON-LD; it is the HTML spec that needs to be considered [1]. There is no indication there that entities are treated any differently. The spec recommends escaping character sequences which might otherwise terminate the script tag. My practice is to surround the content with <!-- -->, which is removed when parsed. For XHTML, using <![CDATA[ … ]]> is necessary. These escapes are removed by my RDFa parser before handing the content off to JSON-LD, which is how the Linter handles it.
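
A minimal sketch of what this can look like on the producer and consumer side, in Python. It assumes the backend value was encoded for HTML and that the intended literal is the decoded text, since character references are not decoded inside an HTML script element. The function names and the sample value are hypothetical, and escaping every "<" as \u003c is just one conservative way to avoid terminating the script tag, not something the HTML or JSON-LD specs mandate.

```python
import html
import json


def to_script_safe_json(data):
    """Serialize data as JSON that cannot terminate an HTML <script> element."""
    text = json.dumps(data, ensure_ascii=False)
    # In JSON, "<" can only occur inside strings, so replacing it with the
    # \u003c escape keeps the document valid and removes "</script" and "<!--".
    return text.replace("<", "\\u003c")


def strip_script_wrappers(content):
    """Remove a surrounding <!-- --> or <![CDATA[ ]]> wrapper, if present."""
    content = content.strip()
    for start, end in (("<!--", "-->"), ("<![CDATA[", "]]>")):
        if content.startswith(start) and content.endswith(end):
            return content[len(start):-len(end)].strip()
    return content


# Hypothetical template variable that was encoded for HTML output.
db_value = "Fish &amp; Chips&#160;</script>"

doc = {
    "@context": "http://schema.org",
    "@type": "Thing",
    # Decode entities first (assumption: the decoded text is the intended value).
    "name": html.unescape(db_value),
}

embedded = to_script_safe_json(doc)
script_element = '<script type="application/ld+json">' + embedded + "</script>"
```

A consumer that receives wrapped content would call strip_script_wrappers() on the script text before passing it to json.loads(); the \u003c escapes are resolved by the JSON parser itself.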

> The Google Structured Data Testing Tool seems pretty tolerant of this, but I would like to know the proper way of encoding text in JSON-LD values.
> 
> The only guidance I found online was the simple statement
> 
>    "Depending on how the HTML document is served, certain strings may need to be escaped."
> 
> in 
> 
>    http://www.w3.org/TR/json-ld/
> 
> To make things more complicated, it seems that JSON-LD introduces novel escaping requirements for <, >, @ and ^:
> 
>    http://json-ld.org/spec/ED/json-ld-syntax/20100529/#escape-character

That’s a reference to a very old draft of JSON-LD (~2010). The published specification includes no such escaping rules.

Gregg

[1] http://www.w3.org/TR/html/scripting-1.html#restrictions-for-contents-of-script-elements

> Does anybody know a definite reference for this?
> 
> Best wishes
> 
> Martin
> 
> -----------------------------------
> martin hepp  http://www.heppnetz.de
> mhepp@computer.org          @mfhepp
> 
Received on Friday, 19 June 2015 16:43:26 UTC
