- From: Benjamin Young <bigbluehat@hypothes.is>
- Date: Wed, 15 Oct 2014 08:34:44 -0400
- To: Nick Stenning <nick@whiteink.com>
- Cc: public-annotation@w3.org
- Message-ID: <CAE3H5F+VX4NXRQO9ktWdnjAwLRrvUpnw_BQtSPD+ZMCA7Lghew@mail.gmail.com>
On Wed, Oct 15, 2014 at 7:37 AM, Nick Stenning <nick@whiteink.com> wrote:

> On Wed, Oct 15, 2014, at 07:07, Ivan Herman wrote:
> > Thanks for the pointer, Nick. I didn't realize it was that messy...
> >
>
> It was Randall that pointed out the mess, not me!
>
> That said, the article Randall linked to is about JavaScript's internal
> string encoding, which is -- as the article discusses -- a bizarre
> halfway house between UCS-2 and UTF-16.
>
> That shouldn't (AFAIK) affect the issue of mandated encodings for
> embedded content. User agents can still write unicode text from
> JavaScript onto the wire as UTF-8.
>
> As I understand it, the use case for embedding is as follows:
>
> "For small annotation bodies, the overhead associated with creating a
> concrete resource elsewhere on the web is unacceptable, so we want a way
> to embed sufficiently small bodies in the Annotation resource itself."
>
> If embedded bodies will be small, the advantages of UTF-16 over UTF-8
> for asian texts will be minimal, and thus I'd be in favour of omitting
> the character encoding and mandating UTF-8.
>

If JSON-LD is a prime target for representation (which it seems to be),
picking UTF-8 seems prudent. Here's a quote from RFC 7159 (the latest JSON
spec):

    JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default
    encoding is UTF-8, and JSON texts that are encoded in UTF-8 are
    interoperable in the sense that they will be read successfully by the
    maximum number of implementations; there are many implementations that
    cannot successfully read texts in other encodings (such as UTF-16 and
    UTF-32).

Cite: http://tools.ietf.org/html/rfc7159#section-8.1

And another spec now in progress (also by Tim Bray) introduces the idea of
I-JSON (or "Internet JSON") which essentially outlines and codifies the
interoperability recommendations in RFC 7159. The key piece of which for
this conversation is...

    I-JSON messages MUST be encoded using UTF-8
    [RFC3629 <https://tools.ietf.org/html/rfc3629>].

Cite: https://tools.ietf.org/html/draft-ietf-json-i-json-03#section-2.1

So...while UTF-16 is "allowed for" it seems it would only cause
implementation woes...sadly.

> That said, I will happily reverse my position if someone has evidence
> that omitting support for UTF-16 in embedded bodies will negatively
> affect adoption of our standard in China/Thailand/etc.
>

Likely the developers in that area are accustomed to the "fun" that is
UTF-* and are used to UTF-8 content limitations in particular.

Perhaps a stance such as "SHOULD be encoded using UTF-8, but MAY be encoded
in UTF-16 or UTF-32 where limiting interoperability is less of a concern."

Terribly vague, but you get the idea, hopefully. :)

Thanks!
Benjamin
--
Developer Advocate
http://hypothes.is/

>
> -N
>
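
To put rough numbers on the size argument above, here is a minimal Python sketch that compares byte counts for a short piece of CJK text, alone and wrapped in a small JSON body. The body keys ("type", "chars") and the helper names are hypothetical, chosen only for illustration; they are not taken from the annotation data model or from any message in this thread.

```python
import json

# Little-endian variants are used so that no byte-order mark is counted.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(text):
    """Return the byte length of `text` in each encoding."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


def serialize_body(chars):
    """Serialize a hypothetical embedded body as JSON text (a str, not bytes)."""
    body = {"type": "text", "chars": chars}  # hypothetical keys, illustration only
    # ensure_ascii=False keeps non-ASCII characters literal instead of \uXXXX.
    return json.dumps(body, ensure_ascii=False)


if __name__ == "__main__":
    sample = "日本語"
    # The bare text: UTF-16 is smaller than UTF-8 for these characters.
    print(encoded_sizes(sample))
    # The same text inside a small JSON wrapper: the ASCII syntax and keys
    # double in size under UTF-16, so UTF-8 comes out smaller overall.
    print(encoded_sizes(serialize_body(sample)))
```

For a body this small, the ASCII-only JSON punctuation and keys outweigh the text itself, which is consistent with the point that UTF-16's advantage is minimal for small embedded bodies. Larger bodies made up mostly of CJK text would shift the balance.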
Received on Wednesday, 15 October 2014 12:35:13 UTC