Re: Embedded Content from Benjamin Young on 2014-10-15 (public-annotation@w3.org from October 2014)

From: Benjamin Young <bigbluehat@hypothes.is>
Date: Wed, 15 Oct 2014 08:34:44 -0400
To: Nick Stenning <nick@whiteink.com>
Cc: public-annotation@w3.org
Message-ID: <CAE3H5F+VX4NXRQO9ktWdnjAwLRrvUpnw_BQtSPD+ZMCA7Lghew@mail.gmail.com>

On Wed, Oct 15, 2014 at 7:37 AM, Nick Stenning <nick@whiteink.com> wrote:

> On Wed, Oct 15, 2014, at 07:07, Ivan Herman wrote:
> > Thanks for the pointer, Nick. I didn't realize it was that messy...
> >
>
> It was Randall that pointed out the mess, not me!
>
> That said, the article Randall linked to is about JavaScript's internal
> string encoding, which is -- as the article discusses -- a bizarre
> halfway house between UCS-2 and UTF-16.
>
> That shouldn't (AFAIK) affect the issue of mandated encodings for
> embedded content. User agents can still write Unicode text from
> JavaScript onto the wire as UTF-8.
>
> As I understand it, the use case for embedding is as follows:
>
> "For small annotation bodies, the overhead associated with creating a
> concrete resource elsewhere on the web is unacceptable, so we want a way
> to embed sufficiently small bodies in the Annotation resource itself."
>
> If embedded bodies will be small, the advantages of UTF-16 over UTF-8
> for Asian texts will be minimal, and thus I'd be in favour of omitting
> the character encoding and mandating UTF-8.
>

If JSON-LD is a prime target for representation (which it seems to be),
picking UTF-8 seems prudent.
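
As a rough check on the size argument quoted above, here is a minimal
Python sketch that compares the encoded length of two short bodies. The
sample strings and the helper name `encoded_sizes` are made up for
illustration; they are not taken from any spec or from the data model.

```python
# Hypothetical small annotation bodies; the strings are illustrative only.
samples = {
    "english": "Nice point!",
    "chinese": "很好的观点",
}


def encoded_sizes(text: str) -> dict:
    """Return the byte length of `text` in UTF-8 and in UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


for name, text in samples.items():
    print(name, encoded_sizes(text))
```

For the five-character Chinese sample this gives 15 bytes in UTF-8 versus
10 bytes in UTF-16, so the absolute saving on a small body is a handful of
bytes, while the ASCII sample doubles in size under UTF-16.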

Here's a quote from RFC 7159 (the latest JSON spec):

   JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default
   encoding is UTF-8, and JSON texts that are encoded in UTF-8 are
   interoperable in the sense that they will be read successfully by the
   maximum number of implementations; there are many implementations
   that cannot successfully read texts in other encodings (such as
   UTF-16 and UTF-32).

Cite: http://tools.ietf.org/html/rfc7159#section-8.1

Another spec now in progress (also by Tim Bray) introduces the
idea of I-JSON (or "Internet JSON"), which essentially outlines and
codifies the interoperability recommendations in RFC 7159. The key
piece for this conversation is:

   I-JSON messages MUST be encoded using UTF-8
   [RFC3629 <https://tools.ietf.org/html/rfc3629>].

Cite: https://tools.ietf.org/html/draft-ietf-json-i-json-03#section-2.1

So, while UTF-16 is "allowed for", it seems it would only cause
implementation woes, sadly.
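
To illustrate the kind of woe I mean, here is a small Python sketch of a
consumer that assumes UTF-8 on the wire and is handed a UTF-16 payload.
The `annotation` dict and the function name `parse_assuming_utf8` are
hypothetical and only stand in for a real client; this is not how any
particular implementation works.

```python
import json

# Hypothetical annotation with a small embedded body.
annotation = {"body": "很好的观点"}

# The same JSON text serialized in two different encodings.
utf8_bytes = json.dumps(annotation, ensure_ascii=False).encode("utf-8")
utf16_bytes = json.dumps(annotation, ensure_ascii=False).encode("utf-16")


def parse_assuming_utf8(payload: bytes) -> dict:
    """Decode strictly as UTF-8, as a UTF-8-only consumer would."""
    return json.loads(payload.decode("utf-8"))


# The UTF-8 payload round-trips without trouble.
print(parse_assuming_utf8(utf8_bytes))

# The UTF-16 payload starts with a byte order mark, which is not valid
# UTF-8, so the strict decode fails before JSON parsing even begins.
try:
    parse_assuming_utf8(utf16_bytes)
except UnicodeDecodeError as err:
    print("UTF-16 payload rejected:", err.reason)
```

A consumer could of course sniff the encoding first, but every client
would then need to do so, which is the interoperability cost RFC 7159
warns about.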

> That said, I will happily reverse my position if someone has evidence
> that omitting support for UTF-16 in embedded bodies will negatively
> affect adoption of our standard in China/Thailand/etc.
>

Likely the developers in that area are accustomed to the "fun" that is
UTF-* and are used to UTF-8 content limitations in particular.

Perhaps a stance such as "SHOULD be encoded using UTF-8, but MAY be encoded
in UTF-16 or UTF-32 where limited interoperability is less of a concern"
would work. Terribly vague, but you get the idea, hopefully. :)

Thanks!
Benjamin
--
Developer Advocate
http://hypothes.is/

Received on Wednesday, 15 October 2014 12:35:13 UTC