Re: Embedded Content

Hi, folks–

I was wrong, and Nick and Benjamin were right.

I suspected as much, but confirmed it with W3C's internationalization 
expert, Richard Ishida, who said:

[[
http://www.w3.org/TR/encoding/#names-and-labels

"New protocols and formats, as well as existing formats deployed in new 
contexts, must use the utf-8 encoding exclusively. If these protocols 
and formats need to expose the encoding's name or label, they must 
expose it as "utf-8". "

See also http://www.w3.org/International/questions/qa-choosing-encodings

Bottom line: use utf-8 everywhere, avoid everything else if you can.

UTF-8 can address all characters in Unicode.

Nick is right that "UTF-16 may be more space efficient for Japanese, 
Chinese, Thai, characters", but it has other problems, such as byte 
order. On the other hand, utf-8 has a huge win because it only requires 
1 byte for ASCII, (UTF-16 requires 2) and exactly the same byte as 
ASCII, so the majority of code is both compatible and compact.
]]
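
To make the size and byte-order points in the quoted reply concrete, here is a minimal Python sketch. It is an illustration only and was not part of Richard's reply; the helper name `encoded_sizes` and the sample strings are made up for this example.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    # "utf-16-le" fixes the byte order and emits no byte order mark (BOM).
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Hypothetical sample strings: one ASCII-only, one Japanese.
samples = {
    "ascii": '{"type": "Annotation"}',
    "japanese": "注釈の本文",
}

for label, text in samples.items():
    utf8_len, utf16_len = encoded_sizes(text)
    print(f"{label}: {utf8_len} bytes as UTF-8, {utf16_len} bytes as UTF-16")

# ASCII text is byte-for-byte identical when encoded as UTF-8.
assert "Annotation".encode("utf-8") == "Annotation".encode("ascii")

# UTF-16 has two possible byte orders for the same character,
# which is why it needs a BOM or an out-of-band label.
print("A".encode("utf-16-le"))
print("A".encode("utf-16-be"))
```

For these samples, the ASCII string doubles in size under UTF-16, while the five-character Japanese string takes 15 bytes as UTF-8 and 10 bytes as UTF-16.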

So, UTF-8 is the right choice.

I don't mind being proved wrong if it so strongly and clearly validates 
UTF-8 as the encoding. :)

Regards-
-Doug

On 10/15/14 8:34 AM, Benjamin Young wrote:
> On Wed, Oct 15, 2014 at 7:37 AM, Nick Stenning <nick@whiteink.com>
> wrote:
>
>     On Wed, Oct 15, 2014, at 07:07, Ivan Herman wrote:
>     > Thanks for the pointer, Nick. I didn't realize it was that messy...
>     >
>
>     It was Randall that pointed out the mess, not me!
>
>     That said, the article Randall linked to is about JavaScript's internal
>     string encoding, which is -- as the article discusses -- a bizarre
>     halfway house between UCS-2 and UTF-16.
>
>     That shouldn't (AFAIK) affect the issue of mandated encodings for
>     embedded content. User agents can still write Unicode text from
>     JavaScript onto the wire as UTF-8.
>
>     As I understand it, the use case for embedding is as follows:
>
>     "For small annotation bodies, the overhead associated with creating a
>     concrete resource elsewhere on the web is unacceptable, so we want a way
>     to embed sufficiently small bodies in the Annotation resource itself."
>
>     If embedded bodies will be small, the advantages of UTF-16 over UTF-8
>     for Asian texts will be minimal, and thus I'd be in favour of omitting
>     the character encoding and mandating UTF-8.
>
>
> If JSON-LD is a prime target for representation (which it seems to be),
> picking UTF-8 seems prudent.
>
> Here's a quote from RFC 7159 (the latest JSON spec):
>
>     JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default
>     encoding is UTF-8, and JSON texts that are encoded in UTF-8 are
>     interoperable in the sense that they will be read successfully by the
>     maximum number of implementations; there are many implementations
>     that cannot successfully read texts in other encodings (such as
>     UTF-16 and UTF-32).
>
> Cite: http://tools.ietf.org/html/rfc7159#section-8.1
>
> Another spec now in progress (also by Tim Bray) introduces the idea of
> I-JSON (or "Internet JSON"), which essentially outlines and codifies the
> interoperability recommendations in RFC 7159. The key piece of it for
> this conversation is:
>
>     I-JSON messages MUST be encoded using UTF-8 [RFC3629 <https://tools.ietf.org/html/rfc3629>].
>
> Cite: https://tools.ietf.org/html/draft-ietf-json-i-json-03#section-2.1
>
> So...while UTF-16 is "allowed for" it seems it would only cause implementation woes...sadly.
>
>
>     That said, I will happily reverse my position if someone has evidence
>     that omitting support for UTF-16 in embedded bodies will negatively
>     affect adoption of our standard in China/Thailand/etc.
>
>
> Likely the developers in that area are accustomed to the "fun" that is
> UTF-* and are used to UTF-8 content limitations in particular.
>
> Perhaps a stance such as "SHOULD be encoded using UTF-8, but MAY be
> encoded in UTF-16 or UTF-32 where limiting interoperability is less of a
> concern." Terribly vague, but you get the idea, hopefully. :)
>
> Thanks!
> Benjamin
> --
> Developer Advocate
> http://hypothes.is/
>
>
>     -N
>
>
>
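
As an illustration of the UTF-8-only position discussed in the quoted thread, here is a minimal Python sketch of writing and reading a small embedded body as UTF-8 JSON. The function names and the `body`/`chars` field names are hypothetical, chosen only for this example, and are not taken from any spec draft.

```python
import json


def serialize_annotation(body_text):
    """Serialize a small embedded body as UTF-8 encoded JSON bytes."""
    # Hypothetical annotation shape, for illustration only.
    annotation = {"body": {"chars": body_text}}
    # ensure_ascii=False keeps non-ASCII characters as themselves
    # instead of \uXXXX escapes; the result is then encoded as UTF-8.
    return json.dumps(annotation, ensure_ascii=False).encode("utf-8")


def parse_annotation(raw):
    """Parse annotation bytes, accepting UTF-8 only."""
    # Decode explicitly so that UTF-16 or UTF-32 input is rejected
    # rather than auto-detected by json.loads.
    return json.loads(raw.decode("utf-8"))


wire_bytes = serialize_annotation("注釈")
print(parse_annotation(wire_bytes)["body"]["chars"])
```

Decoding explicitly before parsing is a deliberate choice in this sketch: Python's `json.loads` will otherwise auto-detect UTF-16 and UTF-32 when given bytes, which is more lenient than a UTF-8-only rule.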

Received on Wednesday, 15 October 2014 17:04:46 UTC