- From: Doug Schepers <schepers@w3.org>
- Date: Wed, 15 Oct 2014 13:04:37 -0400
- To: Benjamin Young <bigbluehat@hypothes.is>, Nick Stenning <nick@whiteink.com>
- CC: public-annotation@w3.org
Hi, folks–

I was wrong, and Nick and Benjamin were right. I suspected as much, but confirmed it with W3C's internationalization expert, Richard Ishida, who said:

[[
http://www.w3.org/TR/encoding/#names-and-labels
"New protocols and formats, as well as existing formats deployed in new contexts, must use the utf-8 encoding exclusively. If these protocols and formats need to expose the encoding's name or label, they must expose it as "utf-8"."

See also http://www.w3.org/International/questions/qa-choosing-encodings

Bottom line: use utf-8 everywhere, avoid everything else if you can. UTF-8 can address all characters in Unicode.

Nick is right that "UTF-16 may be more space efficient for Japanese, Chinese, Thai, characters", but it has other problems, such as byte order. On the other hand, utf-8 has a huge win because it only requires 1 byte for ASCII, (UTF-16 requires 2) and exactly the same byte as ASCII, so the majority of code is both compatible and compact.
]]

So, UTF-8 is the right choice. I don't mind being proved wrong if it so strongly and clearly validates UTF-8 as the encoding. :)

Regards-
-Doug

On 10/15/14 8:34 AM, Benjamin Young wrote:
> On Wed, Oct 15, 2014 at 7:37 AM, Nick Stenning <nick@whiteink.com> wrote:
>
> > On Wed, Oct 15, 2014, at 07:07, Ivan Herman wrote:
> > > Thanks for the pointer, Nick. I didn't realize it was that messy...
> >
> > It was Randall that pointed out the mess, not me!
> >
> > That said, the article Randall linked to is about JavaScript's internal
> > string encoding, which is -- as the article discusses -- a bizarre
> > halfway house between UCS-2 and UTF-16.
> >
> > That shouldn't (AFAIK) affect the issue of mandated encodings for
> > embedded content. User agents can still write unicode text from
> > JavaScript onto the wire as UTF-8.
> >
> > As I understand it, the use case for embedding is as follows:
> >
> > "For small annotation bodies, the overhead associated with creating a
> > concrete resource elsewhere on the web is unacceptable, so we want a way
> > to embed sufficiently small bodies in the Annotation resource itself."
> >
> > If embedded bodies will be small, the advantages of UTF-16 over UTF-8
> > for asian texts will be minimal, and thus I'd be in favour of omitting
> > the character encoding and mandating UTF-8.
>
> If JSON-LD is a prime target for representation (which it seems to be),
> picking UTF-8 seems prudent.
>
> Here's a quote from RFC 7159 (the latest JSON spec):
>
>     JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default
>     encoding is UTF-8, and JSON texts that are encoded in UTF-8 are
>     interoperable in the sense that they will be read successfully by the
>     maximum number of implementations; there are many implementations
>     that cannot successfully read texts in other encodings (such as
>     UTF-16 and UTF-32).
>
> Cite: http://tools.ietf.org/html/rfc7159#section-8.1
>
> And another spec now in progress (also by Tim Bray) introduces the idea of
> I-JSON (or "Internet JSON") which essentially outlines and codifies the
> interoperability recommendations in RFC 7159. The key piece of which for
> this conversation is...
>
>     I-JSON messages MUST be encoded using UTF-8 [RFC3629 <https://tools.ietf.org/html/rfc3629>].
>
> Cite: https://tools.ietf.org/html/draft-ietf-json-i-json-03#section-2.1
>
> So...while UTF-16 is "allowed for" it seems it would only cause
> implementation woes...sadly.
>
> > That said, I will happily reverse my position if someone has evidence
> > that omitting support for UTF-16 in embedded bodies will negatively
> > affect adoption of our standard in China/Thailand/etc.
>
> Likely the developers in that area are accustomed to the "fun" that is
> UTF-* and are used to UTF-8 content limitations in particular.
>
> Perhaps a stance such as "SHOULD be encoded using UTF-8, but MAY be
> encoded in UTF-16 or UTF-32 where limiting interoperability is less of a
> concern." Terribly vague, but you get the idea, hopefully. :)
>
> Thanks!
> Benjamin
> --
> Developer Advocate
> http://hypothes.is/
>
> > -N
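
For anyone who wants to check the size and byte-order points raised in this thread, here is a minimal Python sketch. The sample strings and the helper name encoded_sizes are illustrative assumptions, not taken from the thread or from any spec; the sizes are counted without a byte order mark.

```python
# Illustrative sample strings (assumptions, not from the thread).
SAMPLES = {
    "ascii": "annotation",
    "japanese": "\u6ce8\u91c8",  # two CJK characters, both in the Basic Multilingual Plane
}


def encoded_sizes(text):
    """Return (utf8_byte_count, utf16_byte_count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for label, text in SAMPLES.items():
    utf8_len, utf16_len = encoded_sizes(text)
    print(f"{label}: utf-8={utf8_len} bytes, utf-16={utf16_len} bytes")

# Byte order: the same character gives different byte sequences in UTF-16,
# depending on whether the little-endian or big-endian form is used.
assert "A".encode("utf-16-le") == b"A\x00"
assert "A".encode("utf-16-be") == b"\x00A"

# UTF-8 encodes ASCII characters as the identical single byte.
assert "A".encode("utf-8") == "A".encode("ascii") == b"A"
```

For a short embedded body the absolute difference is only a few bytes either way, which matches Nick's point that the UTF-16 advantage is minimal for small bodies.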
Received on Wednesday, 15 October 2014 17:04:46 UTC