RE: HTML5 Issue 11 (encoding detection): I18N WG response... from Phillips, Addison on 2009-08-20 (public-html@w3.org from August 2009)

From: Phillips, Addison <addison@amazon.com>
Date: Thu, 20 Aug 2009 00:15:01 -0700
To: Henri Sivonen <hsivonen@iki.fi>
CC: "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA01ACCE9B94@EX-SEA5-D.ant.amazon.com>

Hello Henri,

(A personal response)

> >
> > 1. It isn't clear what constitutes a "legacy" or "non-legacy
> > environment". We think that, for modern implementations, a bare
> > recommendation of UTF-8 would be preferable.
> 
> The Web is a legacy environment. New walled gardens that are saving
> in R&D cost by using HTML but that don't have any interop
> requirements are non-legacy environments. IIRC, this wording exists
> only as a politically correct fig leaf.

As mentioned previously, the existing wording is impenetrable and should be removed. I am open to removing the UTF-8 recommendation (although I personally think it has utility). But nothing is served by text that is unclear.

> 
> > Otherwise, return an implementation-defined or user-specified
> > default character encoding, with the confidence tentative. The
> > UTF-8 encoding is recommended as a default.
> 
> This recommendation, while politically correct, is useless to
> implementors.

I disagree. Setting UTF-8 as a default may produce the dreaded "black diamonds" on the screen. But so too does choosing the wrong legacy encoding at (relative) random, which produces mojibake instead. This step is, after all, reached only after everything else, even autodetection, has failed. Some encoding must be used to interpret the bytes as characters. Why prefer a legacy encoding here?
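
To make the two failure modes concrete, here is a minimal sketch in Python. The sample word and variable names are my own illustration, not anything from the spec text. It decodes the same word with the wrong default in each direction: a UTF-8 default applied to windows-1252 bytes yields U+FFFD (the "black diamond"), while a windows-1252 default applied to UTF-8 bytes yields mojibake.

```python
# Illustrative sample only: the word and variable names are assumptions.
word = "caf\u00e9"  # "café"

legacy_bytes = word.encode("windows-1252")  # b'caf\xe9'
utf8_bytes = word.encode("utf-8")           # b'caf\xc3\xa9'

# Wrong guess 1: default to UTF-8 when the content is windows-1252.
# The lone 0xE9 byte is not valid UTF-8 here, so it becomes U+FFFD.
as_utf8 = legacy_bytes.decode("utf-8", errors="replace")

# Wrong guess 2: default to windows-1252 when the content is UTF-8.
# Every byte maps to some character, so the error is silent mojibake.
as_1252 = utf8_bytes.decode("windows-1252")

print(repr(as_utf8))   # 'caf\ufffd'
print(repr(as_1252))   # 'cafÃ©'
```

Either way the reader sees damaged text; the difference is only in which kind.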

> 
> > The default may also be set according to the expectations and
> > predominant legacy content encodings for a given demographic or
> > audience. For example, windows-1252 is recommended as the default
> > encoding for Western European language environments. Other
> > encodings may also be used. For example, "windows-949" might be an
> > appropriate default in a Korean language runtime environment.
> 
> I think this wording would be an improvement.

I think so too. And I agree that nearly every implementor will follow this path.
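
For illustration, the proposed wording could be sketched roughly as follows. This is only a sketch under assumptions: the table contents, the function name `default_encoding`, and the idea of keying on a language tag are hypothetical and are not taken from the spec or from any browser. Python's codec name `cp949` stands in for "windows-949".

```python
# Hypothetical table: which legacy default to assume per UI language.
# The entries are examples from the proposed wording, not a complete list.
DEFAULTS_BY_LANGUAGE = {
    "en": "windows-1252",
    "fr": "windows-1252",
    "de": "windows-1252",
    "ko": "cp949",  # Python's name for the encoding called "windows-949" above
}


def default_encoding(language_tag, user_default=None):
    """Return (encoding, confidence) for the last-resort fallback step."""
    # A user-specified default wins over any locale-based guess.
    if user_default:
        return (user_default, "tentative")
    # Use only the primary language subtag, e.g. "ko" from "ko-KR".
    primary = language_tag.split("-")[0].lower()
    # UTF-8 is the recommended default when no locale entry applies.
    return (DEFAULTS_BY_LANGUAGE.get(primary, "utf-8"), "tentative")


print(default_encoding("ko-KR"))
print(default_encoding("fr"))
print(default_encoding("xx"))
```

The confidence is always tentative, since this step is a guess by construction.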

> 
> > 4. We suggest adding to step (6) this note:
> >
> > --
> > Note: The UTF-8 encoding has a highly detectable bit pattern.
> > Documents that contain bytes > 0x7F which match the UTF-8 pattern
> > are very likely to be UTF-8, while documents that do not match it
> > definitely are not. While not full autodetection, it may be
> > appropriate for a user-agent to search for this common encoding.
> 
> 
> I think adding this note makes sense.

Thanks.
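
As a rough sketch of the check the note describes (the function name `looks_like_utf8` is my own, and a real user agent would work on a prefix of the stream, where a multi-byte sequence cut off at the end of the prefix needs separate handling):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic from the note: non-ASCII bytes that all fit the UTF-8 pattern."""
    # Pure ASCII carries no evidence either way, so report no match.
    if all(b < 0x80 for b in data):
        return False
    try:
        # A strict decode fails on any byte sequence outside the UTF-8 pattern.
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Does not match the pattern: definitely not UTF-8.
        return False
    # Matches the pattern and has bytes > 0x7F: very likely UTF-8.
    return True


print(looks_like_utf8("caf\u00e9".encode("utf-8")))         # True
print(looks_like_utf8("caf\u00e9".encode("windows-1252")))  # False
print(looks_like_utf8(b"plain ascii"))                      # False
```

The asymmetry in the note shows up directly: a failed decode is conclusive, while a successful one is only strong evidence.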

Best Regards,

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.

Received on Thursday, 20 August 2009 07:15:49 UTC