RE: HTML5 Issue 11 (encoding detection): I18N WG response... from Phillips, Addison on 2009-08-20 (public-i18n-core@w3.org from July to September 2009)

From: Phillips, Addison <addison@amazon.com>
Date: Thu, 20 Aug 2009 00:06:13 -0700
To: Maciej Stachowiak <mjs@apple.com>
CC: "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA01ACCE9B8E@EX-SEA5-D.ant.amazon.com>

What follows is a personal reply.

> >
> > 1. It isn't clear what constitutes a "legacy" or "non-legacy
> > environment". We think that, for modern implementations, a bare
> > recommendation of UTF-8 would be preferable.
> 
> That recommendation is not suitable for compatible processing of the
> public web. I don't believe any browser is prepared to implement such
> a requirement or recommendation. I don't think it makes sense to make
> a recommendation that is unlikely to be followed.

Your current text says the same thing, only in an impenetrable fashion :-). What is a "non-legacy environment" anyway?

I think the world has changed significantly. In the past, setting a default of UTF-8 in your browser produced mainly bad results. But, at least according to some measures [1], UTF-8 is rapidly becoming the most reasonable default encoding on the Web. If you have examined *every* possibility, including byte-groping, and still have no encoding, then defaulting to UTF-8 seems at least as reasonable as defaulting to some other arbitrarily chosen encoding.

> > 3. We think your intention is to permit the feature most browsers
> > have of allowing the user to configure (from a base default) the
> > character encoding to use when displaying a given page. The sentence
> > starting "Due to its use..." mentions "predominantly Western
> > demographics", which we find troublesome, especially given that it
> > is associated with the keyword "recommended".
> 
> Browsers for Latin-script locales pretty much universally use
> Windows-1252 as the default of last resort. This is necessary to be
> compatible with legacy content on the existing Web.

Yes, but the point is: for non-Latin-script locales, some other encoding is typically set as the default for the same reasons. You give a solution for Latin-script locales, but not others. This is not, I think, quite complete. Hence our wording suggestions, which I think are more balanced. Note that we are not suggesting that you substantively change the requirement here.

> >
> > --
> > Otherwise, return an implementation-defined or user-specified
> > default character encoding, with the confidence tentative. The UTF-8
> > encoding is recommended as a default. The default may also be set
> > according to the expectations and predominant legacy content
> > encodings for a given demographic or audience. For example,
> > windows-1252 is recommended as the default encoding for Western
> > European language environments. Other encodings may also be used.
> > For example, "windows-949" might be an appropriate default in a
> > Korean language runtime environment.
> > --
> 
> I don't actually have a technical objection to this wording. But it
> seems a little misleading. It leads with the UTF-8 recommendation, but
> in practice that recommendation won't be used, because browsers will
> use windows-1252 or something local-specific, and content will expect
> this.

Content expects nothing. Content just is. 

The main recommendation should be: declare the encoding. We provide ample opportunity to do so. And ample guidance on good encoding choices (i.e. UTF-8). Only (badly declared) legacy content should be at issue here.

Browsers can equally choose between UTF-8 and some local, legacy encoding when all else fails. You are probably equally likely to see mojibake garbage on the screen. [[of course, as I go on to point out, using UTF-8 as a default is kind of perverse, since you can usually know for sure if the content is not UTF-8]]
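
To make the asymmetry in that aside concrete, here is a minimal Python sketch. The sample word and variable names are illustrative assumptions, not anything from the spec text. It shows the same word displayed under the wrong default in each direction: a legacy default silently produces mojibake, while a UTF-8 default reveals that the bytes were not UTF-8 at all.

```python
# Illustrative sample text; not taken from any real page.
text = "café"

utf8_bytes = text.encode("utf-8")      # b'caf\xc3\xa9'
legacy_bytes = text.encode("cp1252")   # b'caf\xe9'

# UTF-8 content shown with a windows-1252 default: every byte maps to
# some character, so the result is silent mojibake.
shown_as_legacy = utf8_bytes.decode("cp1252")

# windows-1252 content shown with a UTF-8 default: the lone 0xE9 byte is
# not a valid UTF-8 sequence, so it becomes U+FFFD (replacement character).
shown_as_utf8 = legacy_bytes.decode("utf-8", errors="replace")

print(shown_as_legacy)  # cafÃ©
print(shown_as_utf8)    # caf followed by U+FFFD
```

Either way the user sees garbage, but only in the second case can the user agent tell that its guess was wrong.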

> What's the benefit of leading with a UTF-8 recommendation, but
> then following it with alternatives that nearly everyone will have to
> choose in practice?

Actually, they are not forced to choose the alternatives. If you're going to slap bytes onto the screen, one encoding is much like another. If you consider the graph in [1], you'll see that, between US-ASCII (a subset of UTF-8) and UTF-8, over half the Web now uses UTF-8.

Admittedly, legacy encoded content is common and probably most prevalent when no encoding is declared. I think most vendors will choose to make some encoding or group of encodings the "default" for step 7, based on the current system locale or user agent localization. The text I18N proposes allows this without needlessly biasing the choice to any particular group of languages. This is a measured and responsible choice.
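
As a rough sketch of what the proposed wording permits, a last-resort default could be chosen like this in Python. The table, the function name, and the language tags are hypothetical. Only the windows-1252 and windows-949 pairings come from the proposed text, and treating "en", "fr" and "de" as Western European language environments is my own assumption for illustration.

```python
# Hypothetical lookup table. Only the two encoding labels come from the
# proposed wording; the language tags are illustrative assumptions.
LEGACY_DEFAULTS = {
    "en": "windows-1252",
    "fr": "windows-1252",
    "de": "windows-1252",
    "ko": "windows-949",
}


def default_encoding(locale_tag, user_choice=None):
    """Return (encoding label, confidence) for the 'all else fails' step."""
    # A user-specified default takes priority over anything locale-based.
    if user_choice:
        return (user_choice, "tentative")
    # Reduce tags like "ko-KR" or "fr_FR" to their primary language subtag.
    primary = locale_tag.replace("_", "-").split("-")[0].lower()
    # With no known legacy expectation, fall back to the recommended UTF-8.
    return (LEGACY_DEFAULTS.get(primary, "UTF-8"), "tentative")


print(default_encoding("ko-KR"))  # ('windows-949', 'tentative')
print(default_encoding("hi-IN"))  # ('UTF-8', 'tentative')
```

The confidence is always tentative, so the result stays open to later correction or user intervention.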

At the same time, I think UTF-8 is more than a politically correct fig leaf. The more standards and implementations stress good choices, the more likely people (users, content authors) are to take them seriously. If you happen to have chosen UTF-8 as an encoding, your pages are more likely to just work. If it is recommended as the default, UTF-8 will probably continue to establish itself as the right choice as time progresses. Remember: this is the "all else fails" result and is exposed to user intervention by nearly all user agents.

> 
> >
> > 4. We suggest adding to step (6) this note:
> >
> > --
> > Note: The UTF-8 encoding has a highly detectable bit pattern.
> > Documents that contain bytes > 0x7F which match the UTF-8 pattern
> > are very likely to be UTF-8, while documents that do not match it
> > definitely are not. While not full autodetection, it may be
> > appropriate for a user-agent to search for this common encoding.
> > --
> 
> That suggestion makes sense.

Thanks.
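
For what it's worth, the check described in that note can be sketched in a few lines of Python. The function name and the three result labels are my own, and this leans on Python's strict UTF-8 decoder rather than on anything the spec defines.

```python
def classify_utf8(data: bytes) -> str:
    """Three-way guess based only on the UTF-8 bit pattern."""
    # Pure ASCII carries no evidence either way: it is valid in UTF-8
    # and in most legacy encodings alike.
    if all(b < 0x80 for b in data):
        return "no-evidence"
    try:
        # Strict decoding rejects any byte sequence that breaks the pattern.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf-8"
    # Bytes above 0x7F that all fit the pattern are very likely UTF-8.
    return "likely-utf-8"


print(classify_utf8(b"plain text"))             # no-evidence
print(classify_utf8("naïve".encode("utf-8")))   # likely-utf-8
print(classify_utf8("naïve".encode("cp1252")))  # not-utf-8
```

One caveat: if only a prefix of the document is examined, the buffer may end in the middle of a multi-byte sequence, which this sketch would wrongly report as "not-utf-8".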

[1] http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html


Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.
Received on Thursday, 20 August 2009 07:06:58 UTC