RE: HTML5 Issue 11 (encoding detection): I18N WG response... from Phillips, Addison on 2009-08-31 (public-html@w3.org from August 2009)

From: Phillips, Addison <addison@amazon.com>
Date: Mon, 31 Aug 2009 10:40:30 -0700
To: Ian Hickson <ian@hixie.ch>, Maciej Stachowiak <mjs@apple.com>, Henri Sivonen <hsivonen@iki.fi>, Anne van Kesteren <annevk@opera.com>, Andrew Cunningham <andrewc@vicnet.net.au>, Richard Ishida <ishida@w3.org>
CC: "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA01AD384E7C@EX-SEA5-D.ant.amazon.com>

Hello Hixie,

A response follows. This is a personal response rather than on behalf of the I18N WG, as we haven't met yet.

> >
> > Our concerns about this text are:
> >
> > 1. It isn't clear what constitutes a "legacy" or "non-legacy
> > environment".
> 
> The Web is a legacy environment. Non-legacy environments are new
> walled gardens.

I understand, but I think the average reader might not. Using UTF-8 as the default authoring choice in your walled garden is a Good Thing, but wouldn't this make a better FAQ or Best Practice than a "recommended" case? If your goal is to inform people writing user agents for these cases, perhaps say instead:

--
In controlled environments or in cases where the encoding of documents can be prescribed, the UTF-8 character encoding is recommended.
--

> 
> > The sentence starting "Due to its use..." mentions "predominantly
> > Western demographics", which we find troublesome, especially
> > given that it is associated with the keyword "recommended".
> 
> Why?

This is really two points. 

First, I think that the demographics phrase isn't well defined. You should be more specific with the recommendation so that implementers will know how to evaluate it. The problem here, as we've discussed before, is that down this path lies a list of recommendations (one per "demographic"), something that I think is better avoided in HTML5.

Second, "recommended" is a 2119 keyword with a normative meaning. While windows-1252 is probably the best default when users are most often accessing Latin-1 resources, it really should be an example of how an implementation-defined default is chosen (user-defined defaults being the user's business). "Western demographics" combined with this normative meaning might produce confusion (is Poland, a primarily Latin-2 environment, Western? And so on). What you are trying to say is that the best default depends on the language or regional audience. Browsers in the main do the right thing here, keying off system locale or browser localization.
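
As a rough illustration of an implementation-defined default keyed off locale, here is a minimal Python sketch. The table entries, the function name `default_encoding`, and the fallback value are assumptions made for illustration only; they are not any browser's actual mapping and are not proposed spec text.

```python
# Hypothetical, deliberately tiny table: primary language subtag -> default
# legacy encoding. The entries are illustrative assumptions only.
ILLUSTRATIVE_DEFAULTS = {
    "en": "windows-1252",  # Latin-1 audiences
    "pl": "iso-8859-2",    # Latin-2 audiences
    "ja": "shift_jis",     # could equally be a "Japanese auto-detect" mode
}


def default_encoding(locale_tag, user_override=None, fallback="windows-1252"):
    """Pick a default encoding for documents that declare none.

    A user-defined default always wins; otherwise the choice is keyed off
    the primary language subtag of the system or browser locale.
    """
    if user_override:
        return user_override
    # Accept both "pl-PL" and "pl_PL" style tags.
    primary = locale_tag.replace("_", "-").split("-")[0].lower()
    return ILLUSTRATIVE_DEFAULTS.get(primary, fallback)
```

The point of the sketch is only that the default is a function of the audience, with windows-1252 appearing as one example among several rather than as the recommendation.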

Elsewhere in this thread, Richard or I have proposed language that I think addresses this well (by making the items examples and covering additional audiences). Ultimately, it would probably be useful for the I18N WG to provide a reference about legacy encodings, superset encodings, and the famous "willful violation of CharMod" in HTML5, published as a WG Note. Documenting what the various defaults are for various browsers might be useful to other implementers needing to address this issue in their own code. 


> 
> I haven't added this, as I don't want this step to turn into a long
> list of possible algorithms to use. However, if you have other papers I
> should reference in addition to [UNIVCHARDET], I'm happy to add references.

I don't think you should add a lot of possible algorithms. It is just that, given the special nature of UTF-8, bit-sniffing for it is relatively simple and a useful strategy, at least on the server side. I suggested a special mention because I have seen browser vendors say that they are removing support for the optional step 6 as time goes on. If browsers don't do full chardet, they may still get some utility from including the UTF-8 sniff. I'll dig up an appropriate reference if you prefer.
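
For readers unfamiliar with the technique, a minimal Python sketch of a UTF-8 sniff follows. It relies only on the bit patterns of well-formed UTF-8; the function name `looks_like_utf8` is hypothetical, and the choices to report pure ASCII as "no evidence" and to treat a truncated final sequence as invalid are assumptions of this sketch, not anything the spec or a particular browser prescribes.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 with at least one
    multi-byte sequence. Pure ASCII returns False (no evidence)."""
    i, n = 0, len(data)
    saw_multibyte = False
    while i < n:
        lead = data[i]
        if lead < 0x80:  # ASCII byte, tells us nothing either way
            i += 1
            continue
        # Determine how many trail bytes follow and the allowed range of
        # the first one (this excludes overlong forms, surrogates, and
        # code points above U+10FFFF).
        if 0xC2 <= lead <= 0xDF:
            trail, lo, hi = 1, 0x80, 0xBF
        elif 0xE0 <= lead <= 0xEF:
            trail = 2
            lo = 0xA0 if lead == 0xE0 else 0x80
            hi = 0x9F if lead == 0xED else 0xBF
        elif 0xF0 <= lead <= 0xF4:
            trail = 3
            lo = 0x90 if lead == 0xF0 else 0x80
            hi = 0x8F if lead == 0xF4 else 0xBF
        else:
            return False  # stray trail byte or invalid lead byte
        if i + trail >= n:
            return False  # sequence is cut off at the end of the buffer
        if not lo <= data[i + 1] <= hi:
            return False
        for j in range(i + 2, i + trail + 1):
            if not 0x80 <= data[j] <= 0xBF:
                return False
        saw_multibyte = True
        i += trail + 1
    return saw_multibyte
```

Text in a legacy single-byte or double-byte encoding rarely happens to form valid multi-byte UTF-8 sequences, which is why a check this small is useful even without full chardet.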

My real issue was that step 6 already allows for bit sniffing, and then you allow it again with:

> Since these encodings can 
> in many cases be distinguished by inspection, a user agent may 
> heuristically decide which to use as a default.

If what you meant to suggest here was that the default might be something like "Japanese auto-detect", you should probably say that more directly.

However, it's not that important.

> >
> > http://www.w3.org/Bugs/Public/show_bug.cgi?id=7381

> > "Clarify default encoding wording and add some examples for non-latin
> > locales."
> 
> Thanks. I will get to these in due course.

Thanks. Please let the I18N WG know if we can assist you with this. I think that the text suggested further down the thread marks a useful improvement on both the existing text and the original proposal.

Addison

Addison Phillips
Globalization Architect -- Lab126
Chair -- W3C Internationalization WG

Internationalization is not a feature.
It is an architecture.

Received on Monday, 31 August 2009 17:41:15 UTC