Re: HTML5 Issue 11 (encoding detection): I18N WG response... from Ian Hickson on 2009-10-11 (public-html@w3.org from October 2009)

From: Ian Hickson <ian@hixie.ch>
Date: Sun, 11 Oct 2009 08:57:22 +0000 (UTC)
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, "Phillips, Addison" <addison@amazon.com>
Cc: Andrew Cunningham <andrewc@vicnet.net.au>, Richard Ishida <ishida@w3.org>, "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-ID: <Pine.LNX.4.62.0910110840041.25383@hixie.dreamhostps.com>

On Mon, 5 Oct 2009, "Martin J. Dürst" wrote:
> On 2009/10/05 16:59, Martin J. Dürst wrote:
> > 
> > The presentation that explained this for the first time and in great 
> > detail is at:
> > 
> > http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf
> > 
> > The Properties and Promises of UTF-8, Martin J. Dürst, 11th 
> > International Unicode Conference, San Jose, CA, USA, September 1997
>
> In addition, the regular expression at 
> http://www.w3.org/International/questions/qa-forms-utf-8 is also of 
> interest/help. It incorporates checks against overlong encodings and 
> such that are not discussed in the original paper.

Thanks, added.
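
For readers who want to see what such a check involves, here is a minimal 
Python sketch of a byte-level pattern in the spirit of the regular 
expression mentioned above. It is not copied from the W3C page; the name 
is_well_formed_utf8 is hypothetical, and allowing all of ASCII 
(including control characters) is an assumption made for simplicity.

```python
import re

# Illustrative pattern for well-formed UTF-8 byte sequences.
# Assumption: every ASCII byte is accepted, control characters included.
UTF8_WELL_FORMED = re.compile(rb"""
    (?:
        [\x00-\x7F]                        # ASCII
      | [\xC2-\xDF][\x80-\xBF]             # 2-byte; C0 and C1 leads are overlong
      | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte; excludes overlong forms
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte; straight ranges
      | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte; excludes surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte; excludes overlong forms
      | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte; planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte; plane 16 only
    )*
""", re.VERBOSE)


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data consists only of well-formed UTF-8 sequences."""
    return UTF8_WELL_FORMED.fullmatch(data) is not None


if __name__ == "__main__":
    print(is_well_formed_utf8("Dürst".encode("utf-8")))   # valid UTF-8
    print(is_well_formed_utf8("Dürst".encode("cp1252")))  # lone 0xFC byte
    print(is_well_formed_utf8(b"\xc0\xaf"))               # overlong "/"
```

Python's built-in strict decoder (data.decode("utf-8")) should reject the 
same overlong and surrogate sequences, so the pattern is mainly useful for 
seeing the byte ranges spelled out.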


On Wed, 7 Oct 2009, Phillips, Addison wrote:
> > 
> > This seems to be two problems:
> > 
> > - "Western demographics" not being very clear for implementors. In 
> > practice, I think implementors understand this pretty well, so I'm not 
> > convinced that's a problem.
> 
> I feel that the terminology is not very useful as written. It provides 
> what appears to be normative guidance but conveys no useful information 
> about what a Western demographic might be. While the major browser 
> implementers probably understand what you're getting at, future readers 
> of this text must deal with this and their understanding may or may not 
> match current implementers' abilities and understanding.

It's not clear to me what else could be said that would be as useful but 
more precise.


> > - The slippery slope of needing to define this for all demographics.
> 
> We disagree. The slippery slope isn't so much the problem as the fact 
> that "demographic" is the wrong way to address it. Furthermore, there is 
> no need for HTML5 to busy itself defining these. There are many other 
> places where choices are left up to the implementer. This is no 
> different. The normative text here exists to permit these choices.

I think it's important that we mention Win1252 as being a good default in 
many European and American countries, Australasia, and much of Africa. 
Without being so specific as to name continents (which would just lead to 
people saying that the list was wrong), I don't really know what to say 
other than "The West" or "Western demographic".


> > > > > http://www.w3.org/Bugs/Public/show_bug.cgi?id=7381
> > 
> > This bug is currently awaiting elaboration from the reporter.
> 
> This email thread contains, in the opinion of the Internationalization 
> Core WG, the necessary elaboration. We think you should adopt verbatim 
> either the text Richard proposed in:
> 
>   http://lists.w3.org/Archives/Public/public-html/2009Aug/1040.html
> 
> Or the slightly modified version I proposed in:
> 
>   http://lists.w3.org/Archives/Public/public-html/2009Aug/1051.html
> 
> ... both of which reference this bug.
> 
> Please let us know your thoughts on how to resolve this bug.

I don't think the e-mails above answer the question asked in the bug: 
what's wrong with the current text that is solved by the proposed text?

The spec now says:

> Otherwise, return an implementation-defined or user-specified default 
> character encoding, with the confidence tentative. In controlled 
> environments or in environments where the encoding of documents can be 
> prescribed (for example, for user agents intended for dedicated use in 
> new networks), the more comprehensive UTF-8 encoding is suggested. Due 
> to its use in legacy content, windows-1252 is suggested as a default in 
> predominantly Western demographics instead.
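
As a rough illustration only (this is not spec text), the fallback step 
quoted above could be sketched in Python as follows. The function name, 
both parameters, and the precedence of a user-specified default over the 
controlled-environment case are assumptions made for the example; the 
final windows-1252 branch additionally assumes a predominantly Western 
audience, which the spec only suggests and does not require.

```python
from typing import Optional, Tuple


def default_encoding(user_default: Optional[str] = None,
                     controlled_environment: bool = False) -> Tuple[str, str]:
    """Sketch of the last-resort step: pick a default, confidence tentative.

    All names here are hypothetical; the spec leaves the actual choice
    implementation-defined or user-specified.
    """
    if user_default:
        # A user-specified default (assumed here to take precedence).
        return user_default, "tentative"
    if controlled_environment:
        # Encoding of documents can be prescribed: UTF-8 is suggested.
        return "utf-8", "tentative"
    # Assumption: a predominantly Western demographic with legacy content.
    return "windows-1252", "tentative"


if __name__ == "__main__":
    print(default_encoding())
    print(default_encoding(controlled_environment=True))
    print(default_encoding(user_default="windows-949"))
```

Whatever encoding comes back, the confidence stays tentative, so nothing in 
this step amounts to a requirement to use windows-1252.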

The first e-mail above suggests using:

> Otherwise, return an implementation-defined or user-specified default 
> character encoding, with the confidence tentative. In controlled 
> environments, the more comprehensive UTF-8 encoding is recommended. For 
> the wider Web, the default may be set according to the expectations and 
> predominant content encodings for a given demographic or audience. For 
> example, windows-1252 is recommended as the default encoding for Western 
> European language environments. Other encodings may also be used. For 
> example, "windows-949" might be an appropriate default in a Korean 
> language runtime environment.

This boils down to the following changes:

1. Remove "or in environments where the encoding of documents can be 
prescribed (for example, for user agents intended for dedicated use in new 
networks)" from the second sentence.

2. Make the suggestion to use windows-1252 a conformance requirement. (The 
proposed text uses the RFC2119 "recommended", though confusingly that 
sentence starts with the non-normative "for example", so it's not clear 
what was intended here.)

3. Remove the explanation of _why_ windows-1252 is suggested ("Due to its 
use in legacy content" in the current text).

4. Change "in predominantly Western demographics" to "for Western European 
language environments".

5. Add an RFC2119 "may", contradicting the earlier RFC2119 "recommended" 
and allowing the requirement added in #2 above to be ignored. (The addition 
is "Other encodings may also be used.", which is unnecessary in the current 
text: the spec requires that the UA "return an implementation-defined or 
user-specified default character encoding", which allows any encoding and 
only suggests Win1252 rather than requiring it.)

6. Add as a non-committal example ("For example", "might be") the encoding 
windows-949 for Korean environments.

7. Use the phrase "Korean language runtime environment".

I basically don't understand the reasoning for any of these requests. #1 
seems to contradict earlier feedback from I18N requesting more detail 
about where UTF-8 could be used. #2 and #5 seem to be outright misuse of 
RFC2119 terminology. #2 seems wrong; the whole point of this step is not 
to have a requirement. #3 seems like removing otherwise useful text. #4 
seems wrong, since Win1252 is useful for many more Western demographics 
than just Western Europe (for example it's the right choice in New Zealand 
also). #6 seems like poor wording; if we are to include advice, it should 
be advice we're confident in. Are we confident in Win949 for Korea? #7 
seems like misuse of the word "runtime".

I'd much rather see a clear statement of what is wrong with the current 
text. What problem would changes to the current text be solving?

Thanks,
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Sunday, 11 October 2009 08:47:02 UTC