RE: HTML5 Issue 11 (encoding detection): I18N WG response... from Ian Hickson on 2009-10-04 (public-i18n-core@w3.org from October to December 2009)

From: Ian Hickson <ian@hixie.ch>
Date: Sun, 4 Oct 2009 11:28:35 +0000 (UTC)
To: "Phillips, Addison" <addison@amazon.com>
Cc: Andrew Cunningham <andrewc@vicnet.net.au>, Richard Ishida <ishida@w3.org>, "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-ID: <Pine.LNX.4.62.0910041118450.25383@hixie.dreamhostps.com>

On Mon, 31 Aug 2009, Phillips, Addison wrote:
> > >
> > > Our concerns about this text are:
> > >
> > > 1. It isn't clear what constitutes a "legacy" or "non-legacy 
> > > environment".
> > 
> > The Web is a legacy environment. Non-legacy environments are new 
> > walled gardens.
> 
> I understand. But I think that the average reader might not. Using UTF-8 
> as a default authoring choice in your walled garden is a Good Thing, but 
> really this would make a better FAQ or Best Practice than a 
> "recommended" case? If your goal is to inform people writing user agents 
> for these cases, perhaps say instead:
> 
> --
> In controlled environments or in cases where the encoding of documents 
> can be prescribed, the UTF-8 character encoding is recommended.
> --

Ok, I've changed "non-legacy" to something more like the above.


> > > The sentence starting "Due to its use..." mentions "predominantly 
> > > Western demographics", which we find troublesome, especially given 
> > > that it is associated with the keyword "recommended".
> > 
> > Why?
> 
> This is really two points.
> 
> First, I think that the demographics phrase isn't very well defined or 
> is imprecise. You should be more specific with the recommendation so 
> that implementers will know how to evaluate it. The problem here, as 
> we've discussed before, is that down this path is a list of 
> recommendations (one per "demographic"), something that I think better 
> to avoid in HTML5.

This seems to be two problems:

- "Western demographics" not being very clear for implementors. In 
practice, I think implementors understand this pretty well, so I'm not 
convinced that's a problem.

- The slippery slope of needing to define this for all demographics. I 
would actually like to include details for other major demographics, but I 
don't think there's a slippery slope here, given that in the years of this 
text being present, we have not added requirements for other demographics.


> Second, "recommended" is a 2119 keyword with a normative meaning. While 
> windows-1252 is probably the best default when the users are most often 
> accessing Latin-1 resources, it really should be an example of how an 
> implementation-defined default is chosen (user defined defaults being 
> the user's business). "Western demographics" combined with this 
> normative meaning might produce confusion (is Poland, a primarily 
> Latin-2 environment, Western? Etc. etc.). You are trying to say that the 
> best default for various language/regional audiences depends on the 
> audience. Browsers in the main do the right thing here, keying off 
> system locale or browser localization.

I've changed "recommended" to "suggested".
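
To make the "keying off system locale or browser localization" idea 
concrete, here is a minimal Python sketch of how an implementation-defined 
default might be chosen. Everything in it is an assumption made for 
illustration: the names FALLBACK_BY_LANGUAGE, GENERIC_FALLBACK and 
default_encoding are hypothetical, and the table is a stub holding only the 
two cases named in this thread (Latin-2 for Polish, windows-1252 
otherwise). It is not taken from the spec text or from any browser.

```python
# Hypothetical stub table: only the Polish/Latin-2 case from the thread.
# A real user agent would carry a much larger, localization-driven list.
FALLBACK_BY_LANGUAGE = {"pl": "iso-8859-2"}

# Assumed generic fallback, used here only because the table is a stub.
GENERIC_FALLBACK = "windows-1252"


def default_encoding(locale_tag: str) -> str:
    """Pick a default encoding from a locale tag such as 'pl-PL' or 'en_US'."""
    # Normalise 'pl_PL' to 'pl-PL', then keep only the primary language subtag.
    language = locale_tag.replace("_", "-").split("-")[0].lower()
    return FALLBACK_BY_LANGUAGE.get(language, GENERIC_FALLBACK)
```

Languages missing from the table fall through to windows-1252 purely as a 
placeholder; that is not a claim that it suits every other locale.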


> > I haven't added this, as I don't want this step to turn into a long 
> > list of possible algorithms to use. However, if you have other papers 
> > I should reference in addition to [UNIVCHARDET], I'm happy to add 
> > references.
> 
> I don't think you should add a lot of possible algorithms. It is just 
> that the special nature of UTF-8 and the relative simplicity of 
> bit-sniffing for it is a useful strategy, at least on the server side. I 
> suggested a special mention, given that I have seen browser vendors 
> saying that they are removing the optional step 6 support as time goes 
> on. If browsers don't do full chardet, they may still get some utility 
> by including the UTF-8 sniff. I'll dig up an appropriate reference if 
> you prefer.

If you have a reference for this, that would be preferable, yes. Thanks.
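
As a rough illustration of the UTF-8 sniff discussed above, here is a 
minimal Python sketch. The function name looks_like_utf8 is hypothetical, 
and the approach (a strict decode of the sniffed bytes) is one simple way 
to do the check, not something taken from the spec or from [UNIVCHARDET]. 
It assumes the caller passes a buffer that ends on a character boundary.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data has non-ASCII bytes that all form valid UTF-8."""
    # Pure ASCII is valid in UTF-8 and in most legacy encodings alike,
    # so it gives no evidence either way.
    if data.isascii():
        return False
    try:
        # Strict decoding rejects stray continuation bytes, overlong
        # forms and truncated sequences.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

For example, looks_like_utf8("café".encode("utf-8")) is true, while the 
same string encoded as windows-1252 is rejected. A buffer cut off in the 
middle of a multi-byte sequence is also rejected, so a caller sniffing only 
a prefix of the document would need to allow for that.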


> My real issue was that in step 6 you allowed for bit sniffing. And then 
> you allow it again with:
> 
> > Since these encodings can in many cases be distinguished by 
> > inspection, a user agent may heuristically decide which to use as a 
> > default.
> 
> If what you meant to suggest here was that the default might be 
> something like "Japanese auto-detect", you should probably say that more 
> directly.
> 
> However, it's not that important.

I've since removed that quoted text.


> > > http://www.w3.org/Bugs/Public/show_bug.cgi?id=7381
> > > "Clarify default encoding wording and add some examples for non-
> > > latin locales."
> > 
> > Thanks. I will get to these in due course.
> 
> Thanks. Please let I18N WG know if we can assist you with this. I think 
> that the text suggested further down the thread marks a useful 
> improvement both on the existing text and on the original proposal.

This bug is currently awaiting elaboration from the reporter.

Cheers,
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Sunday, 4 October 2009 11:19:40 UTC