Re: HTML5 Issue 11 (encoding detection): I18N WG response...

From: Ian Hickson <ian@hixie.ch>
Date: Mon, 12 Oct 2009 11:45:29 +0000 (UTC)
To: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>, Mark Davis ☕ <mark@macchiato.com>, Henri Sivonen <hsivonen@iki.fi>, Maciej Stachowiak <mjs@apple.com>
Cc: Martin J. Dürst <duerst@it.aoyama.ac.jp>, "Phillips, Addison" <addison@amazon.com>, Andrew Cunningham <andrewc@vicnet.net.au>, Richard Ishida <ishida@w3.org>, "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>, Larry Masinter <masinter@adobe.com>
Message-ID: <Pine.LNX.4.62.0910121138010.25383@hixie.dreamhostps.com>
On Mon, 12 Oct 2009, Leif Halvard Silli wrote:
> Ian Hickson On 09-10-11 21.23:
> > On Sun, 11 Oct 2009, Leif Halvard Silli wrote (reordered):
> > >
> > > The choice of character set - alphabet - for instance, has always 
> > > been a political matter, and still is.
> > 
> > Ok, then it seems sensible to use a political way of speaking to refer 
> > to the choice of alphabet.
> > 
> > > "Western this-and-that" is predominantly a political way of 
> > > speaking.
> > 
> > Good, then it is appropriate terminology.
> 
> Appropriate for what?

For the spec. Using political ways of speaking to talk about political 
matters.


> "Western European Language [environments]" as Addison suggested is a 
> reasonable neutral term, btw, despite use of "Western". It also gives 
> the reader many more hints about what the politics involved ...

"European" has no place in this term, as far as I can tell.


> > > Therefore it is wrong to use a wording that causes readers to think in 
> > > political terms.
> > 
> > But you agree that it _is_ a political matter.
> 
> Which "it" are you referring to now?

The choice of character set - alphabet.


> "Western demographics" is a term that leaves the job of finding out 
> which those areas are to the reader, anyhow.

If we can have instead a table of languages to default encodings, I would 
much rather have that. Is the data for such a table available?


On Mon, 12 Oct 2009, Henri Sivonen wrote:
> 
> It probably wouldn't make sense to build an exhaustive list of locales 
> where browsers default to Windows-1252, but wouldn't it be feasible to 
> build an exhaustive list of the locales where browsers *don't* default 
> to Windows-1252 (e.g. by grepping Firefox localization files)?

If such data is available, I'd be happy to include it instead of the 
current text.
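
A minimal sketch of the lookup Henri describes: list only the locales that
do not default to windows-1252, and fall back to windows-1252 for everything
else. The table entries and the names NON_1252_DEFAULTS and
default_encoding_for_locale are illustrative assumptions, not data taken
from any browser's localization files; a real table would have to be built
from that data.

```python
# Illustrative, deliberately incomplete table of locales whose default
# encoding is NOT windows-1252. These entries are placeholders for the
# sketch, not data gathered from browser localization files.
NON_1252_DEFAULTS = {
    "ru": "windows-1251",
    "ja": "shift_jis",
    "tr": "windows-1254",
    "he": "windows-1255",
}


def default_encoding_for_locale(locale: str) -> str:
    """Return the fallback encoding for a locale tag such as 'ru-RU'."""
    tag = locale.replace("_", "-").lower()
    # Try the full tag first, then drop trailing subtags one at a time
    # ("ru-ru" -> "ru") until a table entry matches or nothing is left.
    while tag:
        if tag in NON_1252_DEFAULTS:
            return NON_1252_DEFAULTS[tag]
        tag = tag.rpartition("-")[0]
    # Any locale not listed is assumed to default to windows-1252.
    return "windows-1252"
```

Listing only the exceptions keeps the table short, at the cost of silently
treating every unlisted locale as a windows-1252 locale.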


On Sun, 11 Oct 2009, Mark Davis ☕ wrote:
> 
> But focusing on advice to developers, I'd suggest replacing 6 and 7 in 
> http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding, 
> by the following 3 numbered items.
> 
>    - Test if the bytes are valid UTF-8. If they are, return that
>    encoding, with the
> confidence<http://dev.w3.org/html5/spec/Overview.html#concept-encoding-confidence>
>    *tentative*, and abort these steps.
>       - *[include note about UTF-8 patterns, maybe reworded a bit.]*
>    - The user agent may attempt to autodetect the character encoding *[include
>    rest of #5]*
>    - Otherwise, return an implementation-defined or user-specified default
>    character encoding, with the
> confidence<http://dev.w3.org/html5/spec/Overview.html#concept-encoding-confidence>
>    *tentative*. Due to its widespread use as a default in legacy content,
>    windows-1252 is recommended as a default in the absence of other
>    information.

On Mon, 12 Oct 2009, Henri Sivonen wrote:
> 
> So you are suggesting making UTF-8 autodetect mandatory while leaving 
> the rest of chardet optional? Does any one of the 5 top browsers do 
> that?

Mark, could you elaborate on your reasoning for this proposal and on the 
intent of browser vendors to follow those requirements?
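
For concreteness, the three steps in Mark's proposal can be sketched roughly
as follows. This is only an illustration under stated assumptions, not what
any browser does: the function name determine_encoding and the optional
autodetect callback are hypothetical, and the sketch assumes the whole byte
buffer is available at once.

```python
from typing import Callable, Optional, Tuple


def determine_encoding(
    data: bytes,
    autodetect: Optional[Callable[[bytes], Optional[str]]] = None,
    default: str = "windows-1252",
) -> Tuple[str, str]:
    """Return (encoding, confidence) following the three proposed steps."""
    # Step 1: if the bytes are valid UTF-8, return UTF-8 (tentative).
    try:
        data.decode("utf-8", errors="strict")
        return ("utf-8", "tentative")
    except UnicodeDecodeError:
        pass

    # Step 2: the user agent MAY attempt to autodetect the encoding.
    # `autodetect` is a hypothetical hook; it returns None when unsure.
    if autodetect is not None:
        guess = autodetect(data)
        if guess is not None:
            return (guess, "tentative")

    # Step 3: otherwise fall back to the implementation-defined or
    # user-specified default, with windows-1252 as the suggested default.
    return (default, "tentative")
```

Note that pure ASCII input is also valid UTF-8, so under this sketch an
all-ASCII document comes back as UTF-8 rather than as the locale default.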


On Mon, 12 Oct 2009, Maciej Stachowiak wrote:
> On Oct 11, 2009, at 12:23 PM, Ian Hickson wrote:
> > 
> > What phrase best approximates the areas of the world where _today_ UAs 
> > are shipping with a 1252 default encoding?
> 
> "locales that predominantly use the Latin script"

Given that 1252 is the Latin script, that seems circular.


> Or you could say:
> 
> "locales that predominantly use the Latin script, and whose primary 
> languages are completely or almost completely covered by Windows-1252."

I'd rather just have an explicit table, if we can.


> Note: in the browsers that vary this, it is always determined by 
> "locale", not "demographic" (which is not a computing concept). I don't 
> think using the term "demographic" makes sense in this context.

Fair enough. Changed to "locale".

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Monday, 12 October 2009 11:34:55 GMT
