Re: HTML5 Issue 11 (encoding detection): I18N WG response...

On Mon, 12 Oct 2009, Leif Halvard Silli wrote:
> Ian Hickson On 09-10-11 21.23:
> > On Sun, 11 Oct 2009, Leif Halvard Silli wrote (reordered):
> > >
> > > The choice of character set - alphabet - for instance, has always 
> > > been a political matter, and still is.
> > 
> > Ok, then it seems sensible to use a political way of speaking to refer 
> > to the choice of alphabet.
> > 
> > > "Western this-and-that" is predominantly a political way of 
> > > speaking.
> > 
> > Good, then it is appropriate terminology.
> Appropriate for what?

For the spec. Using political ways of speaking to talk about political 
matters seems appropriate.

> "Western European Language [environments]" as Addison suggested is a 
> reasonable neutral term, btw, despite use of "Western". It also gives 
> the reader much more hints about what the politics involved ...

"European" has no place in this term, as far as I can tell.

> > > Therefore it is wrong to use a wording that causes readers to think 
> > > in political terms.
> > 
> > But you agree that it _is_ a political matter.
> Which "it" are you referring to now?

The choice of character set - alphabet.

> "Western demographics" is a term that leaves the job of finding out 
> which those areas are to the reader, anyhow.

If we can have instead a table of languages to default encodings, I would 
much rather have that. Is the data for such a table available?

On Mon, 12 Oct 2009, Henri Sivonen wrote:
> It probably wouldn't make sense to build an exhaustive list of locales 
> where browsers default to Windows-1252, but wouldn't it be feasible to 
> build an exhaustive list of the locales where browsers *don't* default 
> to Windows-1252 (e.g. by grepping Firefox localization files)?

If such data is available, I'd be happy to include it instead of the 
current text.
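
To make the shape of Henri's suggestion concrete, here is a minimal sketch of an exceptions table: list only the locales that do *not* default to Windows-1252, and fall back to Windows-1252 for everything else. The names `NON_1252_DEFAULTS` and `default_encoding_for` are hypothetical, and the two entries are illustrative placeholders, not data extracted from any browser's localization files.

```python
# Hypothetical exceptions table: locales whose default is NOT windows-1252.
# The entries below are illustrative placeholders only; real data would
# have to come from browser localization files.
NON_1252_DEFAULTS = {
    "ja": "shift_jis",
    "ru": "windows-1251",
}

FALLBACK = "windows-1252"


def default_encoding_for(locale: str) -> str:
    """Return the default legacy encoding for a locale tag such as 'ja-JP'."""
    tag = locale.replace("_", "-").lower()
    language = tag.split("-")[0]
    # Prefer an exact match on the full tag, then the bare language,
    # and only then fall back to windows-1252.
    if tag in NON_1252_DEFAULTS:
        return NON_1252_DEFAULTS[tag]
    return NON_1252_DEFAULTS.get(language, FALLBACK)
```

With this shape the table stays short, and any locale not listed gets Windows-1252 without needing its own row.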

On Sun, 11 Oct 2009, Mark Davis ~X~U wrote:
> But focusing on advice to developers, I'd suggest replacing 6 and 7 in 
> by the following 3 numbered items.
>    - Test if the bytes are valid UTF-8. If they are, return that
>      encoding, with the confidence *tentative*, and abort these steps.
>       - *[include note about UTF-8 patterns, maybe reworded a bit.]*
>    - The user agent may attempt to autodetect the character encoding
>      *[include rest of #5]*
>    - Otherwise, return an implementation-defined or user-specified default
>      character encoding, with the confidence *tentative*. Due to its
>      widespread use as a default in legacy content, windows-1252 is
>      recommended as a default in the absence of other information.

On Mon, 12 Oct 2009, Henri Sivonen wrote:
> So you are suggesting making UTF-8 autodetect mandatory while leaving 
> the rest of chardet optional? Does any one of the 5 top browsers do 
> that?

Mark, could you elaborate on your reasoning for this proposal and on the 
intent of browser vendors to follow those requirements?
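
For readers following along, the three steps Mark proposes can be sketched roughly as below. This is only an illustration of the proposal as quoted, not spec text and not what any browser does; the function name `sniff_encoding` and the `autodetect` callback are hypothetical, and Python's strict UTF-8 decoder stands in for "valid UTF-8".

```python
def sniff_encoding(data, autodetect=None, default="windows-1252"):
    """Return (encoding, confidence) following the three proposed steps.

    `autodetect` is a hypothetical optional callback that takes the bytes
    and returns an encoding name, or None if it cannot decide.
    """
    # Step 1: if the bytes are valid UTF-8, return UTF-8 as tentative.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        pass
    else:
        return ("utf-8", "tentative")

    # Step 2: the user agent may attempt to autodetect the encoding.
    if autodetect is not None:
        guess = autodetect(data)
        if guess is not None:
            return (guess, "tentative")

    # Step 3: otherwise use the implementation-defined or user-specified
    # default; windows-1252 is the one suggested in the proposal.
    return (default, "tentative")
```

Note that pure ASCII input also passes step 1, so under this sketch it would be reported as UTF-8 rather than falling through to the default.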

On Mon, 12 Oct 2009, Maciej Stachowiak wrote:
> On Oct 11, 2009, at 12:23 PM, Ian Hickson wrote:
> > 
> > What phrase best approximates the areas of the world where _today_ UAs 
> > are shipping with a 1252 default encoding?
> "locales that predominantly use the Latin script"

Given that 1252 is the Latin script, that would seem circular.

> Or you could say:
> "locales that predominantly use the Latin script, and whose primary 
> languages are completely or almost completely covered by Windows-1252."

I'd rather just have an explicit table, if we can.

> Note: in the browsers that vary this, it is always determined by 
> "locale", not "demographic" (which is not a computing concept). I don't 
> think using the term "demographic" makes sense in this context.

Fair enough. Changed to "locale".

Ian Hickson               U+1047E                )\._.,--....,'``.    fL
                          U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Monday, 12 October 2009 11:34:55 UTC