Re: HTML5 Issue 11 (encoding detection): I18N WG response...

On Sun, 30 Aug 2009 02:37:13 +0000 (UTC) Ian Hickson wrote:
> On Wed, 19 Aug 2009, Phillips, Addison wrote:

> > We remain concerned about the text in Step 7 in this section:
> >
> > http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
> >
> > Our concerns about this text are:
> >
> > 1. It isn't clear what constitutes a "legacy" or "non-legacy
> > environment".
>
> The Web is a legacy environment. Non-legacy environments are new walled
> gardens.

By that definition, any new document created on a computer belongs to a 
non-legacy environment. This should be pointed out. Should file:// URLs 
be considered "non-legacy"? Should the date of creation be taken into 
consideration?


> > The sentence starting "Due to its use..." mentions "predominantly
> > Western demographics", which we find troublesome, especially given that
> > it is associated with the keyword "recommended".
>
> Why?

I agree with Addison that the text is unclear. Some comments on the 
following paragraph:

<txt>Otherwise, return an implementation-defined or user-specified 
default character encoding, with the confidence tentative. In non-legacy 
environments, the more comprehensive UTF-8 encoding is recommended. Due 
to its use in legacy content, windows-1252 is recommended as a default 
in predominantly Western demographics instead.</txt>

(1) The text says "... windows-1252 is recommended as a default ....". I 
also suggest saying "... UTF-8 encoding is recommended _as_a_default._" 
in the preceding sentence about "non-legacy environments".

(2) The last word is "... instead". In your debate with Addison, you 
seem to draw a clear line between "legacy" and "non-legacy". But here, 
in the text, the word "instead" seems to link back to the sentence 
about "non-legacy" content, thus making it seem as if there is a link from 
"non-legacy" to "legacy". Hence the advice could be interpreted like 
this: "for legacy content - but only for 'Western' legacy content, then 
windows-1252 is recommended as a default, instead of using the 
recommended non-legacy environment encoding". Another possible 
interpretation: "for Western _non-legacy_ environments, then due to the 
state of Western _legacy_ _content_, win-1252 is actually recommended as 
default" ...

(3) I suggest replacing the phrase "Western demographics", because 
"Western" is a political word. For instance, Japan is sometimes referred 
to as "Western". In addition, the phrase "Western demographics" is used 
fewer than a thousand times on the Web, according to Google [1]. I don't 
think it is necessary to invent a phrase to express what is meant here. 
(Are there any "demographics" that predominantly use the Western 
European character set, but for which Windows-1252 is _not_ 
recommended as the fallback encoding?)

(4) Should it not be mentioned that other defaults may be recommended - 
whether specified or not - for non-Western Latin hemispheres?

(5) You talk about "non-legacy environments" versus "legacy content". I 
wonder what the difference between "environment" and "content" is.  Is 
"legacy content" the same as "old" content? Can timestamps be used to 
decide the best default ...?

(6) "Default" in plain English means "fallback". Can "fallback" be used 
instead of "default" in this paragraph? (And, by all means, elsewhere 
too, if you wish.) "Default" has so many unfortunate interpretations ... 
For instance, some might interpret "Windows 1252 is default for Western 
European languages" as "Windows 1252 is the recommended encoding for 
Western European languages".

(7) I wonder if "locale" or "localization" could be used instead of 
"demographics". The text speaks about an "implementation-defined or 
user-specified" fallback encoding. Browser vendors will perhaps need to 
define "demographics", but this is a seldom-used word. Locale, or 
locales (plural), is much better known to most parties, I think.

All in all, here is a suggested improvement, as far as I've understood ...

<txt>Otherwise, return an implementation-defined or user-specified 
fallback encoding, with the confidence tentative. In non-legacy 
environments, the more comprehensive UTF-8 encoding <ins>is recommended 
as a fallback encoding</ins>. <ins>For</ins> legacy content, <ins>then 
the dominating legacy encoding of one or several text encoding related 
locale(s), is often recommendable as a fallback encoding. For instance, 
for legacy content of locales that predominantly use the Western 
European Latin character set, then </ins> Windows-1252 is recommended as 
a fallback encoding.</txt>
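
To illustrate how an implementation might read the suggested wording, 
here is a minimal Python sketch. It is only an assumption about one 
possible reading, not anything the draft prescribes: the function name 
fallback_encoding, the LEGACY_FALLBACKS table, its entries, and the 
choice of UTF-8 for locales missing from the table are all hypothetical.

```python
# Hypothetical table mapping a locale's language subtag to the dominant
# legacy encoding of that locale. The entries are illustrative assumptions,
# not a list taken from the HTML5 draft.
LEGACY_FALLBACKS = {
    "en": "windows-1252",
    "fr": "windows-1252",
    "de": "windows-1252",
    "ru": "windows-1251",
    "ja": "shift_jis",
}


def fallback_encoding(locale_tag, legacy_content=True, unknown="utf-8"):
    """Return (encoding, confidence) for content with no declared encoding.

    The confidence is always "tentative", mirroring the quoted paragraph.
    The `unknown` value used for locales missing from the table is an
    assumption of this sketch.
    """
    if not legacy_content:
        # Non-legacy environment: UTF-8 is the recommended fallback.
        return ("utf-8", "tentative")
    # Reduce tags such as "en-US" or "ru_RU" to the language subtag.
    language = locale_tag.replace("_", "-").split("-")[0].lower()
    return (LEGACY_FALLBACKS.get(language, unknown), "tentative")
```

With this table, fallback_encoding("en-US") selects windows-1252 for 
legacy content, while passing legacy_content=False selects UTF-8 
regardless of locale.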
-- 
leif halvard silli

Received on Monday, 31 August 2009 09:24:38 UTC