Bug 7381 (default encoding selection) from Richard Ishida on 2009-08-20 (public-i18n-core@w3.org from July to September 2009)

From: Richard Ishida <ishida@w3.org>
Date: Thu, 20 Aug 2009 10:46:44 +0100
To: "'Maciej Stachowiak'" <mjs@apple.com>, "'Phillips, Addison'" <addison@amazon.com>
Cc: <public-html@w3.org>, <public-i18n-core@w3.org>
Message-ID: <003501ca217b$21032a30$63097e90$@org>
> http://www.w3.org/Bugs/Public/show_bug.cgi?id=7381
> "Clarify default encoding wording and add some examples for non-latin
> locales."

So building on that, here's a first comment on 7381: 

I think that unless the word 'legacy' is specifically defined for this use
in HTML5, we still need to clarify it. (Especially as in Charmod, 'legacy'
is used to refer to non-Unicode encodings, which may further confuse.)
Building on Henri's explanation, how about this wording:


Otherwise, return an implementation-defined or user-specified default
character encoding, with the confidence tentative. In controlled
environments, the more comprehensive UTF-8 encoding is recommended. For the
wider Web, the default may be set according to the expectations and
predominant content encodings for a given demographic or audience. For
example, windows-1252 is recommended as the default encoding for Western
European language environments. Other encodings may also be used. For
example, "windows-949" might be an appropriate default in a Korean language
runtime environment.

[1] We could add to the end ", and UTF-8 would be an appropriate default for
scripts in many developing regions."  I suggest this, not because I want to
see UTF-8 go for World Wide Web domination or because I see it as a global
panacea, but because I think it helps for certain demographics or audiences.
The situation in these regions is often mired in competing encodings, each
with a non-majority user base, that impede general interoperability, and use
of UTF-8 tends to provide a way forward - not only by superseding other
encoding schemes, but also typically by providing useful features that
support the use of the script itself.  I just don't want it to sound as if
you should try to find a local encoding for the default in every
circumstance.

[2] I think it may also be worthwhile noting that the default encoding may
also be one explicitly set by users in some applications (e.g. Firefox and
IE allow you to change the default encoding).
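
To make the proposed wording and notes [1] and [2] concrete, here is a
minimal sketch of how a user agent might pick its last-resort default. The
function name, its parameters, and the contents of the locale table are
assumptions made for illustration only; neither the spec text nor this
thread defines them, and a real table would be much larger.

```python
from typing import Optional, Tuple

# Hypothetical table mapping a language subtag to a default encoding label.
# Only the two examples named in the proposed wording are grounded in the
# thread; the specific Western European languages listed are assumptions.
LOCALE_DEFAULTS = {
    "en": "windows-1252",
    "fr": "windows-1252",
    "de": "windows-1252",
    "es": "windows-1252",
    "ko": "windows-949",
}


def default_encoding(
    locale: str,
    user_choice: Optional[str] = None,
    controlled: bool = False,
) -> Tuple[str, str]:
    """Return (encoding label, confidence) for the final fallback step."""
    # Note [2]: a default explicitly set by the user takes priority.
    if user_choice:
        return user_choice, "tentative"
    # Controlled environments: UTF-8 is the recommended default.
    if controlled:
        return "UTF-8", "tentative"
    # Wider Web: go by the expectations of the audience, keyed here on the
    # language subtag of a tag such as "ko-KR" or "fr_FR".
    language = locale.replace("_", "-").split("-")[0].lower()
    # Note [1]: fall back to UTF-8 instead of hunting for a local encoding
    # in every circumstance.
    return LOCALE_DEFAULTS.get(language, "UTF-8"), "tentative"
```

The confidence is always "tentative" in this sketch, since the result is
only a default and not something declared by the document itself.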

Hope that helps,
RI


============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)

http://www.w3.org/International/
http://rishida.net/



From: public-i18n-core-request@w3.org
[mailto:public-i18n-core-request@w3.org] On Behalf Of Maciej Stachowiak
Sent: 20 August 2009 08:33
To: Phillips, Addison
Cc: public-html@w3.org; public-i18n-core@w3.org
Subject: Re: HTML5 Issue 11 (encoding detection): I18N WG response...


Based on further discussion with you and Henri, I filed the following:

http://www.w3.org/Bugs/Public/show_bug.cgi?id=7380
"Suggest heuristic detection of UTF-8"


http://www.w3.org/Bugs/Public/show_bug.cgi?id=7381
"Clarify default encoding wording and add some examples for non-latin
locales."


Would you be willing to close ISSUE-11 in favor of the above two bugs?


Regards,
Maciej

On Aug 19, 2009, at 9:22 PM, Phillips, Addison wrote:


Dear HTML5,

The I18N Core WG would like to respond to your issue located here:

  http://www.w3.org/html/wg/tracker/issues/11

We remain concerned about the text in Step 7 in this section:

  http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding

Your current text reads:

--
Otherwise, return an implementation-defined or user-specified default
character encoding, with the confidence tentative. In non-legacy
environments, the more comprehensive UTF-8 encoding is recommended. Due to
its use in legacy content, windows-1252 is recommended as a default in
predominantly Western demographics instead. Since these encodings can in
many cases be distinguished by inspection, a user agent may heuristically
decide which to use as a default.
--

Our concerns about this text are:

1. It isn't clear what constitutes a "legacy" or "non-legacy environment".
We think that, for modern implementations, a bare recommendation of UTF-8
would be preferable.

2. The sentence starting "Since these encodings can {...} be distinguished
by inspection" is not really accurate. If the user agent has performed the
optional step (6), then heuristic detection has already been applied and
failed. If the user agent has not done step (6), then the only reasonable
encoding that can reliably be detected based solely on bit-pattern is UTF-8.

3. We think your intention is to permit the feature most browsers have of
allowing the user to configure (from a base default) the character encoding
to use when displaying a given page. The sentence starting "Due to its
use..." mentions "predominantly Western demographics", which we find
troublesome, especially given that it is associated with the keyword
"recommended".

We would like to request that you reword this paragraph along the lines of
something like:

--
Otherwise, return an implementation-defined or user-specified default
character encoding, with the confidence tentative. The UTF-8 encoding is
recommended as a default. The default may also be set according to the
expectations and predominant legacy content encodings for a given
demographic or audience. For example, windows-1252 is recommended as the
default encoding for Western European language environments. Other encodings
may also be used. For example, "windows-949" might be an appropriate default
in a Korean language runtime environment. 
--

4. We suggest adding to step (6) this note:

--
Note: The UTF-8 encoding has a highly detectable bit pattern. Documents that
contain bytes > 0x7F which match the UTF-8 pattern are very likely to be
UTF-8, while documents that do not match it definitely are not. While not
full autodetection, it may be appropriate for a user-agent to search for
this common encoding.
--
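
The check described in the note for step (6) can be sketched as follows.
This is an illustration under stated assumptions, not text proposed for the
spec: the function name, the three-valued result, and the "truncated" flag
are all hypothetical, and it relies on Python's strict UTF-8 decoder to do
the bit-pattern validation.

```python
import codecs
from typing import Optional


def looks_like_utf8(data: bytes, truncated: bool = False) -> Optional[bool]:
    """Heuristic UTF-8 check on raw document bytes.

    Returns None when there are no bytes above 0x7F (pure ASCII gives no
    evidence either way), True when the non-ASCII bytes all match the UTF-8
    pattern, and False when any sequence violates it.
    """
    if all(byte <= 0x7F for byte in data):
        return None
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        # With final=False an incomplete sequence at the very end is
        # tolerated, which suits a buffer holding only the first part of
        # the document. With final=True it counts as a violation.
        decoder.decode(data, final=not truncated)
    except UnicodeDecodeError:
        return False
    return True
```

A result of True means "very likely UTF-8" in the sense of the note, while
False means "definitely not UTF-8". A result of None leaves the decision to
the default selected in step (7).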

Addison (for I18N WG)

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.
Received on Thursday, 20 August 2009 09:46:57 UTC