Re: Auto-detect and encodings in HTML5 from Sorin Sbarnea on 2009-06-02 (public-html@w3.org from June 2009)

From: Sorin Sbarnea <sorin.sbarnea@gmail.com>
Date: Tue, 2 Jun 2009 16:33:42 +0300
To: Henri Sivonen <hsivonen@iki.fi>
Cc: Larry Masinter <masinter@adobe.com>, Chris Wilson <Chris.Wilson@microsoft.com>, Maciej Stachowiak <mjs@apple.com>, "M.T. Carrasco Benitez" <mtcarrascob@yahoo.com>, Travis Leithead <Travis.Leithead@microsoft.com>, Erik van der Poel <erikv@google.com>, "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>, Richard Ishida <ishida@w3.org>, Ian Hickson <ian@hixie.ch>, Harley Rosnow <Harley.Rosnow@microsoft.com>
Message-ID: <d77cce480906020633m277c6cc1wf69c35378d944a9f@mail.gmail.com>

I think that defining the default encoding as UTF-8 in HTML 5 is a
great opportunity to solve the endless collection of problems
generated by the existence of codepages.

The possible backwards-compatibility issues with old browsers are less
important than the benefits the industry would gain by 'dropping' the
codepages. When it comes to standards we should not leave room for
interpretation, such as what to do when the charset is not specified;
heuristics were just a workaround.

Many programmers are not thinking globally, and by defining the default
encoding as UTF-8 we'll educate them to use a correct encoding instead
of some limited charset. Just allowing people to do the right thing is
not enough; they should get it by default (without knowing it).

We should not miss the chance to do something good by putting a
small nail in the codepages' coffin.

--
Sorin Sbârnea
http://blog.i18n.ro


On Tue, Jun 2, 2009 at 11:10 AM, Henri Sivonen <hsivonen@iki.fi> wrote:
> On Jun 1, 2009, at 20:44, Larry Masinter wrote:
>
>> Chris, in your note below you claim that the "current de facto" value was
>> "Win1252" which seems to contradict what I thought was claimed in another
>> message that the "de facto" default was "unknown" (which was my
>> understanding, i.e., that browsers used a wide variety of heuristics to
>> determine charset).
>
> The de facto default is Windows-1252 except for locales where it isn't. If a
> user mostly browses pages written in Simplified Chinese, it makes sense to
> make GBK the default (GBK is to GB2312 what Windows-1252 is to ISO-8859-1)
> at least when heuristics are turned off.
>
> At least for the U.S. locale, Firefox and IE8 default to heuristics off. The
> user can enable heuristics for CJK and Cyrillic (in various groupings: all,
> both kinds of Chinese only, only Simplified Chinese, only Russian, etc.).
> Firefox (but not IE8) also supports a grouping for all of CJK (excluding
> Cyrillic).
>
> In Opera, the heuristic detector may be enabled for any one of C, J and K or
> Cyrillic. (Opera doesn't seem to have Russian-only, Ukrainian-only,
> Traditional Chinese-only or Simplified Chinese-only modes.) It's unclear if
> Opera's default behavior is heuristics off or universal heuristics on.
> Perhaps someone from Opera could enlighten us.
>
> Safari doesn't have heuristic selection UI. It is unclear if Safari has no
> heuristics or whether it has always-on heuristics by default. Chrome's UI is
> slightly differently ambiguous.
>
> (For clarity: Above I'm using the word "heuristics" to mean exclusively the
> frequency and chaining analysis on bytes.)
>
>> I'm interested in reducing ambiguity and making web transactions more
>> reliable, and associating a new version indicator (DOCTYPE) with a more
>> constrained default (charset default UTF-8, rather than 'unknown') is
>> reasonable, while I also would be opposed to making an incompatible change
>> with actual current behavior.
>
>
> We already have 3 reliable version indicators for the encoding axis of
> versioning:
> charset=utf-8 on the HTTP layer
> charset=utf-8 in <meta>
> the UTF-8 BOM
>
> We don't need a new indicator that wouldn't be as compatible with existing
> user agents as the indicators we already have. (Consider the Degrade
> Gracefully principle.)
>
> On Jun 2, 2009, at 03:48, Leif Halvard Silli wrote:
>
>> Is it the choice of UTF-8 as default you don't understand? If so, then I'd
>> like to quote the "Support World Languages" principle.
>
> The Support World Languages principle is satisfied by HTML5 allowing authors
> to opt in to UTF-8 easily. It has to be opt-in due to the Support Existing
> Content and Degrade Gracefully principles.
>
> --
> Henri Sivonen
> hsivonen@iki.fi
> http://hsivonen.iki.fi/
>
>
>
>
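
A minimal sketch of how the three indicators Henri lists (charset on the
HTTP layer, charset in <meta>, the UTF-8 BOM) could be combined with a
locale-dependent fallback such as Windows-1252 or GBK. The function name
`sniff_encoding`, the `LOCALE_DEFAULTS` table, the order in which the
indicators are checked, and the 1024-byte scan window are all assumptions
made for illustration; this is not the HTML5 algorithm or any browser's
actual implementation, and it does no frequency-based heuristics at all.

```python
import codecs
import re
from typing import Optional

# Hypothetical fallback table: real browsers ship their own per-locale defaults.
LOCALE_DEFAULTS = {"zh-CN": "gbk"}


def sniff_encoding(body: bytes,
                   content_type: Optional[str] = None,
                   locale: str = "en-US") -> str:
    """Pick an encoding from explicit indicators, else a locale default."""
    # 1. charset parameter on the HTTP layer (Content-Type header value).
    if content_type:
        m = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type, re.IGNORECASE)
        if m:
            return m.group(1).lower()

    # 2. UTF-8 byte order mark at the very start of the document.
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"

    # 3. charset declared in a <meta> element near the top of the document
    #    (the size of the scan window is an assumption of this sketch).
    m = re.search(rb'<meta[^>]*charset\s*=\s*["\']?([\w.:-]+)',
                  body[:1024], re.IGNORECASE)
    if m:
        return m.group(1).decode("ascii").lower()

    # 4. No indicator present: fall back to the locale's legacy default.
    return LOCALE_DEFAULTS.get(locale, "windows-1252")
```

With this sketch, a page carrying any one of the three indicators is read as
UTF-8 regardless of locale, while an unlabelled page gets the legacy default,
which is the opt-in behaviour described in the quoted message.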

Received on Tuesday, 2 June 2009 15:09:58 UTC