Re: Auto-detect and encodings in HTML5

2009/6/2 Henri Sivonen <hsivonen@iki.fi>:
> On Jun 1, 2009, at 20:44, Larry Masinter wrote:
>
>> Chris, in your note below you claim that the "current de facto" value was
>> "Win1252" which seems to contradict what I thought was claimed in another
>> message that the "de facto" default was "unknown" (which was my
>> understanding, i.e., that browsers used a wide variety of heuristics to
>> determine charset).
>
> The de facto default is Windows-1252 except for locales where it isn't. If a
> user mostly browses pages written in Simplified Chinese, it makes sense to
> make GBK the default (GBK is to GB2312 what Windows-1252 is to ISO-8859-1)
> at least when heuristics are turned off.


That's what Firefox, Chrome and IE do. The French, English, German and
Swedish builds of Firefox default to windows-1252 (strictly, the setting
is iso-8859-1, but when reading, Firefox treats ISO-8859-1 the same as
windows-1252, as HTML5 specifies), while Simplified Chinese
Firefox/Chrome defaults to GBK, Russian Firefox/Chrome to windows-1251,
and Japanese Firefox/Chrome to Shift_JIS.
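
To make the per-locale default concrete, here is a minimal sketch in
Python. The table and the function name are hypothetical and are not
taken from any browser's source; the entries only mirror the locales
mentioned in this thread.

```python
# Hypothetical table: UI locale -> encoding assumed for a page that
# declares no charset while the encoding detector is off.
# The Russian entry uses windows-1251, the Cyrillic Windows code page.
DEFAULT_ENCODING_BY_LOCALE = {
    "en": "windows-1252",
    "fr": "windows-1252",
    "de": "windows-1252",
    "sv": "windows-1252",
    "zh-CN": "gbk",
    "ru": "windows-1251",
    "ja": "shift_jis",
}


def fallback_encoding(locale, user_override=None):
    """Return the encoding to assume for an unlabelled page.

    A default chosen by the user in the browser UI wins over the
    locale's built-in default; unknown locales fall back to
    windows-1252 (an assumption of this sketch).
    """
    if user_override:
        return user_override
    return DEFAULT_ENCODING_BY_LOCALE.get(locale, "windows-1252")
```

For instance, an English-locale user who has picked EUC-KR in the UI
would get fallback_encoding("en", "euc-kr"), which returns "euc-kr".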

Moreover, Firefox, Chrome and Safari offer a UI to change the default
encoding/charset to assume when no other information is available AND
the encoding detector is OFF. (Safari has no encoding detector except
for a built-in Japanese one, which kicks in whenever it comes across a
page that claims to be in one of the Japanese encodings. I think that is
not such a good idea because the user has no control over it. See
https://bugs.webkit.org/show_bug.cgi?id=21990 )

In my case, the UI language of my browser is usually English. So out
of the box, the default encoding is iso-8859-1/windows-1252, but I set
the default encoding to EUC-KR/Win949 because the vast majority of
pages I visit are either in Korean (encoded in EUC-KR with or without
a declared charset) or in English (mostly ASCII with occasional
non-ASCII characters; since EUC-KR is ASCII-compatible, in most cases I
don't have to change the encoding even when none is declared).
The same can be said of Simplified Chinese, Traditional Chinese and
Thai speakers. For speakers of languages with multiple encodings in
wide use, it can be a bit trickier, but even for them there's a single
most widely used encoding (e.g. windows-1251 for Russian, windows-1256
for Arabic, windows-1255 for Hebrew and Shift_JIS for Japanese).
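
The ASCII-compatibility point can be checked with a few lines of
Python. This is only an illustration using Python's built-in codecs,
not a model of any browser's decoder; the sample strings are made up.

```python
# Pure-ASCII bytes decode to the same text under any ASCII-compatible
# encoding, so an English page looks the same whichever default is set.
ascii_page = b"<p>Hello, world</p>"
decoded = {
    enc: ascii_page.decode(enc)
    for enc in ("euc-kr", "cp949", "windows-1252", "gbk")
}
all_same = len(set(decoded.values())) == 1

# Non-ASCII bytes are where the default matters.
korean_bytes = "한글".encode("euc-kr")
as_korean = korean_bytes.decode("euc-kr")        # round-trips correctly
as_latin = korean_bytes.decode("windows-1252")   # mojibake, no error raised
```

Note that the wrong guess does not fail; it silently yields four
Latin-1 characters instead of two Hangul syllables, which is why the
choice of default is visible to the user.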

So, the claim that WinLatin1 is the de facto default does not hold (it's
rather Western-Euro-centric ;-) ).

>
> At least for the U.S. locale, Firefox and IE8 default to heuristics off. The
> user can enable heuristics for CJK and Cyrillic (in various groupings: all,
> both kinds of Chinese only, only Simplified Chinese, only Russian, etc.).
> Firefox (but not IE8) also supports a grouping for all of CJK (excluding
> Cyrillic).

I'm not sure about IE's encoding detection behavior. Even when the
detector is off, IE sometimes ran encoding detection anyway, at least in
the past. Otherwise, it would not have been exposed to the UTF-7
security issue.

Firefox never autodetected UTF-7 IIRC. Neither does Chrome.
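
For readers unfamiliar with the UTF-7 issue, a small Python
illustration of why autodetecting UTF-7 is dangerous. The payload is a
made-up example and Python's codec only stands in for a browser's
decoder.

```python
# The payload is pure ASCII and contains no literal "<" or ">", so a
# filter that only escapes those characters lets it through untouched.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"
has_angle_bracket = b"<" in payload

# Decoded as windows-1252 it stays inert text...
as_cp1252 = payload.decode("windows-1252")

# ...but a detector that guesses UTF-7 turns it into markup.
as_utf7 = payload.decode("utf-7")
```

Here as_utf7 is "<script>alert(1)</script>", while as_cp1252 is the
unchanged ASCII text.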

> In Opera, the heuristic detector may be enabled for any one of C, J and K or
> Cyrillic. (Opera doesn't seem to have Russian-only, Ukrainian-only,
> Traditional Chinese-only or Simplified Chinese-only modes.) It's unclear if
> Opera's default behavior is heuristics off or universal heuristics on.
> Perhaps someone from Opera could enlighten us.
>
> Safari doesn't have heuristic selection UI. It is unclear if Safari has no
> heuristics or whether it has always-on heuristics by default. Chrome's UI is
> slightly differently ambiguous.

Chrome is the same as Firefox except that it does not have
locale-specific encoding detectors: it has just a single universal
encoding detector based on the ICU encoding detection API. (Actually, the
code is checked into WebKit so that any WebKit-based browser can use
the facility.) The encoding detector is OFF by default for all
locales.



> (For clarity: Above I'm using the word "heuristics" to mean exclusively the
> frequency and chaining analysis on bytes.)
>
>> I'm interested in reducing ambiguity and making web transactions more
>> reliable, and associating a new version indicator (DOCTYPE) with a more
>> constrained default (charset default UTF8, rather than 'unknown') is
>> reasonable, while I also would be opposed to making an incompatible change
>> with actual current behavior.
>
>
> We already have 3 reliable version indicators for encoding axis of
> versioning:
> charset=utf-8 on the HTTP layer
> charset=utf-8 in <meta>
> the UTF-8 BOM
>
> We don't need a new indicator that wouldn't be as compatible with existing
> user agents as the indicators we already have. (Consider the Degrade
> Gracefully principle.)

I guess I agree with you here.
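
The three indicators quoted above are all cheap to check. A rough
sketch in Python; the function name, the order of the checks and the
1024-byte prescan window are assumptions of this sketch, not what any
browser or spec draft mandates.

```python
import re

UTF8_BOM = b"\xef\xbb\xbf"


def declared_encoding(body, content_type=None):
    """Return the encoding a page declares, or None if it is unlabelled.

    body is the raw response bytes; content_type is the value of the
    HTTP Content-Type header, if there was one.
    """
    # 1. The UTF-8 byte order mark at the very start of the body.
    if body.startswith(UTF8_BOM):
        return "utf-8"
    # 2. A charset parameter on the HTTP layer.
    if content_type:
        m = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type, re.I)
        if m:
            return m.group(1).lower()
    # 3. A crude scan of the first bytes for a <meta> charset declaration.
    m = re.search(rb"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)",
                  body[:1024], re.I)
    if m:
        return m.group(1).decode("ascii").lower()
    # No indicator: the caller falls back to the default or a detector.
    return None
```

Only when this returns None do the locale default and the detector
discussed earlier in the thread come into play.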

Jungshik

>
> On Jun 2, 2009, at 03:48, Leif Halvard Silli wrote:
>
>> Is it the choice of UTF-8 as default you don't understand? If so, then I'd
>> like to quote the "Support World Languages" principle.
>
> The Support World Languages principle is satisfied by HTML5 allowing authors
> easily to opt in to UTF-8. It has to be opt in due to the Support Existing
> Content and Degrade Gracefully principles.
>
> --
> Henri Sivonen
> hsivonen@iki.fi
> http://hsivonen.iki.fi/
>
>
>
>

Received on Saturday, 13 June 2009 00:30:00 UTC