Re: Auto-detect and encodings in HTML5 from Leif Halvard Silli on 2009-06-02 (public-html@w3.org from June 2009)

From: Leif Halvard Silli <lhs@malform.no>
Date: Tue, 02 Jun 2009 18:17:33 +0200
To: Henri Sivonen <hsivonen@iki.fi>
CC: Larry Masinter <masinter@adobe.com>, Chris Wilson <Chris.Wilson@microsoft.com>, Maciej Stachowiak <mjs@apple.com>, "M.T. Carrasco Benitez" <mtcarrascob@yahoo.com>, Travis Leithead <Travis.Leithead@microsoft.com>, Erik van der Poel <erikv@google.com>, "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>, Richard Ishida <ishida@w3.org>, Ian Hickson <ian@hixie.ch>, Harley Rosnow <Harley.Rosnow@microsoft.com>
Message-ID: <4A25509D.9040004@malform.no>

Henri Sivonen On 09-06-02 10.10:
> On Jun 1, 2009, at 20:44, Larry Masinter wrote:
> 
>> Chris, in your note below you claim that the "current de
>> facto" value was "Win1252" which seems to contradict what I
>> thought was claimed in another message that the "de facto"
>> default was "unknown" (which was my understanding, i.e., that
>> browsers used a wide variety of heuristics to determine
>> charset).
> 
> The de facto default is Windows-1252 except for locales where
> it isn't. [...]

>> I'm interested in reducing ambiguity and making web
>> transactions more reliable, and associating a new version
>> indicator (DOCTYPE) with a more constrained default (charset
>> default UTF8, rather than 'unknown') is reasonable, while I 
>> also would be opposed to making an incompatible change with
>> actual current behavior.
> 
> We already have 3 reliable version indicators for encoding axis
> of versioning:
>  * charset=utf-8 on the HTTP layer
>  * charset=utf-8 in <meta>
>  * the UTF-8 BOM
> 
> We don't need a new indicator that wouldn't be as compatible
> with existing user agents as the indicators we already have.
> (Consider the Degrade Gracefully principle.)
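
For reference, the three indicators listed above can be looked for 
roughly as in the Python sketch below. It is only an illustration: 
the function name and the regular expressions are made up for this 
mail, and a real HTML5 parser follows a more detailed prescan 
algorithm than a single pattern match.

```python
import re

UTF8_BOM = b"\xef\xbb\xbf"

# Simplified patterns (assumption): good enough for an illustration,
# not a replacement for the parser's own encoding prescan.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE
)
HEADER_CHARSET = re.compile(
    r"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def declared_encodings(body, content_type=None):
    """Return the encoding indicators found, keyed by where they came from."""
    found = {}
    # 1. charset parameter on the HTTP layer (Content-Type header value)
    if content_type:
        match = HEADER_CHARSET.search(content_type)
        if match:
            found["http"] = match.group(1).lower()
    # 2. the UTF-8 byte order mark at the very start of the document
    if body.startswith(UTF8_BOM):
        found["bom"] = "utf-8"
    # 3. charset in a <meta> element near the start of the document
    match = META_CHARSET.search(body[:1024])
    if match:
        found["meta"] = match.group(1).decode("ascii").lower()
    return found
```

An empty result is the case this thread is about: nothing is 
declared anywhere, and the consumer has to fall back on a default.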

Like several others, your reply does not incorporate the authoring
tools perspective that Larry contributed to this thread[1]. UTF-8
as the default encoding for HTML 5 documents already has wide
support[2][3][4][5].

The question is how to actually bring this into the draft. It has
to be more than a half-hearted recommendation. It should be more
in the direction of how UTF-8/UTF-16 is the default for XML - a
conformance requirement.

In several places the draft sets a requirement (for authors), only
to then tell how "real world" content should be treated (by browser
applications). The same approach should be possible w.r.t.
specifying UTF-8 as the default HTML 5 document encoding.

The goal should be that authors, when they choose to create an HTML
5 document, can take for granted that (conforming) tools default
to UTF-8, unless the author actively selects something else.

And since you are working on an authoring tool: Reflecting such a 
requirement in validators is a challenge. Validator.nu currently

- for "Text Field" validation, does not give any kind of
   reaction if authors fail to insert encoding information.
   This should be dealt with.
- for "Address" validation,
   * if there are no non-ASCII characters, and the encoding
     has not been declared, Validator.nu displays a warning
     but no error message. [6]
   * if such pages do contain non-ASCII characters,
     Validator.nu displays an error _and_ informs that
     it has assumed Windows 1252.[7]

Authors do not need to know, when validating, that browsers assume
Windows 1252, and even less so if UTF-8 is defined as the default
HTML 5 encoding. Validator.nu should not behave like a browser here
(which it doesn't do anyhow, as it does not seem to operate with
locale defaults); doing so will not bring any improvements.

Instead, Validator.nu should - for documents with the HTML 5 
doctype - assume the proposed _default_ charset - UTF-8.

Lack of any encoding info anywhere should always count as an error.
Specifically, lack of a meta element with charset info should
probably count as an error.
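
To make the difference concrete, here is a small Python sketch that 
puts the behaviour I observed for "Address" validation next to what 
I propose. The function names and the (level, assumed encoding) 
return values are invented for illustration and are not 
Validator.nu's actual code or API. What the proposal means for 
documents without the HTML 5 doctype is left open here.

```python
def has_non_ascii(body: bytes) -> bool:
    """True if any byte lies outside the ASCII range."""
    return any(byte > 0x7F for byte in body)


def is_html5_doctype(body: bytes) -> bool:
    """Rough check (assumption): the document starts with <!DOCTYPE html>."""
    return body.lstrip().lower().startswith(b"<!doctype html>")


def observed_behaviour(body: bytes, declared: bool):
    """What "Address" validation appeared to do when I tested it."""
    if declared:
        return ("ok", None)
    if has_non_ascii(body):
        # Error, plus a note that Windows 1252 was assumed.
        return ("error", "windows-1252")
    # Pure ASCII and nothing declared: warning only.
    return ("warning", None)


def proposed_behaviour(body: bytes, declared: bool):
    """Missing encoding info is always an error; HTML 5 falls back to UTF-8."""
    if declared:
        return ("ok", None)
    # Other doctypes are outside this proposal, hence no fallback here.
    fallback = "utf-8" if is_html5_doctype(body) else None
    return ("error", fallback)
```

With this, an undeclared pure-ASCII HTML 5 page moves from a warning 
to an error, and an undeclared page with non-ASCII bytes is read as 
UTF-8 instead of Windows 1252.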

Bob [2], Maciej [4], anything to add?

> On Jun 2, 2009, at 03:48, Leif Halvard Silli wrote:
> 
>> Is it the choice of UTF-8 as default you don't understand? If
>> so, then I'd like to quote the "Support World Languages"
>> principle.
> 
> The Support World Languages principle is satisfied by HTML5
> allowing authors easily to opt in to UTF-8. It has to be opt in
> due to the Support Existing Content and Degrade Gracefully
> principles.

Here you touch on the author perspective. However, HTML 4 already
"Supports World Languages", including Unicode. If this principle is
to have any meaning, then it must be applied to /extend/ the
support for world languages, including the "mixing of text in
different languages" - as called for in that principle. We
care for _real world support_, and real world support calls for
the definition of UTF-8 as the default charset.

[1] http://lists.w3.org/Archives/Public/public-html/2009Jun/0069
[2] http://lists.w3.org/Archives/Public/public-html/2009Jun/0068
[3] http://lists.w3.org/Archives/Public/public-html/2009Jun/0067
[4] http://lists.w3.org/Archives/Public/public-html/2009Jun/0066
[5] http://lists.w3.org/Archives/Public/public-html/2009Jun/0099
[6] http://validator.nu/?doc=http%3A%2F%2Fmalform.no%2Fhtml5%2FnoNonASCII.html&parser=html5
[7] http://validator.nu/?doc=http%3A%2F%2Fmalform.no%2Fhtml5%2FsomeNonASCII.html&parser=html5

-- 
leif halvard silli

Received on Tuesday, 2 June 2009 16:18:16 UTC