Re: Auto-detect and encodings in HTML5 from Albert Lunde on 2009-06-02 (public-html@w3.org from June 2009)

From: Albert Lunde <atlunde@panix.com>
Date: Tue, 2 Jun 2009 14:25:28 -0400
To: "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org"@panix.com
Message-ID: <20090602182527.GA4529@panix.com>

On Tue, Jun 02, 2009 at 06:27:45PM +0100, Geoffrey Sneddon wrote:
> On 2 Jun 2009, at 17:17, Leif Halvard Silli wrote:
>> The question is how to actually bring this into the draft. It has to  
>> be more than a half hearted recommendation. It should be more in the  
>> direction of how utf-8/-16 is the default for XML - a conformance  
>> requirement.
>
> One possibility would be to just say something like, "conforming  
> documents MUST be encoded as UTF-8 and declare themselves to be so".
>
> The biggest problem I see with that is that from an RFC 2119 point of  
> view using UTF-8 isn't required for interoperability (for a start, UAs  
> are required to support Windows-1252 as well).

* This talks about Windows-1252 as though it were advocated by some
kind of standard, when really it is just the most common default
interpretation for unlabeled content (in certain browsers, on certain
operating systems, in certain regions) and one of the common
forms of mislabeled content.
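
To make the ambiguity concrete, here is a minimal sketch (the function
name is made up for illustration) of how the same unlabeled bytes read
under three candidate encodings. It uses only Python's standard codecs
and is not a model of what any particular browser does.

```python
def interpretations(raw: bytes) -> dict:
    """Decode the same bytes under three candidate encodings.

    Hypothetical helper for illustration only. A value of None means
    the bytes are not valid in that encoding.
    """
    result = {}
    for name in ("utf-8", "iso-8859-1", "windows-1252"):
        try:
            result[name] = raw.decode(name)
        except UnicodeDecodeError:
            result[name] = None
    return result


# Bytes 0x93 and 0x94 are curly quotes in Windows-1252, but C1 control
# characters in ISO 8859-1, and not valid UTF-8 at all.
sample = interpretations(b"\x93quoted\x94")
```

With no label, nothing in the bytes themselves says which of these
readings the author intended.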

The existing specs are talking about ISO 8859-1, ISO 10646, and/or UTF-8.

Unlabeled/mislabeled content being interpreted in some
client-defined way has been a problem for internationalization since
before the HTML 2.0 spec.

As others have said, the default character encoding that works "best"
for humans using web browsers (though perhaps not best for security or
reliability) varies by the locale and audience.

* Security is improved by making character-encoding detection, and
content-sniffing, if any, unambiguous and relatively safe.

Performance is improved by making the process simple.

Neither security nor performance is served by adding multiple
levels of complex compatibility hacks.

* Declaring that, say, HTML5 documents should be encoded in UTF-8
and declare their encoding with a meta element early in the document
seems like a more limited aim than trying to profile deducing character
encoding based on DOCTYPE.
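
As a rough sketch of how simple such a check could be: the helper
below is hypothetical, and the regular expression and the 1024-byte
window are simplifying assumptions of mine, not wording from any draft.
It accepts a document only if the bytes decode as UTF-8 and an early
meta element says so.

```python
import re

# Assumption for illustration: a loose pattern that matches both
# <meta charset="..."> and the older http-equiv/content form.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def declares_utf8(data: bytes, window: int = 1024) -> bool:
    """Return True if data is valid UTF-8 and declares UTF-8 early on.

    'window' is how many leading bytes are searched for the meta
    element; its default is an assumption, not a quoted requirement.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    match = META_CHARSET.search(data[:window])
    return match is not None and match.group(1).lower() == b"utf-8"
```

Note that this involves no sniffing and no DOCTYPE inspection: one
decode attempt and one search of the start of the document.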

* However, the prevalence of tag soup indicates that the number of
non-conformant documents will continue to be high, regardless of specs.

-- 
    Albert Lunde  albert-lunde@northwestern.edu
                  atlunde@panix.com  (new address for personal mail)
                  albert-lunde@nwu.edu (old address)

Received on Tuesday, 2 June 2009 18:28:37 UTC