Re: Fallback to UTF-8 from Henri Sivonen on 2008-04-25 (www-validator@w3.org from April 2008)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Fri, 25 Apr 2008 15:07:55 +0300
To: W3C Validator Community <www-validator@w3.org>
Message-Id: <2DF1F082-2081-4B96-A76D-C951A1406AAD@iki.fi>

On Apr 25, 2008, at 13:19 , Jukka K. Korpela wrote:

> Henri Sivonen wrote:
>
>> My point is that while HTML 4.01
>> doesn't specify this properly, this is a solved problem (by HTML 5)
>
> You're joking, right?

Not at all.

> "HTML 5" is a collection of incomplete sketches.

Sketches, yes, but more complete than HTML 4.01.

> HTML 4.01 rather properly specifies how the encoding shall be  
> specified.
> Data that does not do that is outside the scope of the specification.

Right, so if you want to apply rules other than early and total  
failure, the pragmatic thing to do is to follow HTML 5.

> Did you notice the press news reporting that there are now more
> Internet users in China than in the US? Would it make sense for a
> browser used in China to assume windows-1252?

For a *browser* used in the PRC, it makes the most sense to default to
GBK when the encoding has not been declared. (Yes, GBK, not GB2312.  
Browsers treat GB2312 as GBK like they treat ISO-8859-1 as  
Windows-1252.)
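
To illustrate the label aliasing, here is a minimal Python sketch. The
table and the function name effective_encoding are made up for this
example and are not taken from any browser or from the HTML 5 draft;
only the two pairs mentioned above are listed.

```python
# Hypothetical alias table: declared label -> superset encoding actually used.
# Only the two pairs mentioned above are included; this is not a complete list.
SUPERSET_ALIASES = {
    "gb2312": "gbk",
    "iso-8859-1": "windows-1252",
}


def effective_encoding(label):
    """Return the encoding a browser-like consumer would decode with."""
    normalized = label.strip().lower()
    return SUPERSET_ALIASES.get(normalized, normalized)


# Byte 0x93 is a C1 control character in real ISO-8859-1,
# but a left double quotation mark in windows-1252.
data = b"\x93quoted\x94"
as_labelled = data.decode("iso-8859-1")
as_browsers_do = data.decode(effective_encoding("ISO-8859-1"))
```

With the alias applied, the two bytes come out as curly quotation marks
instead of invisible control characters, which is presumably why
browsers do it this way.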

>> How about "The character encoding of the document was not explicit
>> (assumed windows-1252) but the document contains non-ASCII."
>
> Everything from the "(" onwards is gibberish to most authors and also
> fairly misleading. There's no "ASCII" or "non-ASCII" when the encoding
> has not been specified.

Not true for HTML 5.
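
A rough sketch of the check behind the proposed message, in Python. It
assumes the caller already knows whether an encoding was declared and
that the fallback is ASCII-compatible; check_undeclared_encoding and
its parameters are hypothetical names, and windows-1252 appears only
because that is what the message text above says.

```python
def check_undeclared_encoding(raw, declared_encoding=None):
    """Return a warning string if no encoding was declared and the
    document contains bytes outside the ASCII range, else None."""
    if declared_encoding is not None:
        return None
    # For an ASCII-compatible fallback, any byte >= 0x80 is non-ASCII.
    if any(byte >= 0x80 for byte in raw):
        return ("The character encoding of the document was not explicit "
                "(assumed windows-1252) but the document contains non-ASCII.")
    return None
```

A pure-ASCII document without a declaration passes silently; the
warning only fires when the guess could actually change what the
reader sees.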

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Friday, 25 April 2008 12:08:37 UTC