Re: Fallback to UTF-8

On Apr 28, 2008, at 04:43, olivier Thereaux wrote:

> On 24-Apr-08, at 5:10 PM, Henri Sivonen wrote:
>> More precisely for text/html:
>> http://www.w3.org/html/wg/html5/#determining
>>
>> Step 7 defines Windows-1252 as the general default, which can be  
>> different in non-Western browser installations. Global online apps  
>> like validators should probably stick to Windows-1252.
>
> Henri, this is an interesting and important statement in the HTML5  
> spec. How does the group feel about the inconsistency this created  
> between the spec and defaults stated by other specifications, such as
>
> http://www.ietf.org/rfc/rfc2854.txt
> “ Section 3.7.1, defines that "media subtypes of the 'text' type are
> defined to have a default charset value of 'ISO-8859-1'".”
> (ditto RFC 2616)
>
> This is the inconsistency at the core of the issue, isn't it?
>
> I heard that the group working on HTTPbis had considered changing  
> the default, but had not managed to reach consensus yet.

I'd rather not speak for the HTML WG as a group. However, my own take  
on this is that what HTML 5 now says closely reflects what browsers  
already do. Specs that say something notably different will in all  
likelihood end up being irrelevant to writing software for consuming  
text/html content for non-validation purposes. I think it isn't useful  
for validators to diverge from other text/html consumers on this point.

> Is the HTML WG considering updating rfc2854?

Not to my knowledge, although the WG probably should in due course.

>> (The mention of UTF-8 there is a token gesture; the Web is a legacy  
>> system, so UTF-8 for non-legacy does not apply.)
>
> This sounds rather like a subjective statement, which I would be  
> wary of. Of course, the HTML5 spec is here to fix things in a  
> backward-compatible way, but specifications are forward looking, not  
> just back - and checkers are here in part to help move the landscape  
> futureward. Or, at least, so I am told all the time by the likes of  
> timbl :).
>
> I also note in the HTML5 specification:
> “Authors are encouraged to use UTF-8. Conformance checkers may  
> advise against authors using legacy encodings.”
>
> So is this a question of a future-looking default (utf8) versus  
> conservative default (win1252)? If so, I would argue that a checker  
> should favor utf8 first, and fall back to win1252 second, no?

I think futureward is *declared* UTF-8. Indeed HTML5 encourages  
authors to use UTF-8 but not by relying on defaulting without  
declaration.

For *general* (i.e. non-validator) HTML consumption advice, defaulting  
to UTF-8 is a rather bad idea given existing content. Windows-1252,  
GBK, Big5, Shift_JIS, EUC-KR, etc. depending on context are all better  
default guesses when the encoding has not been declared.
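
As a rough illustration of why a UTF-8 default fares badly on existing
content, here is a minimal Python sketch. The sample bytes and the
function names are made up for this example; they are not taken from
the HTML5 draft or from any validator.

```python
# Hypothetical legacy page content: "café" saved as Windows-1252
# with no encoding declaration anywhere.
LEGACY_BYTES = "café".encode("windows-1252")  # b'caf\xe9'


def decode_undeclared(data: bytes, default: str) -> str:
    """Decode undeclared bytes strictly with the given default encoding."""
    return data.decode(default)  # raises UnicodeDecodeError on invalid input


def utf8_default_fails(data: bytes) -> bool:
    """Return True if treating the bytes as UTF-8 fails outright."""
    try:
        decode_undeclared(data, "utf-8")
    except UnicodeDecodeError:
        return True
    return False
```

With these definitions, utf8_default_fails(LEGACY_BYTES) is True, while
decode_undeclared(LEGACY_BYTES, "windows-1252") recovers the original
text. Pure-ASCII input decodes identically under either default, which
is why the choice only matters once non-ASCII bytes show up.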

Anyway, I think the crux of what HTML5 says on this issue for  
validation is that non-declared non-ASCII is an error regardless of  
what ASCII-superset default was used to get far enough to detect that  
situation.
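
A checker following that reading could be sketched roughly as below.
This is only an assumption-laden sketch: the function name, the error
message, and the choice of Windows-1252 as FALLBACK are illustrative
and are not quoted from the spec or from an existing validator.

```python
from typing import List, Optional, Tuple

# Assumption: the general default discussed in this thread. Any other
# ASCII-superset default would trigger the same error below.
FALLBACK = "windows-1252"


def check_encoding(data: bytes, declared: Optional[str]) -> Tuple[str, List[str]]:
    """Decode a document and report undeclared non-ASCII as an error."""
    errors: List[str] = []
    if declared is None and any(byte > 0x7F for byte in data):
        # The error depends only on the bytes and the missing
        # declaration, not on which default is used afterwards.
        errors.append("non-ASCII bytes without an encoding declaration")
    encoding = declared or FALLBACK
    # Replace undecodable bytes so checking can continue past them.
    text = data.decode(encoding, errors="replace")
    return text, errors
```

For example, check_encoding(b"caf\xe9", None) still yields readable
text via the fallback but also reports one error, whereas the same
content declared as UTF-8 and encoded accordingly reports none. The
sketch does not handle unknown encoding labels.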

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Monday, 28 April 2008 16:15:02 UTC