RE: Auto-detect and encodings in HTML5

(this is a PERSONAL response)

Tomas observed:
> 
> UTF8 should be the last option in a set of rules; e.g.,
> 
>  - Get it from the HTTP header
>  - If not, get it from META
>  - If not, ...
>  - If not, UTF-8
> 

The problem with making UTF-8 the "last resort" encoding is that, ironically, it is possible to detect when something isn't UTF-8 and thus to know that the selected encoding is wrong (this is not true of most encodings). If a document really isn't UTF-8, its byte pattern will quite probably reveal that fact, although possibly only after an inconveniently large number of bytes have been read. So making UTF-8 the "last resort", and then presenting data in a way known to be flawed, seems less than ideal :-(. It might be better to offer the user the opportunity to correct the encoding, etc., in that case.
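
To illustrate the detectability point, here is a minimal sketch in Python. The function name looks_like_utf8 and the sample strings are mine, purely for illustration; it relies only on Python's strict UTF-8 decoder rejecting ill-formed byte sequences:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form well-formed UTF-8, False otherwise."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The same text serialized two ways (sample data, chosen for illustration).
utf8_bytes = "café".encode("utf-8")         # b'caf\xc3\xa9'
latin1_bytes = "café".encode("iso-8859-1")  # b'caf\xe9'

print(looks_like_utf8(utf8_bytes))    # well-formed UTF-8
print(looks_like_utf8(latin1_bytes))  # a lone 0xE9 is not valid UTF-8

# By contrast, ISO-8859-1 assigns a character to every byte value, so
# decoding with it never fails and a wrong guess cannot be noticed this way.
print(utf8_bytes.decode("iso-8859-1"))
```

Note that a check like this on only a prefix of the document can fail spuriously if the buffer ends in the middle of a multi-byte sequence, and pure ASCII always passes, so a real detector would need to handle both cases.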

UTF-8 might be a good guess higher up in the encoding detection stack, though, and by all means should be the "default" (that is, recommended) encoding for authoring Web documents. If encoding announcement (via meta or some other mechanism) can be required in HTML5, it would also be good to make UTF-8 the default encoding there.
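
As a rough sketch of what "higher in the stack" could look like (this is only my illustration, not proposed spec text; the function name choose_encoding, its parameters, and the windows-1252 fallback are all assumptions):

```python
from typing import Optional, Tuple


def choose_encoding(
    http_charset: Optional[str],
    meta_charset: Optional[str],
    data: bytes,
    fallback: str = "windows-1252",  # assumed legacy fallback, for illustration
) -> Tuple[str, bool]:
    """Return (encoding, needs_user_check) for a document.

    Declared encodings win. Otherwise UTF-8 is tried as a guess, because
    a wrong UTF-8 guess can be detected from the bytes themselves.
    """
    if http_charset:
        return http_charset, False
    if meta_charset:
        return meta_charset, False
    try:
        data.decode("utf-8", errors="strict")
        return "utf-8", False
    except UnicodeDecodeError:
        # Known not to be UTF-8: fall back, but flag the result so the
        # user can be offered the opportunity to correct the encoding.
        return fallback, True


print(choose_encoding(None, None, "café".encode("utf-8")))
print(choose_encoding(None, None, b"caf\xe9"))
```

The second return value is there so that a fallback guess is never silently presented as if it were known to be right.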

Addison

Addison Phillips
Globalization Architect -- Lab126
Chair -- W3C Internationalization WG

Internationalization is not a feature.
It is an architecture.

Received on Tuesday, 2 June 2009 01:09:11 UTC