Re: HTML5 Issue 11 (encoding detection): I18N WG response... from Henri Sivonen on 2009-08-20 (public-html@w3.org from August 2009)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Thu, 20 Aug 2009 09:39:04 +0300
To: "Phillips, Addison" <addison@amazon.com>
Cc: "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-Id: <01E726F0-AD26-4FB3-AB74-4E03F901DDD5@iki.fi>

On Aug 20, 2009, at 07:22, Phillips, Addison wrote:

> --
> Otherwise, return an implementation-defined or user-specified  
> default character encoding, with the confidence tentative. In non-
> legacy environments, the more comprehensive UTF-8 encoding is  
> recommended. Due to its use in legacy content, windows-1252 is  
> recommended as a default in predominantly Western demographics  
> instead. Since these encodings can in many cases be distinguished by  
> inspection, a user agent may heuristically decide which to use as a  
> default.
> --
>
> Our concerns about this text are:
>
> 1. It isn't clear what constitutes a "legacy" or "non-legacy  
> environment". We think that, for modern implementations, a bare  
> recommendation of UTF-8 would be preferable.

The Web is a legacy environment. New walled gardens that save on R&D  
costs by using HTML but have no interop requirements are non-legacy  
environments. IIRC, this wording exists only as a politically correct  
fig leaf.

> Otherwise, return an implementation-defined or user-specified  
> default character encoding, with the confidence tentative. The UTF-8  
> encoding is recommended as a default.

This recommendation, while politically correct, is useless to  
implementors.

> The default may also be set according to the expectations and  
> predominant legacy content encodings for a given demographic or  
> audience. For example, windows-1252 is recommended as the default  
> encoding for Western European language environments. Other encodings  
> may also be used. For example, "windows-949" might be an appropriate  
> default in a Korean language runtime environment.

I think this wording would be an improvement.
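
For illustration only, a minimal sketch of how a user agent might pick such a locale-dependent default. The function name, the table and the list of Western European language codes are hypothetical and not from the spec text; only the windows-1252 and windows-949 choices come from the proposed wording above. The UTF-8 fallback follows the proposed wording, not what I said above about implementors.

```python
# Hypothetical table: primary language subtag -> legacy default encoding.
# Only the Korean entry is taken from the proposed wording; a real table
# would need many more entries.
LOCALE_DEFAULTS = {
    "ko": "windows-949",
}

# Assumption: a rough, incomplete set of Western European language codes.
WESTERN_EUROPEAN = {"en", "fr", "de", "es", "it", "pt", "nl", "sv", "da", "fi"}


def default_encoding(locale_tag):
    """Return (encoding, confidence) for a locale tag such as 'ko-KR'."""
    # Normalize 'fi_FI' and 'fi-FI' alike, then keep the primary subtag.
    lang = locale_tag.replace("_", "-").split("-")[0].lower()
    if lang in LOCALE_DEFAULTS:
        return LOCALE_DEFAULTS[lang], "tentative"
    if lang in WESTERN_EUROPEAN:
        return "windows-1252", "tentative"
    # Fallback when nothing is known about the audience.
    return "utf-8", "tentative"
```

The confidence is always tentative here, since a default of this kind is a guess about the audience and not something read from the document.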

> 4. We suggest adding to step (6) this note:
>
> --
> Note: The UTF-8 encoding has a highly detectable bit pattern.  
> Documents that contain bytes > 0x7F which match the UTF-8 pattern  
> are very likely to be UTF-8, while documents that do not match it  
> definitely are not. While not full autodetection, it may be  
> appropriate for a user-agent to search for this common encoding.


I think adding this note makes sense.
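
A rough sketch of the check the note describes, assuming the user agent has the whole byte stream (or a prefix that does not end in the middle of a multi-byte sequence) at hand. The function name is hypothetical; strict decoding stands in for matching the UTF-8 bit pattern.

```python
def sniff_utf8(data):
    """Return 'utf-8' if data looks like UTF-8, otherwise None.

    Pure-ASCII input returns None: with no bytes > 0x7F there is no
    evidence for or against UTF-8, so the caller's default applies.
    """
    if all(b < 0x80 for b in data):
        return None
    try:
        # Strict decoding rejects any byte sequence that does not
        # match the UTF-8 pattern.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
    return "utf-8"
```

For example, sniff_utf8("é".encode("utf-8")) returns 'utf-8', while sniff_utf8("é".encode("cp1252")) returns None, because a lone 0xE9 byte is not a valid UTF-8 sequence. A prefix cut in the middle of a character would be rejected by this sketch, so a real implementation would need to tolerate a truncated final sequence.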

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Thursday, 20 August 2009 06:39:52 UTC