- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Thu, 20 Aug 2009 09:39:04 +0300
- To: "Phillips, Addison" <addison@amazon.com>
- Cc: "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
On Aug 20, 2009, at 07:22, Phillips, Addison wrote:

> --
> Otherwise, return an implementation-defined or user-specified
> default character encoding, with the confidence tentative. In
> non-legacy environments, the more comprehensive UTF-8 encoding is
> recommended. Due to its use in legacy content, windows-1252 is
> recommended as a default in predominantly Western demographics
> instead. Since these encodings can in many cases be distinguished by
> inspection, a user agent may heuristically decide which to use as a
> default.
> --
>
> Our concerns about this text are:
>
> 1. It isn't clear what constitutes a "legacy" or "non-legacy
> environment". We think that, for modern implementations, a bare
> recommendation of UTF-8 would be preferable.

The Web is a legacy environment. New walled gardens that are saving in R&D cost by using HTML but that don't have any interop requirements are non-legacy environments.

IIRC, this wording exists only as a politically correct fig leaf.

> Otherwise, return an implementation-defined or user-specified
> default character encoding, with the confidence tentative. The UTF-8
> encoding is recommended as a default.

This recommendation, while politically correct, is useless to implementors.

> The default may also be set according to the expectations and
> predominant legacy content encodings for a given demographic or
> audience. For example, windows-1252 is recommended as the default
> encoding for Western European language environments. Other encodings
> may also be used. For example, "windows-949" might be an appropriate
> default in a Korean language runtime environment.

I think this wording would be an improvement.

> 4. We suggest adding to step (6) this note:
>
> --
> Note: The UTF-8 encoding has a highly detectable bit pattern.
> Documents that contain bytes > 0x7F which match the UTF-8 pattern
> are very likely to be UTF-8, while documents that do not match it
> definitely are not. While not full autodetection, it may be
> appropriate for a user-agent to search for this common encoding.

I think adding this note makes sense.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
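
For illustration, the check described in the proposed note could look roughly like the Python sketch below. This is only a sketch under stated assumptions: the function name `sniff_default_encoding` and its `fallback` parameter are hypothetical and do not come from the spec text or from any browser, and windows-1252 appears as the fallback only because the quoted proposal names it as a Western European default. The sketch also assumes the whole byte stream is available at once.

```python
def sniff_default_encoding(data: bytes, fallback: str = "windows-1252") -> str:
    """Pick a tentative default encoding for a document with no declared encoding.

    Hypothetical helper, not part of any specification or library.
    Returns "utf-8" only when the data contains at least one byte > 0x7F
    and the whole input is well-formed UTF-8; otherwise returns `fallback`.
    """
    # Pure ASCII gives no evidence either way, so keep the configured default.
    if not any(byte > 0x7F for byte in data):
        return fallback

    try:
        # Strict decoding rejects any byte sequence that is not valid UTF-8.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Per the note: bytes that do not match the UTF-8 pattern
        # mean the document is definitely not UTF-8.
        return fallback

    # High bytes that all match the pattern: very likely UTF-8.
    return "utf-8"
```

A user agent that inspects only a prefix of the stream would also have to tolerate a multi-byte sequence cut off at the end of its buffer, which this sketch does not attempt. The fallback would be chosen per locale, as in the quoted text.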
Received on Thursday, 20 August 2009 06:39:50 UTC