- From: Boris Zbarsky <bzbarsky@MIT.EDU>
- Date: Tue, 17 Feb 2004 20:59:20 -0600
- To: Ian Hickson <ian@hixie.ch>
- Cc: Bert Bos <bert@w3.org>, www-style@w3.org
Ian Hickson wrote:

> That would be compliant, I think. It should be equivalent to the following
> algorithm.

Unfortunately, your algorithm makes some unwarranted assumptions about the information the user agent has... and the sort of encodings it can handle. And the two algorithms are not quite equivalent in edge cases like people specifying UTF-16 without LE or BE and with no BOM.

> 0) Set the set of encodings to include all known encodings.

There may be no way to get such a list from the Unicode conversion system...

> 1) If there is an HTTP Content-Type header, reduce the set of encodings
>    to the set of encodings that the Content-Type header covered. (e.g.
>    if it said "text/css;charset=utf-16" then the set would be UTF-16LE,
>    UTF-16BE.)

This requires either hardcoding knowledge of that sort in the user agent or assuming that the Unicode conversion system has an API for this. I don't believe iconv has such an API, for example. Do Windows or Mac OS X have such APIs? Hardcoding means that if new encodings appear and your Unicode conversion system adds support for them, you're still broken.

> 2) See if you can detect a BOM. If so, use that to reduce the set of
>    encodings to the set of encodings that have that BOM.

This requires either hardcoding knowledge of what a BOM looks like in various encodings, having an API to ask the Unicode conversion system for that information, or trying to encode a BOM in each encoding...

Given my admittedly limited knowledge of intl issues in today's OSes, the algorithm you propose is not feasibly implementable, especially if you aim to support a broad range of OSes. Once again, I would love to be proved wrong.

> One would hope, given the existence of Unicode, that we will not be seeing
> new encodings any more (except in specialist fields such as Punycode for
> IDN, but that doesn't really count).

Given Ernest's post about CESU-8, that hope seems highly unwarranted... :(

-Boris
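
To make the hardcoding concern in steps 1 and 2 concrete, here is a rough sketch in Python. It is not code from this thread or from any user agent: the names BOM_TABLE, LABEL_FAMILIES, candidates_from_label, bom_matches and narrow are all made up for illustration, and both tables are exactly the sort of built-in knowledge that goes stale when the conversion library learns a new encoding.

```python
import codecs

# Hardcoded BOM knowledge (hypothetical table): encoding name -> BOM bytes.
# Nothing here is queried from the conversion library itself.
BOM_TABLE = {
    "UTF-8": codecs.BOM_UTF8,
    "UTF-16LE": codecs.BOM_UTF16_LE,
    "UTF-16BE": codecs.BOM_UTF16_BE,
    "UTF-32LE": codecs.BOM_UTF32_LE,
    "UTF-32BE": codecs.BOM_UTF32_BE,
}

# Hardcoded knowledge of which concrete encodings a charset label covers.
LABEL_FAMILIES = {
    "utf-16": {"UTF-16LE", "UTF-16BE"},
    "utf-32": {"UTF-32LE", "UTF-32BE"},
}


def candidates_from_label(label):
    """Step 1: map a Content-Type charset label to a set of encodings."""
    key = label.strip().lower()
    # Labels we know nothing about are assumed to name a single encoding.
    return set(LABEL_FAMILIES.get(key, {key.upper()}))


def bom_matches(data):
    """Step 2: return every encoding whose BOM is a prefix of the data."""
    return {name for name, bom in BOM_TABLE.items() if data.startswith(bom)}


def narrow(label, data):
    """Apply step 1, then step 2 if any BOM was found."""
    candidates = candidates_from_label(label)
    matches = bom_matches(data)
    if matches:
        candidates &= matches
    return candidates
```

With a label of "utf-16" and no BOM, narrow() is left with both UTF-16LE and UTF-16BE, which is the edge case mentioned at the top of this message. An encoding such as CESU-8 simply does not exist as far as these tables are concerned until someone edits them by hand.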
Received on Tuesday, 17 February 2004 21:59:26 UTC