- From: Boris Zbarsky <bzbarsky@MIT.EDU>
- Date: Tue, 17 Feb 2004 20:15:34 -0600
- To: Ian Hickson <ian@hixie.ch>
- Cc: Bert Bos <bert@w3.org>, www-style@w3.org
Ian Hickson wrote:
>>But a two-byte LE BOM followed by @charset "Mysomething"; would be
>>treated as "Mysomething"? Or what?
>
> Is Mysomething a two-byte encoding which has the same BOM codepoint as a
> UTF-16LE BOM? If yes, then yes, since Mysomething doesn't contradict the
> BOM, but merely clarify it. Otherwise, no. This is based on the fact that
> the BOM is higher on the list, so overrides the @charset.

So in other words, in the proposed setup one would need to do the
following to be robust wrt new encodings that may appear:

1) See whether there is a BOM.
2) Parse the @charset rule.
3) If there is an @charset rule and a BOM, encode U+FEFF using the
   charset from the @charset rule and see whether this agrees with the
   BOM's representation. If it does, use the @charset charset. If not,
   guess a charset based on the BOM's encoding.

Sound right?

> It means, e.g., that if you know the encoding is UTF-8 based on the
> Content-Type header, you ignore a leading BOM.

Ah, ok.

> As far as I know the only encodings that can represent U+FEFF are:

I should clarify that I am by no means an intl expert. Hence all the
questions. It's just that I tend to assume that any situation like the
one we have with encodings will deteriorate (that is, more random
encodings will appear, possibly overlapping existing ones in most
undesirable ways). I would dearly like to be wrong in this, of course. ;)

> True, but it seems likely that the encoding of the @charset is more likely
> to be right than the given encoding.

Fair enough.

-Boris
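
For illustration, the three-step check described in the message above could be sketched roughly as follows. This is only a sketch under stated assumptions, not anything taken from the proposal text: the names `BOMS`, `sniff_bom` and `pick_encoding` are hypothetical, the @charset rule is matched with a simplistic regular expression rather than a real CSS tokenizer, and the BOM table covers only the UTF-8/16/32 signatures. Codecs that prepend their own BOM when encoding (such as Python's plain "utf-16") are not handled and simply fall back to the BOM-based guess.

```python
import codecs
import re

# Known BOM signatures, longest first so that UTF-32LE (FF FE 00 00)
# is not mistaken for UTF-16LE (FF FE). Assumption: only these are checked.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Simplistic stand-in for parsing the @charset rule (assumption).
CHARSET_RE = re.compile(r'@charset "([^"]*)";')


def sniff_bom(data):
    """Step 1: return (bom_bytes, guessed_encoding), or None if no BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return bom, name
    return None


def pick_encoding(data):
    """Return the encoding name chosen by the BOM/@charset comparison."""
    found = sniff_bom(data)
    if found is None:
        return None  # no BOM: outside the case discussed here
    bom, guessed = found

    # Step 2: read the @charset rule, decoding with the BOM-based guess.
    head = data[len(bom):len(bom) + 1024].decode(guessed, errors="replace")
    match = CHARSET_RE.match(head)
    if match is None:
        return guessed
    declared = match.group(1)

    # Step 3: encode U+FEFF with the declared charset and compare the
    # result with the bytes actually found at the start of the sheet.
    try:
        encoded = "\ufeff".encode(declared)
    except (LookupError, ValueError):
        # Unknown charset, or one that cannot represent U+FEFF.
        return guessed
    return declared if encoded == bom else guessed
```

With this sketch, a sheet starting with FF FE followed by @charset "UTF-16LE"; keeps the declared name, while the same BOM followed by @charset "UTF-16BE"; or @charset "iso-8859-1"; falls back to the guess based on the BOM.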
Received on Tuesday, 17 February 2004 21:15:37 UTC