- From: Boris Zbarsky <bzbarsky@MIT.EDU>
- Date: Tue, 17 Feb 2004 17:54:49 -0600
- To: Bert Bos <bert@w3.org>
- Cc: www-style@w3.org
Bert Bos wrote:
> 1. HTTP header
> 2. BOM
> 3. @charset
> 4. etc.
>
> But this is complicated material, so: does anybody see a problem with
> this?

Sure. Different encodings can have the same BOM (e.g. UTF-16 and UCS-2, but there may also be other cases that are not quite so trivial).

In case anyone is interested in what Mozilla does right now, the basic algorithm for this step is:

1) Check for a BOM. If there is one, set the "generic encoding family" (# of bytes per ASCII character and byte order) based on that.

2) If there is no BOM, see whether the first couple of bytes look like a '@' in some random encoding. If they do, set the "generic encoding family" based on what it looks like.

3) If there was a '@', try to parse the whole @charset rule, using the "generic encoding family" info to find the actual ASCII chars (e.g. for UTF-16/UCS-2/whatever, look at every other byte).

4) If there wasn't an @charset rule but we got a "generic encoding family" based on the BOM, go with the most common or inclusive charset we know of that fits those criteria (e.g. we would go with UTF-16 over UCS-2).

Steps 1, 2, and 3 are basically what the XML 1.0 spec describes as the way to handle XML decls and BOMs (this is the spec that's referenced in the encoding selection section of CSS2, iirc).

-Boris
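
For readers who want something concrete, the four steps in the message above can be sketched roughly as follows. This is only an illustration, not Mozilla's actual code: the function name sniff_css_charset, the BOMS and AT_PATTERNS tables, the 1024-unit scan limit, and the fallback charset names are all assumptions made for the example.

```python
import re

# Hypothetical table: (BOM bytes, bytes per ASCII char, byte order, fallback charset).
# 4-byte BOMs come first so that FF FE 00 00 is not mistaken for a UTF-16 BOM.
BOMS = [
    (b"\x00\x00\xfe\xff", 4, "big", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", 4, "little", "UTF-32LE"),
    (b"\xef\xbb\xbf", 1, None, "UTF-8"),
    (b"\xfe\xff", 2, "big", "UTF-16BE"),
    (b"\xff\xfe", 2, "little", "UTF-16LE"),
]

# Hypothetical table: what a leading '@' looks like in each encoding family.
AT_PATTERNS = [
    (b"\x00\x00\x00@", 4, "big"),
    (b"@\x00\x00\x00", 4, "little"),
    (b"\x00@", 2, "big"),
    (b"@\x00", 2, "little"),
    (b"@", 1, None),
]

CHARSET_RULE = re.compile(rb'@charset\s+"([^"]*)"\s*;')


def sniff_css_charset(data: bytes):
    """Return a charset name guessed from the BOM and/or @charset rule, or None."""
    family = None    # (bytes per ASCII char, byte order)
    fallback = None  # only set when a BOM was seen (step 4)
    body = data

    # Step 1: a BOM fixes the generic encoding family.
    for bom, width, order, default in BOMS:
        if data.startswith(bom):
            family, fallback, body = (width, order), default, data[len(bom):]
            break

    # Step 2: no BOM, so see whether the first bytes look like '@'.
    if family is None:
        for pattern, width, order in AT_PATTERNS:
            if data.startswith(pattern):
                family = (width, order)
                break

    if family is None:
        return None  # nothing found here; later steps in the cascade decide

    # Step 3: pick out the byte of each code unit that carries the ASCII
    # value (every other byte for 2-byte families) and parse @charset.
    width, order = family
    offset = width - 1 if order == "big" else 0
    ascii_view = body[offset:1024 * width:width]
    match = CHARSET_RULE.match(ascii_view)
    if match:
        return match.group(1).decode("ascii", "replace")

    # Step 4: no @charset rule; use the BOM-based default, if any.
    return fallback
```

Note that the sketch trusts the @charset name whenever one is found, even if it disagrees with the BOM; whether that is the right resolution is exactly the question being discussed in this thread.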
Received on Tuesday, 17 February 2004 18:54:52 UTC