Re: [CSS21] response to issue 115 (and 44)

[Sorry, long and complicated stuff ahead...]

Finding the encoding of a file is a complicated problem, not just
because it is hard for spec writers and programmers to imagine what a
program actually sees when the encoding is wrong, but also for other
reasons:

  - Most HTTP servers don't send the charset param, and we're not
    going to change that overnight.

  - Both the BOM and the @charset are basically hacks to work around
    such HTTP servers (and for files served over less capable
    protocols, such as FTP or local files). In an ideal world, the
    author wouldn't need to know the encoding his authoring tool uses.

  - Since CSS has @charset (a complete solution), it doesn't need a
    BOM (a partial solution), but many text editors insert the BOM
    without the author knowing it, so CSS can't forbid it without
    confusing a lot of people. (And it is true that knowing whether
    your system is big-endian or not is hard; it is easier to just say
    it's UTF-16 and add a BOM.)

  - So we're stuck with files that have nothing, an HTTP header, a
    @charset, a BOM or any combination of the latter three. No problem
    if they all agree, but what if they conflict?

  - W3C's I18N group wants to promote UTF-8 and would like every new
    spec that comes out to explicitly say that UTF-8 is recommended,
    and even that UTF-8 should be taken as the default, after any
    explicit rules and instead of whatever heuristics a browser
    applies.

On the other hand, it may be a hard problem, but we may not have to
solve it yet. BOM and @charset are very rare in style sheets on the
Web. Cases where they conflict with each other or with the HTTP header
are even harder to find. So if we have interoperability on the cases
that are obviously correct, that may be good enough for CSS 2.1. We
can then take some time to discuss what we want for CSS3.

Meanwhile, it seems all implementations agree to ignore the BOM and
@charset if there is an HTTP header. OK, one problem solved.

They also seem to agree that in the absence of header, BOM and
@charset, the heuristic that works best in practice is to assume the
same encoding as the document that linked to this one. Another problem
solved.

For the BOM and the @charset, browsers show no clear interoperability.
A small test I did (http://www.w3.org/Style/Examples/010/) seemed to
show the following:

Explanation:
  - the column "file" links to the test; "*" is a dubious case
  - the column "HTML" shows the encoding of the HTML file;
  - "BOM" says what BOM, if any, the linked CSS file has;
  - "@charset" gives the @charset of the linked CSS, if any;
  - "F0.8" is Firefox 0.8 (Linux)
  - "K3.2" is Konqueror 3.2.0 (Linux)
  - "O7.5" is Opera 7.50 P1 (Linux)
  - "S1.2" is Safari 1.2 (Mac OS X)
  - "I5.2" is Internet Explorer 5.2 (Mac OS X)

file | HTML       | BOM    | @charset   | F0.8 | K3.2 | O7.5 | S1.2 | I5.2
-----+------------+--------+------------+------+------+------+------+------
[1]  | iso-8859-1 | -      | iso-8859-1 | PASS | PASS | PASS | PASS | 6)
[2]  | iso-8859-1 | utf-16 | -          | PASS | PASS | PASS | PASS | 7)
[3]  | iso-8859-1 | utf-16 | utf-16     | 1)   | PASS | PASS | PASS | 7)
[4]* | iso-8859-1 | utf-16 | iso-8859-1 | 1)   | PASS | PASS | PASS | 7)
[5]  | iso-8859-1 | utf-8  | utf-8      | PASS | PASS | PASS | PASS | 6)
[6]* | iso-8859-1 | utf-8  | iso-8859-1 | 2)   | PASS | PASS | PASS | 8)
[7]  | iso-8859-1 | utf-8  | -          | PASS | PASS | PASS | PASS | 6)
[8]  | iso-8859-1 | -      | utf-8      | PASS | 3)   | PASS | 5)   | 6)
[9]  | utf-8      | -      | -          | PASS | 4)   | PASS | 5)   | 8)

(1) it seems F 0.8 doesn't read utf-16 style sheets that have @charset...
(2) it seems F 0.8 ignores the style sheet if the BOM and @charset conflict
(3) it seems K 3.2 consistently ignores the @charset, BOM or no BOM
(4) it seems K 3.2 defaults to OS default encoding
(5) it seems S 1.2 is like K 3.2 (not too surprising)
(6) this is surprising; no idea why this fails
(7) it seems I 5.2 doesn't read utf-16 style sheets at all
(8) no explanation suggests itself

[1] http://www.w3.org/Style/Examples/010/iso-8859-1-correct.html
[2] http://www.w3.org/Style/Examples/010/utf-16-bom.html
[3] http://www.w3.org/Style/Examples/010/utf-16-correct.html
[4] http://www.w3.org/Style/Examples/010/utf-16-incorrect.html
[5] http://www.w3.org/Style/Examples/010/utf-8-bom-correct.html
[6] http://www.w3.org/Style/Examples/010/utf-8-bom-incorrect.html
[7] http://www.w3.org/Style/Examples/010/utf-8-bom.html
[8] http://www.w3.org/Style/Examples/010/utf-8-no-bom-correct.html
[9] http://www.w3.org/Style/Examples/010/utf-8-none.html

(I checked the test files carefully, but I may have made an error, of
course; it's quite easy to create invalid XHTML or inadvertently
re-encode a file...)
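
One way to avoid an editor silently re-encoding a test file is to
write the bytes directly. The following is only a minimal sketch in
Python, not how the files above were made: the helper name
make_stylesheet, the sample rule in the body and the choice of
big-endian UTF-16 are all assumptions for illustration.

```python
import codecs

# BOM bytes for the encodings in the table. Big-endian UTF-16 is an
# assumption; the table does not say which byte order the tests use.
BOMS = {"utf-8": codecs.BOM_UTF8, "utf-16-be": codecs.BOM_UTF16_BE}


def make_stylesheet(encoding, with_bom, charset, body="p { color: green }\n"):
    """Return the raw bytes of a CSS file.

    encoding -- the codec the bytes are really written in
    with_bom -- whether to put a BOM in front
    charset  -- the label to declare in @charset (may deliberately
                disagree with `encoding`), or None for no @charset
    """
    rule = f'@charset "{charset}";\n' if charset else ""
    data = (rule + body).encode(encoding)
    return (BOMS[encoding] if with_bom else b"") + data
```

For example, make_stylesheet("utf-8", True, "iso-8859-1") gives a
file of the kind used in test [6]: a UTF-8 BOM followed by a
conflicting @charset.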


So, if we assume that we can change the browsers in time, what do we
want in CSS3? I'd say this:

 1) Trust the HTTP header (or similar out-of-band information in other
    protocols). If the file then appears to start with a U+FEFF
    character, ignore it. If there is a @charset at the start or after
    that U+FEFF, ignore it. Otherwise, start parsing at the first
    character.

 2) If the header gives no encoding, try to recognize a U+FEFF
    and/or @charset in various encodings (see algorithm below). Then
    use the encoding that worked to parse the remainder of the file.

 3) If neither the header nor looking for U+FEFF or @charset yields an
    encoding, but this style sheet was loaded because a document
    linked to it (or linked to a style sheet that in turn linked to
    it, recursively), then use the encoding of the document (or
    style sheet) that linked to this one.

 4) If all else fails, assume UTF-8.

(3) is a heuristic, of course, but it seems the best one. The
alternatives are to assume UTF-8 or to ask the user. Asking the user
seems impractical. Assuming UTF-8 would be good, but as UTF-8 becomes
more widely used for HTML, inheriting the linking document's encoding
will have nearly the same effect anyway.
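
The order of precedence in rules 1 to 4 can be written down in a few
lines. This is just a sketch in Python; the function name
choose_encoding and its three parameters are made up for the
illustration, and each parameter is None when that source of
information is absent.

```python
from typing import Optional


def choose_encoding(http_charset: Optional[str],
                    sniffed: Optional[str],
                    referrer_encoding: Optional[str]) -> str:
    """Pick an encoding by applying rules 1 to 4 in order."""
    if http_charset:
        # 1) Out-of-band information wins. A BOM or @charset in the
        #    file is still skipped, but it has no say in the encoding.
        return http_charset
    if sniffed:
        # 2) A BOM and/or @charset was recognized in the bytes.
        return sniffed
    if referrer_encoding:
        # 3) Inherit from the document or style sheet that linked here.
        return referrer_encoding
    # 4) If all else fails, assume UTF-8.
    return "utf-8"
```

So choose_encoding(None, None, "iso-8859-1") yields "iso-8859-1",
which is the case of test [9] with the roles of the two encodings
swapped.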

I also omitted the CHARSET parameter of the LINK element in HTML. Is
that a problem?

The algorithm for (2) would be as follows:

  2a) If the first bytes are 00 00 FE FF, use UCS-4 (1234 order).
      Remove those bytes. If they are followed by "@charset
      <anything>;" remove that as well.

  2b) If the first bytes are FF FE 00 00, use UCS-4 (4321 order).
      Remove those bytes. If they are followed by "@charset
      <anything>;" remove that as well.

  2c) If the first bytes are 00 00 FF FE, use UCS-4 (2143 order).
      Remove those bytes. If they are followed by "@charset
      <anything>;" remove that as well.

  2d) If the first bytes are FE FF 00 00, use UCS-4 (3412 order).
      Remove those bytes. If they are followed by "@charset
      <anything>;" remove that as well.

  2e) If the first bytes are FE FF and the next two bytes are not
      00 00, use UTF-16-BE. Remove the first two bytes. If they are
      followed by "@charset <anything>;", remove that as well.

  2f) If the first bytes are FF FE and the next two bytes are not
      00 00, use UTF-16-LE. Remove the first two bytes. If they are
      followed by "@charset <anything>;", remove that as well.

  2g) If the first bytes are EF BB BF, use UTF-8.
      Remove those bytes. If they are followed by "@charset
      <anything>;" remove that as well.

  2h) For all encodings X that the UA knows, starting with UTF-8,
      UTF-16-BE and UTF-16-LE, if the first bytes correspond to
      '@charset "X";' (case-insensitive) in encoding X, use that
      encoding X and remove those bytes.
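
As a rough illustration, steps 2a to 2h could be sketched in Python
as follows. This is not taken from any implementation: the names
sniff and charset_rule_length, the label strings, the 1024-byte
look-ahead and the default list of known encodings are all
assumptions. Cases 2a and 2b are mapped to Python's UTF-32 codecs;
the unusual byte orders of 2c and 2d are only recognized, because
Python has no codec for them. Since the four-byte patterns are tested
first, any remaining FE FF or FF FE prefix is treated as UTF-16, so a
UTF-16-BE file that starts with "@" (bytes FE FF 00 40) is accepted.

```python
import codecs
import re

# (BOM bytes, label, Python codec or None), in the order 2a to 2g.
# The four-byte patterns must precede the two-byte ones they overlap.
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be", "utf-32-be"),  # 2a: 1234 order
    (b"\xff\xfe\x00\x00", "utf-32-le", "utf-32-le"),  # 2b: 4321 order
    (b"\x00\x00\xff\xfe", "ucs-4-2143", None),        # 2c: no codec
    (b"\xfe\xff\x00\x00", "ucs-4-3412", None),        # 2d: no codec
    (b"\xfe\xff", "utf-16-be", "utf-16-be"),          # 2e
    (b"\xff\xfe", "utf-16-le", "utf-16-le"),          # 2f
    (b"\xef\xbb\xbf", "utf-8", "utf-8"),              # 2g
]

# "@charset <anything>;" at the very start of the decoded text.
CHARSET_RULE = re.compile(r"@charset [^;]*;")


def charset_rule_length(data, codec, limit=1024):
    """Byte length of a leading "@charset <anything>;" in `codec`, or 0."""
    decoder = codecs.getincrementaldecoder(codec)()
    try:
        # Only a bounded prefix is decoded; an incomplete trailing
        # character is buffered by the decoder rather than rejected.
        head = decoder.decode(data[:limit])
    except UnicodeDecodeError:
        return 0
    match = CHARSET_RULE.match(head)
    return len(match.group(0).encode(codec)) if match else 0


def sniff(data, known=("utf-8", "utf-16-be", "utf-16-le", "iso-8859-1")):
    """Return (label, bytes_to_skip), or (None, 0) if nothing matched."""
    # 2a to 2g: a BOM decides the encoding; a following @charset is
    # skipped whatever it says.
    for bom, label, codec in BOMS:
        if data.startswith(bom):
            skip = len(bom)
            if codec is not None:
                skip += charset_rule_length(data[skip:], codec)
            return label, skip
    # 2h: no BOM, so look for '@charset "X";' written in encoding X.
    for enc in known:
        prefix = f'@charset "{enc}";'.encode(enc)
        # bytes.lower() only touches ASCII letters, which is enough
        # for a case-insensitive match against an ASCII label.
        if data[:len(prefix)].lower() == prefix.lower():
            return enc, len(prefix)
    return None, 0
```

The caller would decode data[skip:] with the returned label, or fall
back to rule 3 when the label is None.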

The cases marked with "*" in my tests above thus would not be errors
(but should still give warnings in the CSS validator).

Some UAs may not handle all of the encodings. We could add to the
various profiles (TV, mobile, print...) which encodings are to be
supported as a minimum. Or just leave that to the market.

Unfortunately, that's quite a bit of code to write... :-(



But what about CSS 2.1? 

If we use the above in CSS 2.1 also, the question becomes whether we
will have two implementations in the next few months. Because for
CSS 2.1 to make any sense, it should become a Recommendation soon, say
before October. Otherwise we might as well skip it and wait for CSS3.

But so far, only Opera passes my little test.

Should we keep rule (2) vague, and say that UAs must use the BOM and
the @charset, but that we don't yet define what happens if the BOM and
the @charset conflict?



Bert
-- 
  Bert Bos                                ( W 3 C ) http://www.w3.org/
  http://www.w3.org/people/bos/                              W3C/ERCIM
  bert@w3.org                             2004 Rt des Lucioles / BP 93
  +33 (0)4 92 38 76 92            06902 Sophia Antipolis Cedex, France

Received on Friday, 20 February 2004 17:26:22 UTC