Re: [CSS21] response to issue 115 (and 44) from Bjoern Hoehrmann on 2004-02-21 (www-style@w3.org from February 2004)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Sat, 21 Feb 2004 22:15:30 +0100
To: Boris Zbarsky <bzbarsky@MIT.EDU>
Cc: "WWW Style" <www-style@w3.org>
Message-ID: <4038b3ed.785480100@smtp.bjoern.hoehrmann.de>
* Boris Zbarsky wrote:
>> It should also be pointed out, that (at least for HTTP and MIME)
>> explicit information in the header is required, otherwise processors
>> would never read a BOM or @charset because the encoding already has been
>> determined as ISO-8859-1 (HTTP)
>
>But higher-level protocols can override this (as HTML does, eg).

Well, strictly speaking, an HTTP implementation could return characters
instead of octets for all text/* types since the encoding is clearly
determined, and hence it is too late for an HTML implementation to choose
a different encoding. But I think this is probably too theoretical and
off-topic here.

>Bjoern, why is it not implementable?  Note that currently most browsers _do_ in
>fact implement it...  If there are serious issues with implementing this in
>some circumstances, could you please clearly describe them?

Assume 'Content-Type: text/html'. What is the encoding of, e.g.,

  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
  <title></title>
  <p>...

or

  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
  <meta http-equiv=Content-Type content='text/html;charset=us-ascii'>
  <title></title>
  <p>Björn

or

  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
  <title></title>
  <p>Bj+APY-rn

Note that in auto-detect mode Internet Explorer for Windows considers
the second example us-ascii encoded and renders "Bjvrn", and considers
the third example UTF-7 and renders "Björn". This does not match
Mozilla's or Opera's behaviour, but of the three, Internet Explorer's
behaviour makes the most sense to me.
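
As a rough illustration of where those two renderings could come from,
here is a small Python sketch. It assumes the second example was saved as
ISO-8859-1 and that the us-ascii reading simply drops the high bit of each
octet; both are assumptions made for this illustration, not documented
Internet Explorer behaviour.

```python
# Third example: the octets "Bj+APY-rn" decoded as UTF-7.
utf7_view = b"Bj+APY-rn".decode("utf-7")

# Second example, ASSUMING the file was saved as ISO-8859-1,
# so that "ö" is the single octet 0xF6.
raw = "Björn".encode("iso-8859-1")

# Hypothetical model of the us-ascii reading: keep only the low 7 bits.
stripped = bytes(octet & 0x7F for octet in raw).decode("ascii")

print(utf7_view)
print(stripped)
```

Under these assumptions 0xF6 becomes 0x76, which is "v" in us-ascii.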

Also note that a number of HTML processors try to circumvent these
encoding issues and treat documents as if they were in a us-ascii
compatible encoding; that is, they recognize <, >, &, etc. as markup if
their binary representation is equivalent to that in us-ascii. If all
you want to do is extract all <link rel=stylesheet ...> from an HTML
document, using such a parser makes a lot of sense, and in fact, as far
as I can tell, this is what the W3C MarkUp Validator does to read the
<meta> elements to determine the encoding, and what the W3C CSS
Validator does.
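
A minimal sketch of such a us-ascii compatible scan, working directly on
octets, might look like the following. The function names are hypothetical
and this is not the code of either validator; it only assumes that <, >, =,
quotes and ASCII letters have their us-ascii byte values.

```python
import re

# Match <link ...> and <meta ...> start tags on raw octets.
TAG = re.compile(rb"<(link|meta)\b([^>]*)>", re.IGNORECASE)
# Attribute name, then a double-quoted, single-quoted or unquoted value.
ATTR = re.compile(rb"([a-zA-Z-]+)\s*=\s*(\"[^\"]*\"|'[^']*'|[^\s\"'>]+)")

def scan_tags(data: bytes):
    """Yield (tag_name, attrs) for every <link> and <meta> start tag."""
    for match in TAG.finditer(data):
        attrs = {}
        for name, value in ATTR.findall(match.group(2)):
            attrs[name.lower()] = value.strip(b"\"'")
        yield match.group(1).lower(), attrs

def stylesheet_links(data: bytes):
    """Return the href octets of every <link rel=stylesheet>."""
    return [attrs[b"href"] for tag, attrs in scan_tags(data)
            if tag == b"link"
            and attrs.get(b"rel", b"").lower() == b"stylesheet"
            and b"href" in attrs]

def meta_charset(data: bytes):
    """Return the charset named in a Content-Type <meta>, or None."""
    for tag, attrs in scan_tags(data):
        if tag == b"meta" and attrs.get(b"http-equiv", b"").lower() == b"content-type":
            found = re.search(rb"charset\s*=\s*([^\s;]+)",
                              attrs.get(b"content", b""), re.IGNORECASE)
            if found:
                return found.group(1).decode("ascii")
    return None
```

Octets outside the us-ascii range are never interpreted, so the scan gives
the same result whether the text content is ISO-8859-1, UTF-8 or anything
else that keeps the markup characters at their us-ascii values.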

>> >I also omitted the CHARSET parameter of the LINK element in HTML. Is
>> >that a problem?
>> 
>> No, I strongly support leaving it out.
>
>May I ask why?  (I have no really strong opinion here, but this is a source of
>out-of-band charset information that page/sheet authors _do_ control, unlike
>HTTP headers.)

It all starts with a confusing specification. HTML 4.01 says for charset:

[...]
  This attribute specifies the character encoding of the resource
  designated by the link. Please consult the section on character
  encodings for more details. 
[...]

This text, combined with the general rule that the first encoding
declaration wins, actually implies to me that the charset attribute
*overrides* the HTTP header. If you don't get utterly confused by the
referenced part of the specification you find out that it is not
supposed to do this.
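
On one reading of the priorities in HTML 4.01 section 5.2.2, the attribute
is only a fallback. A sketch of that ordering follows; the function and
parameter names are made up for this illustration, and the order shown is
an interpretation of the specification, not a quotation from it.

```python
from typing import Optional

def pick_charset(http_charset: Optional[str],
                 meta_charset: Optional[str],
                 attr_charset: Optional[str]) -> Optional[str]:
    """Return the first declared charset in the assumed priority order:
    HTTP header, then a declaration inside the resource, then the charset
    attribute on the element that links to the resource."""
    for declared in (http_charset, meta_charset, attr_charset):
        if declared:
            return declared.lower()
    return None
```

With this ordering, pick_charset("ISO-8859-1", None, "utf-8") yields
"iso-8859-1": the attribute does not override the HTTP header.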

Other than that, this is not obvious to authors, and debugging "funny
characters" that might be the result of relying on this attribute is
quite difficult. It is also inconsistent with rules I think more people
actually understand, the rules for application/xml for example. And
after all, the number of authors who both know about the existence of
the attribute and use it where it actually solves a problem is probably
not worth mentioning. Fewer rules make things simpler, hence my
preference.

>> I am thus convinced that rejecting style sheets with encoding errors is
>> 
>>   * much simpler to understand
>>   * much simpler to implement
>>   * more likely to yield in accessible documents
>>   * more secure
>>   * more consistent
>
>Unfortunately, it'll also break a large number of real-world websites (eg the
>Opera site mentioned earlier in this thread).  :(  But other than that, it does
>indeed have many advantages.

Documents that trigger strict mode in recent browsers and reference a
style sheet that contains non-utf-8 sequences and is delivered without
any encoding information are probably way less than 1% of the web... And
among those, if the specification said something to the effect that all
style sheets should have a proper @charset, I could go and spread the
word through the W3C CSS Validator...
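
A check for exactly that class of style sheets could be sketched as
follows. This is an illustration under stated assumptions, not the W3C CSS
Validator's code: the function names are hypothetical, "encoding
information" is taken to mean an HTTP charset parameter, a byte order mark,
or a leading @charset rule, and the @charset pattern is deliberately
simplistic.

```python
import codecs
import re
from typing import Optional

# Simplified: an @charset rule at the very start of the sheet.
CHARSET_RULE = re.compile(rb'^@charset "[^"]*";')
BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

def lacks_encoding_info(sheet: bytes, http_charset: Optional[str] = None) -> bool:
    """True if there is no HTTP charset, no BOM and no leading @charset."""
    if http_charset:
        return False
    if sheet.startswith(BOMS):
        return False
    return CHARSET_RULE.match(sheet) is None

def is_valid_utf8(sheet: bytes) -> bool:
    """True if the octets decode as UTF-8 without error."""
    try:
        sheet.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def should_warn(sheet: bytes, http_charset: Optional[str] = None) -> bool:
    """Flag sheets with non-utf-8 sequences and no encoding information."""
    return lacks_encoding_info(sheet, http_charset) and not is_valid_utf8(sheet)
```

Pure us-ascii sheets and sheets that happen to be valid UTF-8 pass
silently; only the problematic combination is flagged.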
Received on Saturday, 21 February 2004 16:15:26 UTC