Re: [CSS21] response to issue 115 (and 44)

* Bert Bos wrote:
>So, if we assume that we can change the browsers in time, what do we
>want in CSS3? I'd say this:
>
> 1) Trust the HTTP header (or similar out-of-band information in other
>    protocols). If the file then appears to start with a U+FEFF
>    character, ignore it. If there is a @charset at the start or after
>    that U+FEFF, ignore it. Otherwise, start parsing at the first
>    character.

I think this is wrong. Suppose you have (where 0xhh refers to octet hh)

  Content-Type: text/css;charset=MacThai

  0xDB p { color: white }

This is currently equivalent to

  \00FEFF p { color: white }

Your rule turns it into

  p { color: white }

It also turns

  Content-Type: text/css;charset=MacThai

  0xDB@charset 'MacThai'; p { color: white }

into a conforming style sheet. It also leaves undefined what it means to
"appear to start with" a U+FEFF: either there is a U+FEFF or there is
none, and if there is, it is either a Unicode signature or a normal
character.
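
To make the MacThai case concrete, here is a minimal Python sketch. Python
ships no MacThai codec, so decode_macthai_sketch is a hypothetical stand-in
that knows only the one mapping used above (octet 0xDB decodes to U+FEFF)
and passes ASCII through. apply_rule_1 is merely one reading of the
proposed rule 1, not spec text.

```python
# Hypothetical stand-in for a MacThai decoder: it only knows the single
# mapping discussed above (0xDB -> U+FEFF) and passes ASCII through.
def decode_macthai_sketch(octets: bytes) -> str:
    out = []
    for b in octets:
        if b == 0xDB:
            out.append("\ufeff")
        elif b < 0x80:
            out.append(chr(b))
        else:
            raise ValueError(f"octet 0x{b:02X} not covered by this sketch")
    return "".join(out)


def apply_rule_1(text: str) -> str:
    """One reading of proposed rule 1: drop a leading U+FEFF, then drop
    a leading @charset rule, and parse whatever is left."""
    if text.startswith("\ufeff"):
        text = text[1:]
    if text.startswith("@charset"):
        end = text.find(";")
        if end != -1:
            text = text[end + 1:]
    return text


sheet = decode_macthai_sketch(b"\xdb p { color: white }")
# Today the U+FEFF is an ordinary character in front of the selector.
print(repr(sheet))
# Under rule 1 it silently disappears and the selector becomes plain "p".
print(repr(apply_rule_1(sheet)))
```

The same function also accepts the second example, where the U+FEFF is
followed by @charset, which is why that sheet would become conforming.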

It should also be pointed out that (at least for HTTP and MIME)
explicit information in the header is required; otherwise processors
would never read a BOM or @charset, because the encoding has already
been determined as ISO-8859-1 (HTTP) or US-ASCII (MIME). (In fact, a
processor that chooses to adhere to CSS must violate HTTP/MIME...)

> 3) If neither the header nor looking for U+FEFF or @charset yield an
>    encoding, but this style sheet was loaded because a document
>    linked to it (or linked to a style sheet that in turn linked to
>    it, recursively), then use the encoding of the document (or style
>    sheet) that linked to this one.

I am strictly opposed to this rule: it is confusing, it is inconsistent
with other specifications, it is /not implementable/, and it yields
inconsistent results.

>I also omitted the CHARSET parameter of the LINK element in HTML. Is
>that a problem?

No, I strongly support leaving it out.

>The algorithm for (2) would be as follows:

Is this, apart from the different syntax of the xml declaration and
@charset, any different from the rules for application/xml?

>The cases marked with "*" in my tests above thus would not be errors
>(but should still give warnings in the CSS validator).

This is not consistent with the conformance rules for application/xml
documents; it might be possible that there are use cases for different
processor conformance requirements, but I don't see why document
conformance requirements should be any different from those for
application/xml documents.

>But what about CSS 2.1? 
>
>If we use the above in CSS 2.1 also, the question becomes if we will
>have two implementations in the next few months. Because for CSS 2.1
>to make any sense, it should become a Recommendation soon, say before
>October. Otherwise we might as well skip it and wait for CSS3.

IMO, this is not acceptable. CSS 2.1 and CSS 3.0 must use the same rules
for what is, after all, the same thing. Maybe I can live with the same
rules but different requirement levels, say, processors are STRONGLY
RECOMMENDED to do this in CSS 2.1 and will be required to in CSS 3.0,
though I consider such tricks of little use for interoperability; they
just add complexity and confusion, which probably reduces
interoperability at some point.

You did not address what to do if the processor encounters an encoding
error. Simple case, an iso-8859-1 encoded document:

  Content-Type: text/css;charset=us-ascii

  body { background-color: black }
  /* added 01-01-2003, Björn Höhrmann */  
  body { color: white }

In my application I determine the encoding to be us-ascii and tell my
transcoder to transcode the octet stream to my internal character
representation. Depending on my transcoder, the following might happen:

  * the transcoder does not support us-ascii, throws an error and
    provides no access to the content of the style sheet

  * the transcoder detects the invalid octets, throws an error and
    provides no access to the content of the style sheet

  * the transcoder reads the document up to the first U+00F6, so I
    have an incomplete style sheet for which parsing rules are
    undefined. If I infer a closing */ the user gets black text
    on black background, i.e., the document is inaccessible

  * the transcoder ignores the invalid sequences and reads only
    the low seven bits of each octet; it would report my name as
    "Bjvrn Hvhrmann". Not a problem here, but certainly a problem
    in a case like

      .björn { color: white }
      .bj\0000f6rn { background-color: black }

    where this again yields an inaccessible document.

  * the transcoder manages to replace the invalid sequences by a
    replacement character; it would report my name as e.g.
    "Bj?rn H?hrmann", same situation as in the previous case

  * the transcoder assumes some commonly used compatible superset
    of the given encoding, say Windows-1252, and reports my name
    correctly as "Björn Höhrmann". It might also assume Shift_JIS,
    in which case my name could become Bj<U+E49A>n H<U+E490>rmann
    due to error recovery in the Shift_JIS decoder. Note that the
    transcoder consumes two octets here; if this happens with /*ü*/
    you are again in trouble, as you obviously are if this happens
    outside a comment.

  * the transcoder just skips the invalid sequences and returns my
    name as "Bjrn Hhrmann". Same problem again.

  * the transcoder uses complex heuristics to determine the most
    likely real encoding; this might work, or it might fail. In case
    of failure one of the previously mentioned problems might arise.

  * the transcoder does something else
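
Several of the outcomes listed above can be reproduced with Python's
built-in codec error handlers. This is only a sketch of the behaviours,
not a model of any particular transcoder. The function names are made up
for illustration, and shift_jis_user_defined assumes the Windows-31J
style mapping of lead octets 0xF0-0xF9 onto the private use area
starting at U+E000.

```python
# The name from the comment line, as an iso-8859-1 author would send it.
OCTETS = "Björn Höhrmann".encode("iso-8859-1")


def strict(octets: bytes) -> str:
    # Raises UnicodeDecodeError at the first octet above 0x7F.
    return octets.decode("ascii", errors="strict")


def up_to_first_error(octets: bytes) -> str:
    # Keeps only the valid prefix, as a transcoder that stops early would.
    try:
        return octets.decode("ascii")
    except UnicodeDecodeError as exc:
        return octets[:exc.start].decode("ascii")


def low_seven_bits(octets: bytes) -> str:
    # Masks off the high bit of every octet (0xF6 becomes 0x76, "v").
    return bytes(b & 0x7F for b in octets).decode("ascii")


def replace(octets: bytes) -> str:
    # Python substitutes U+FFFD; other transcoders may use "?".
    return octets.decode("ascii", errors="replace")


def superset(octets: bytes) -> str:
    # Assumes Windows-1252 instead of the declared us-ascii.
    return octets.decode("cp1252")


def skip(octets: bytes) -> str:
    # Silently drops every invalid octet.
    return octets.decode("ascii", errors="ignore")


def shift_jis_user_defined(lead: int, trail: int) -> str:
    # Assumption: lead octets 0xF0-0xF9 map to the private use area,
    # 188 cells per lead octet, starting at U+E000.
    offset = trail - (0x40 if trail < 0x7F else 0x41)
    return chr(0xE000 + (lead - 0xF0) * 188 + offset)


for recover in (up_to_first_error, low_seven_bits, replace, superset, skip):
    print(f"{recover.__name__:18} {recover(OCTETS)!r}")

# The octet pair 0xF6 "r" is swallowed as one private use character.
print(f"U+{ord(shift_jis_user_defined(0xF6, ord('r'))):04X}")
```

Only strict() gives the caller a chance to notice that something went
wrong; every other function returns a string that looks perfectly valid.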

As an implementer you might not have a choice between those
possibilities; you are already lucky if the transcoder reports an error,
which would allow you to inform the user and prevent damage. Nor can one
assume that an implementer is enough of an I18N expert to make a choice
or to handle this more properly than his tools do. Moreover, improper
handling of encoding issues has been the cause of a number of security
vulnerabilities. Simple example:

  p::before { content: url(file<error>://localhost/etc/passwd) }

One part of the application considers the <error> to make this something
other than a reference to a local file; a different part, due to error
recovery, does not. The user agent loads the local file and makes it
available to some script, which then steals my passwords. Of course,
this is a very, very bad example: the application design is bad and
everything else is bad and broken and worse. I know. My point is just
that if I as an implementer have a choice between a tolerant transcoder
and one with lots of strict checking options and security flags, I will
most certainly turn them all on.
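
A deliberately simplistic Python sketch of that disagreement follows.
The octet 0xFF standing in for <error>, and both function names, are
assumptions made up for illustration; no real user agent is claimed to
work this way.

```python
# Hypothetical octets of the url() argument; 0xFF stands in for <error>.
RAW = b"file\xff://localhost/etc/passwd"


def security_check_allows(octets: bytes) -> bool:
    # This component keeps the error visible as U+FFFD, so the scheme no
    # longer reads "file" and the reference is not flagged as local.
    url = octets.decode("ascii", errors="replace")
    return not url.startswith("file:")


def loader_sees(octets: bytes) -> str:
    # This component recovers by silently dropping the invalid octet.
    return octets.decode("ascii", errors="ignore")


if security_check_allows(RAW):
    # The check passed, yet the loader ends up with a local file URL.
    print("loading", loader_sees(RAW))
```

With a strict decoder in both places, neither component would ever see
a usable string, and the mismatch could not arise.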

I am thus convinced that rejecting style sheets with encoding errors is

  * much simpler to understand
  * much simpler to implement
  * more likely to yield accessible documents
  * more secure
  * more consistent

and I want at the very least to be explicitly allowed to do that in my
applications. I am open to proposals for well-defined error recovery
for UTF-8 like Mikko suggested, if this works and helps in a transition
phase, but I somewhat doubt that this will happen, get implemented and
work well enough to make it worth the effort. To implicitly or
explicitly require any kind of error recovery from encoding errors is
nonetheless all wrong to me.
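
For comparison, rejecting outright takes only a few lines. This is a
sketch under the assumption that the charset has already been determined
from the header; StyleSheetRejected and decode_or_reject are hypothetical
names, not part of any specification or library.

```python
class StyleSheetRejected(Exception):
    """Raised when a style sheet cannot be decoded without errors."""


def decode_or_reject(octets: bytes, charset: str) -> str:
    try:
        return octets.decode(charset, errors="strict")
    except LookupError as exc:
        # The transcoder does not know the declared encoding at all.
        raise StyleSheetRejected(f"unsupported encoding {charset!r}") from exc
    except UnicodeDecodeError as exc:
        # Report where the first invalid octet is, then give up.
        raise StyleSheetRejected(
            f"invalid octet 0x{octets[exc.start]:02X} at offset {exc.start}"
        ) from exc


sheet = "body { color: white } /* Björn */".encode("iso-8859-1")
try:
    decode_or_reject(sheet, "us-ascii")
except StyleSheetRejected as err:
    # The application can now tell the user instead of guessing.
    print("rejected:", err)
```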

Received on Saturday, 21 February 2004 01:04:32 UTC