Re: [CSS21] response to issue 115 (and 44)

> [Original Message]
> From: Bjoern Hoehrmann <derhoermi@gmx.net>
>
> * Boris Zbarsky wrote:
>
> >Bjoern, why is it not implementable?  Note that currently most browsers
> >_do_ in fact implement it...  If there are serious issues with
> >implementing this in some circumstances, could you please clearly
> >describe them?
>
> Assume 'Content-Type: text/html', what is the encoding of e.g.
>
>   <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
>   <title></title>
>   <p>...
>
> or
>
>   <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
>   <meta http-equiv=Content-Type content='text/html;charset=us-ascii'>
>   <title></title>
>   <p>Björn
>
> or
>
>   <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
>   <title></title>
>   <p>Bj+APY-rn
>
> Note that in auto-detect mode Internet Explorer for Windows considers
> the second example us-ascii encoded and renders "Bjvrn", and considers
> the third example UTF-7 and renders "Björn". This does not match
> Mozilla's or Opera's behaviour, but Internet Explorer's behaviour makes
> the most sense to me among those.

Not to me.  That third example contains perfectly valid ISO-8859-1
characters and can be parsed as such.  In the absence of any other
information it should be considered ISO-8859-1.  While I consider it
generally a good thing to attempt auto-detection, there is nothing
here that would get in the way of that.  Now if, for example, a parser
detected a byte in the range x80 - x9F, which is not valid for ISO-8859-1
in the context of HTML 4.01, I wouldn't mind it trying to detect the
correct character encoding, as clearly the default doesn't apply.
How it would detect that is obviously implementation dependent, but
I wouldn't mind it trying.  Trying to auto-detect when the result is
valid ISO-8859-1 (or whatever the default document character encoding
is for that type of document) strikes me as arrogant, especially since
I can't imagine why anyone would want to intentionally encode HTML as
UTF-7 outside of e-mail, and in that case I would expect the encoding
to be explicitly declared in the MIME headers.
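To make the rule concrete, here is a minimal sketch (in Python, with made-up function names) of the fallback I have in mind: accept the default encoding unless the document actually contains bytes the default cannot account for, and only then attempt detection.

```python
# Hypothetical sketch of the rule described above: treat the document
# as the default (ISO-8859-1) unless it contains bytes in the C1 range
# x80 - x9F, which are not valid character data for ISO-8859-1 HTML
# 4.01; only in that case fall back to auto-detection.

def needs_autodetect(data: bytes) -> bool:
    """True if any byte falls in the C1 range 0x80-0x9F."""
    return any(0x80 <= b <= 0x9F for b in data)

def decode_html(data: bytes) -> str:
    if needs_autodetect(data):
        # How detection works here is implementation-dependent; this
        # sketch just tries UTF-8 before falling through to Latin-1.
        try:
            return data.decode('utf-8')
        except UnicodeDecodeError:
            pass
    return data.decode('iso-8859-1')
```

Note that under this rule the UTF-7 example never triggers detection at all, since UTF-7 output is pure ASCII, which is exactly the point: it decodes fine as ISO-8859-1.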

> Also note that a number of HTML processors try to circumvent these
> encoding issues and treat documents as us-ascii compatible encoded,
> that is, they recognize <, >, &, etc. as markup if their binary
> representation is equivalent to that in us-ascii. If all you want to do
> is extract all <link rel=stylesheet ...> from a HTML document, using
> such a parser makes a lot of sense, and in fact, as far as I can tell,
> this is what the W3C MarkUp Validator does to read the <meta> elements
> to determine the encoding and what the W3C CSS Validator does.

Well, most of the character sets out there are supersets of ISO-646-INV,
which is all one needs to be able to parse HTML.  (One might not be able
to make sense of the attribute values or the element content, but that
is of secondary importance.  One doesn't need to know whether byte x23
is '#' or not when parsing.)
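Such an ASCII-compatible prescan can be sketched in a few lines; this is only an illustration of the technique, not the validators' actual code, and the regex and function name are my own invention:

```python
import re

# Scan the raw bytes for a charset declaration, relying only on byte
# values that mean the same thing in any us-ascii-superset encoding.
# Deliberately naive: no comment handling, no quoting edge cases.
META_RE = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_-]+)',
    re.IGNORECASE)

def sniff_charset(data: bytes):
    m = META_RE.search(data)
    return m.group(1).decode('ascii').lower() if m else None
```

This works precisely because '<', '>', '=', and the quote characters are in the invariant set, so their byte values are stable across the encodings in question.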

This doesn't hold for CSS, since '@', '\', '{', and '}' are not part of
the invariant set; but the encodings that are supersets of ISO-646-INV
and don't use those characters are mostly obsolete 7-bit character sets.
I don't know about you, but I'm not going to worry that one can't
represent CSS in most of the 7-bit ISO-646 national variants.
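The split between the two languages is easy to check; a small sketch, listing the twelve ISO 646 code positions that national variants may redefine:

```python
# The ISO 646 "variant" positions, i.e. the graphic characters national
# variants may redefine; everything else in ASCII is ISO-646-INV.
VARIANT = set('#$@[\\]^`{|}~')

# Every delimiter HTML parsing needs is in the invariant set...
HTML_SIGNIFICANT = ['<', '>', '&', '=', '"', "'", '/']
# ...but several delimiters CSS parsing needs are not.
CSS_SIGNIFICANT = ['@', '\\', '{', '}']

assert not any(ch in VARIANT for ch in HTML_SIGNIFICANT)
assert all(ch in VARIANT for ch in CSS_SIGNIFICANT)
```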

Received on Saturday, 21 February 2004 20:13:58 UTC