- From: Ernest Cline <ernestcline@mindspring.com>
- Date: Sat, 21 Feb 2004 20:13:55 -0500
- To: "Bjoern Hoehrmann" <derhoermi@gmx.net>
- Cc: "WWW Style" <www-style@w3.org>
> [Original Message]
> From: Bjoern Hoehrmann <derhoermi@gmx.net>
>
> * Boris Zbarsky wrote:
>
> >Bjoern, why is it not implementable? Note that currently most browsers _do_ in
> >fact implement it... If there are serious issues with implementing this in
> >some circumstances, could you please clearly describe them?
>
> Assume 'Content-Type: text/html', what is the encoding of e.g.
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
> <title></title>
> <p>...
>
> or
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
> <meta http-equiv=Content-Type content='text/html;charset=us-ascii'>
> <title></title>
> <p>Björn
>
> or
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
> <title></title>
> <p>Bj+APY-rn
>
> Note that in auto-detect mode Internet Explorer for Windows considers
> the second example us-ascii encoded and renders "Bjvrn" and considers
> the third example as UTF-7 and renders "Björn"; this does not match
> Mozilla's or Opera's behaviour, but among those, Internet Explorer's
> behaviour makes the most sense to me.

Not to me. That third example contains perfectly valid ISO-8859-1
characters and can be parsed as such. In the absence of any other
information it should be considered ISO-8859-1.

While I consider it generally a good thing to attempt auto-detection,
there is nothing here that would get in the way of that. Now if, for
example, it detected a byte in the range x80 - x9F, which would not be
valid for ISO-8859-1 in the context of HTML 4.01, I wouldn't mind it
trying to detect the correct character encoding, as clearly it wouldn't
be the default. How it would detect that would obviously be
implementation dependent, but I wouldn't mind it trying. Trying to
auto-detect when the result is valid ISO-8859-1 (or whatever the default
document character encoding is for that type of document) strikes me as
arrogant, especially since I can't imagine why anyone would want to
intentionally encode HTML as UTF-7 outside of e-mail, and in that case I
would expect it to be explicitly mentioned in the MIME headers if it
were being used.

> Also note that a number of HTML processors try to circumvent these
> encoding issues and treat documents as us-ascii compatible encoded,
> that is, they recognize <, >, &, etc. as markup if their binary
> representation is equivalent to that in us-ascii. If all you want to do
> is extract all <link rel=stylesheet ...> from a HTML document, using
> such a parser makes a lot of sense, and in fact, as far as I can tell,
> this is what the W3C MarkUp Validator does to read the <meta> elements
> to determine the encoding and what the W3C CSS Validator does.

Well, most of the character sets out there are supersets of ISO-646-INV,
which is all one needs to be able to parse HTML. (One might not be able
to make sense of the attribute values or the element content, but that
is of secondary importance; one doesn't need to know whether byte x23 is
'#' or not when parsing.) This doesn't hold for CSS, since '@', '\',
'{', and '}' are not part of the invariant set, but the encodings that
are supersets of ISO-646-INV yet lack those characters are mostly
obsolete 7-bit character sets. I don't know about you, but I'm not going
to worry that one can't represent CSS in most of the 7-bit ISO-646
national variants.
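A minimal sketch of the fallback order I have in mind (honour an
explicit <meta> charset declaration if present, sniff further only when
a byte in x80 - x9F turns up, otherwise assume the ISO-8859-1 default)
might look like the following; the function name and the regular
expression are made up purely for illustration and are not any
browser's actual code:

    import re

    # Rough sketch, not any browser's actual code: an explicit <meta>
    # charset wins, bytes in x80-x9F justify sniffing, everything else
    # falls back to ISO-8859-1.
    META_CHARSET = re.compile(
        rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9._-]+)",
        re.IGNORECASE)

    def guess_encoding(octets: bytes) -> str:
        m = META_CHARSET.search(octets)
        if m:
            # 1. An explicit declaration in the document wins.
            return m.group(1).decode('ascii')
        if any(0x80 <= b <= 0x9F for b in octets):
            # 2. These bytes are not valid ISO-8859-1 character data in
            #    HTML 4.01, so sniffing another encoding is justified.
            return 'unknown, sniff further'
        # 3. Otherwise the default document encoding applies.
        return 'iso-8859-1'

    # The third example above is pure ASCII, so it stays ISO-8859-1 here
    # rather than being treated as UTF-7.
    print(guess_encoding(b'<title></title>\n<p>Bj+APY-rn'))  # iso-8859-1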
Received on Saturday, 21 February 2004 20:13:58 UTC