- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Sat, 21 Feb 2004 22:15:30 +0100
- To: Boris Zbarsky <bzbarsky@MIT.EDU>
- Cc: "WWW Style" <www-style@w3.org>
* Boris Zbarsky wrote:
>> It should also be pointed out, that (at least for HTTP and MIME)
>> explicit information in the header is required, otherwise processors
>> would never read a BOM or @charset because the encoding already has been
>> determined as ISO-8859-1 (HTTP)
>
>But higher-level protocols can override this (as HTML does, eg).

Well, strictly speaking, an HTTP implementation could return characters
instead of octets for all text/* types since the encoding is clearly
determined, and hence it is too late for an HTML implementation to
choose a different encoding. But I think this is probably too
theoretical and off-topic here.

>Bjoern, why is it not implementable?  Note that currently most browsers _do_ in
>fact implement it...  If there are serious issues with implementing this in
>some circumstances, could you please clearly describe them?

Assume 'Content-Type: text/html', what is the encoding of e.g.

  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
  <title></title>
  <p>...

or

  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
  <meta http-equiv=Content-Type content='text/html;charset=us-ascii'>
  <title></title>
  <p>Björn

or

  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
  <title></title>
  <p>Bj+APY-rn

Note that in auto-detect mode Internet Explorer for Windows considers
the second example us-ascii encoded and renders "Bjvrn", and considers
the third example UTF-7 and renders "Björn". This does not match
Mozilla's or Opera's behaviour, but of the three, Internet Explorer's
behaviour makes the most sense to me.

Also note that a number of HTML processors try to circumvent these
encoding issues and treat documents as if they were in a us-ascii
compatible encoding, that is, they recognize <, >, &, etc. as markup if
their binary representation is equivalent to that in us-ascii.
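
As a rough illustration of such an us-ascii compatible pre-scan, here is a minimal Python sketch applied to the three example documents above. The helper name sniff_meta_charset, the 1024-byte limit and the regular expression are assumptions made for this illustration; they are not taken from any browser or validator.

```python
import re
from typing import Optional

# Hypothetical pattern: finds charset=... inside a <meta ...> tag in raw
# bytes, assuming only that <, >, = and letters have their us-ascii values.
_META_CHARSET = re.compile(
    rb"<meta\b[^>]*?\bcharset\s*=\s*[\"']?([A-Za-z0-9._:-]+)",
    re.IGNORECASE,
)


def sniff_meta_charset(data: bytes, limit: int = 1024) -> Optional[str]:
    """Return the charset label declared in a <meta> tag, or None."""
    match = _META_CHARSET.search(data[:limit])
    return match.group(1).decode("ascii").lower() if match else None


DOCTYPE = b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">\n'
doc1 = DOCTYPE + b"<title></title>\n<p>..."
doc2 = (
    DOCTYPE
    + b"<meta http-equiv=Content-Type content='text/html;charset=us-ascii'>\n"
    + "<title></title>\n<p>Bj\u00f6rn".encode("iso-8859-1")
)
doc3 = DOCTYPE + b"<title></title>\n<p>Bj+APY-rn"

# Only the second document declares anything.
labels = [sniff_meta_charset(d) for d in (doc1, doc2, doc3)]

# The declared label of doc2 contradicts its content: 0xF6 is not us-ascii.
has_non_ascii = any(byte > 0x7F for byte in doc2)

# The tail of doc3 is pure ASCII, yet it reads as "Björn" if taken as UTF-7.
utf7_reading = b"Bj+APY-rn".decode("utf-7")

print(labels, has_non_ascii, utf7_reading)
```

The scan finds a label only in the second document, and that label is contradicted by the bytes that follow it; for the first and third documents the encoding has to come from somewhere else.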

If all you want to do is extract all <link rel=stylesheet ...> from an
HTML document, using such a parser makes a lot of sense. In fact, as far
as I can tell, this is what the W3C MarkUp Validator does to read the
<meta> elements to determine the encoding, and what the W3C CSS
Validator does.

>> >I also omitted the CHARSET parameter of the LINK element in HTML. Is
>> >that a problem?
>>
>> No, I strongly support leaving it out.
>
>May I ask why?  (I have no really strong opinion here, but this is a source of
>out-of-band charset information that page/sheet authors _do_ control, unlike
>HTTP headers.)

It all starts with a confusing specification. HTML 4.01 says for
charset:

[...]
  This attribute specifies the character encoding of the resource
  designated by the link. Please consult the section on character
  encodings for more details.
[...]

This text, combined with the general rule that the first encoding
declaration wins, actually implies to me that the charset attribute
*overrides* the HTTP header. If you don't get utterly confused by the
referenced part of the specification, you find out that it is not
supposed to do this.

Other than that, this is not obvious to authors; debugging "funny
characters" that might be the result of relying on this attribute is
quite difficult. It is also inconsistent with rules I think more people
actually understand, the rules for application/xml for example. And
after all, the number of authors who both know about the existence of
the attribute and use it where it actually solves a problem is probably
not worth mentioning. Fewer rules make things simpler, hence my
preference.

>> I am thus convinced that rejecting style sheets with encoding errors is
>>
>>   * much simpler to understand
>>   * much simpler to implement
>>   * more likely to yield in accessible documents
>>   * more secure
>>   * more consistent
>
>Unfortunately, it'll also break a large number of real-world websites (eg the
>Opera site mentioned earlier in this thread).
>:(  But other than that, it does
>indeed have many advantages.

Documents that trigger strict mode in recent browsers and reference a
style sheet that contains non-utf-8 sequences and is delivered without
any encoding information are probably way less than 1% of the web...
And among those, if the specification said something to the effect that
all style sheets should have a proper @charset, I could go and spread
the word through the W3C CSS Validator...
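
To make the "reject on encoding errors" idea concrete, here is a minimal Python sketch. It is only one possible reading of the proposal: the function name decode_style_sheet, the precedence order (HTTP charset, then BOM, then @charset) and the UTF-8 fallback for sheets without any encoding information are all assumptions made for this illustration, not text from any specification.

```python
import codecs
import re
from typing import Optional

# Hypothetical pattern for an @charset rule at the very start of the sheet.
_CHARSET_RULE = re.compile(rb'^@charset "([A-Za-z0-9._:-]+)";')


def decode_style_sheet(data: bytes, http_charset: Optional[str] = None) -> str:
    """Decode a style sheet strictly; raise ValueError instead of guessing."""
    if http_charset:
        label = http_charset                 # explicit HTTP information
    elif data.startswith(codecs.BOM_UTF8):
        label = "utf-8-sig"                  # UTF-8 byte order mark
    else:
        rule = _CHARSET_RULE.match(data)
        # Assumed fallback: a sheet with no encoding information is UTF-8.
        label = rule.group(1).decode("ascii") if rule else "utf-8"
    try:
        return data.decode(label, errors="strict")
    except (UnicodeDecodeError, LookupError) as exc:
        raise ValueError(f"style sheet rejected: {exc}") from exc


# A sheet that declares its encoding is accepted.
declared = b'@charset "iso-8859-1";\np:before { content: "\xf6" }'
print(decode_style_sheet(declared))

# The same non-utf-8 byte without any declaration is rejected.
undeclared = b'p:before { content: "\xf6" }'
try:
    decode_style_sheet(undeclared)
except ValueError as err:
    print(err)
```

Under these assumptions the only sheets that break are the ones described above: non-utf-8 content with no declaration at all, which a validator warning about a missing @charset would catch.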
Received on Saturday, 21 February 2004 16:15:26 UTC