- From: Terje Bless <link@pobox.com>
- Date: Mon, 21 Jan 2002 18:20:19 +0100
- To: Michael Jackson <michael.jackson@bcs.org.uk>
- cc: Sascha Claus <SC_LE@gmx.de>, Liam Quinn <liam@htmlhelp.com>, www-validator@w3.org
Michael Jackson <michael.jackson@bcs.org.uk> wrote:

> So Columbia place a
>
>     <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
>
> declaration in the html head section for the page and have
> the server simply return a
>
>     Content-Type: text/html
>
> Whereas, Validator places no form of charset declaration
> in the html head section for the page but instead has the
> server return with
>
>     Content-Type: text/html; charset=utf-8
>
> Presumably these are different ways of achieving the same
> thing.
>
> The problem for me is not understanding the relationship
> (if any) between what `lynx -dump -head` shows for
> Content-Type, and what is coded in the page's html head
> section.
>
> ...and therefore, why one would want to do it (it being, I
> presume, achieving the same thing) the way that Validator
> does it, rather than the way that Columbia does it, even
> though `the Columbia way' works on old browsers.

No, the whole point is that the Columbia way does _not_ work. If it had
worked, your Lynx would have ceased to "work". See? :-)

There are a couple of things going on here. One is that both of these
documents are encoded in "UTF-8" -- a Character Encoding your version of
Lynx does not support -- which is somewhat backwards compatible with the
ISO-Latin-1 encoding (the previous de facto standard). Both servers tell
you that the documents they return are encoded in UTF-8, but only the
Validator does so in the proper way, the way Lynx recognizes.

So. When Lynx fetches the Columbia document, it finds no "charset"
parameter in the HTTP response, and so it guesses ISO-Latin-1 (which /is/
a supported encoding). This works due to the somewhat backwards-compatible
nature of UTF-8; only the "strange" characters used are garbled, while the
common alphabetic characters survive. When Lynx goes to fetch the
Validator document, it finds the encoding "UTF-8" in the HTTP response.
Since this is not a supported encoding, it promptly punts and refuses to
show the document at all.

The reason for doing it the "Validator way" instead of the "Columbia way"
is the same as for bringing your keys with you instead of locking them
inside the car! :-)

The Character Encoding tells the receiving end -- in this case Lynx -- how
to interpret the raw bytes coming over the wire and how to turn them into
"characters" that can then be parsed for HTML markup and content. The HTTP
protocol has "ISO-Latin-1" pre-defined as the Character Encoding, but the
contents of an HTML document can be encoded in any one of a number of
weird and wonderful ways.

When you put the information about which encoding was used into the HTTP
response -- as the Validator does -- it can be unambiguously detected by
browsers. But when you put it /inside/ the very data whose encoding you
need to know, you end up with a nasty little Catch-22. The browser can't,
strictly, find the bit that says "<meta ...;charset=UTF-8>" because that
bit of text exists only as uninterpreted raw bytes. To find it, you need
to engage in some pretty hairy guesswork.

The reason this appears to "work" most of the time is that US-ASCII is a
common subset of both ISO-Latin-1 and UTF-8 (for the plain ASCII range,
all three agree byte for byte); and in the western world, and historically
on the Internet, these encodings cover all the characters needed. In a
truly international Internet, with more and more Chinese, Thai,
Vietnamese, Russian, Latin American, etc. people coming online, these
assumptions are no longer valid.
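To make that last point concrete, here is a rough Python sketch -- my own
illustration, not anything the Validator actually does -- of what happens
to the same UTF-8 bytes when the receiving end guesses wrong about the
encoding. The ASCII part survives; everything else is garbled:

    # The same UTF-8 bytes, read with the right and the wrong charset label.
    text = "naïve café"               # contains non-ASCII characters
    raw = text.encode("utf-8")        # the raw bytes that go over the wire

    print(raw.decode("utf-8"))        # correct label      -> "naïve café"
    print(raw.decode("iso-8859-1"))   # Latin-1 guess      -> "naÃ¯ve cafÃ©"

Lynx 2.6 never even gets that far with the Validator's pages, of course:
it sees a charset it does not know and simply gives up.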
Netscape 4.x is the most recently released browser to have problems with
UTF-8 (that I know of). Lynx 2.6 predates it by quite a bit, IIRC. OTOH,
always spitting out UTF-8 is not a good thing to do. At some future point,
the Validator may start paying attention to the Accept-Charset header the
browser sends, and return a document in one of the encodings the browser
requested. But given the small magnitude of the problem, and the effort
involved in achieving the feature, it probably won't happen any time soon.
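If it ever does happen, the logic would presumably look something like the
following sketch (purely hypothetical, not the Validator's actual code; it
ignores q-values): pick the first encoding the browser asks for that we
can actually produce, and fall back to UTF-8 otherwise.

    # Hypothetical Accept-Charset negotiation, ignoring q-values.
    SUPPORTED = ["utf-8", "iso-8859-1", "us-ascii"]

    def pick_charset(accept_charset_header):
        # e.g. "iso-8859-1, utf-8;q=0.7, *;q=0.1"
        for item in accept_charset_header.split(","):
            charset = item.split(";")[0].strip().lower()
            if charset in SUPPORTED:
                return charset
        return "utf-8"   # fall back to what we send today

    print(pick_charset("iso-8859-1, utf-8;q=0.7"))   # -> "iso-8859-1"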
Received on Monday, 21 January 2002 12:20:32 UTC