- From: Michael Jackson <michael.jackson@bcs.org.uk>
- Date: Mon, 21 Jan 2002 22:58:47 +0000 (GMT)
- To: Terje Bless <link@pobox.com>
- cc: Sascha Claus <SC_LE@gmx.de>, Liam Quinn <liam@htmlhelp.com>, www-validator@w3.org
On Mon, 21 Jan 2002, Terje Bless wrote:

> Michael Jackson <michael.jackson@bcs.org.uk> wrote:
>
> > So Columbia place a
> >
> >    <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
> >
> > declaration in the html head section for the page and have
> > the server simply return a
> >
> >    Content-Type: text/html
> >
> > Whereas, Validator places no form of charset declaration
> > in the html head section for the page but instead has the
> > server return
> >
> >    Content-Type: text/html; charset=utf-8
> >
> > Presumably these are different ways of achieving the same
> > thing.
> >
> > The problem for me is not understanding the relationship
> > (if any) between what `lynx -dump -head` shows for
> > Content-Type, and what is coded in the page's html head
> > section.
> >
> > ...and therefore, why one would want to do it (it being, I
> > presume, achieving the same thing) the way that Validator
> > does it, rather than the way that Columbia does it, even
> > though `the Columbia way' works on old browsers.
>
> No, the whole point is that the Columbia way does _not_ work. If it
> had worked, your Lynx would have ceased to "work". See? :-)
>
> There are a couple of things going on here. One is that both of these
> documents are encoded in "UTF-8" -- a Character Encoding your version
> of Lynx does not support, but one which is somewhat backwards
> compatible with the ISO-Latin-1 encoding (the previous de facto
> standard). Both servers tell you that the documents they return are
> encoded in UTF-8, but only the Validator does so in the proper way
> that Lynx recognizes.
>
> So. When Lynx fetches the Columbia document, it finds no "charset"
> parameter in the HTTP response, and so it guesses ISO-Latin-1 (which
> /is/ a supported encoding). This works due to the somewhat backward
> compatible nature of UTF-8; only the "strange" characters used are
> garbled, while the common alphabetic characters survive.
>
> When Lynx goes to fetch the Validator document, it finds the encoding
> "UTF-8" in the HTTP response. Since this is not a supported encoding,
> it promptly punts and refuses to show the document at all.
>
> The reason for doing it the "Validator way" instead of the "Columbia
> way" is the same as for bringing your keys with you instead of
> locking them inside the car! :-)
>
> The Character Encoding tells the receiving end -- in this case Lynx
> -- how to interpret the raw bytes coming over the wire and how to
> turn them into "characters" that can then be parsed for HTML markup
> and content. The HTTP protocol has "ISO-Latin-1" pre-defined as the
> Character Encoding, but the contents of an HTML document can be
> encoded in any one of a number of weird and wonderful ways. When you
> put the information about which encoding was used into the HTTP
> response -- as the Validator does -- it can be unambiguously detected
> by browsers.
>
> But when you put it /inside/ the very data whose encoding you need to
> know, you end up with a nasty little Catch-22. The browser can't,
> strictly, find the bit that says "<meta ...;charset=UTF-8>" because
> that bit of text exists only as uninterpreted raw bytes. To find it,
> you need to engage in some pretty hairy guesswork.
>
> The reason this appears to "work" most of the time is that US-ASCII
> is a common subset of both ISO-Latin-1 and UTF-8; and in the western
> world, and historically on the Internet, these encodings cover all
> the characters needed.
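That partial compatibility is easy to demonstrate. A minimal sketch in
Python, with an example string of my own choosing (neither server's
actual content):

    # "café" encoded as UTF-8: the e-acute becomes two bytes, 0xC3 0xA9.
    utf8_bytes = "café".encode("utf-8")        # b'caf\xc3\xa9'

    # With no charset parameter on the Content-Type, an old browser
    # guesses ISO-Latin-1: the common alphabetic characters survive and
    # only the "strange" character is garbled into two Latin-1 ones.
    print(utf8_bytes.decode("iso-8859-1"))     # prints: cafÃ©

    # Told charset=utf-8 in the HTTP response, a UTF-8-aware browser
    # decodes the same bytes correctly.
    print(utf8_bytes.decode("utf-8"))          # prints: café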
> In a truly international Internet, with more and more Chinese, Thai,
> Vietnamese, Russian, Latin American, etc. people coming online, these
> assumptions are no longer valid.
>
> Netscape 4.x is the most recently released browser to have problems
> with UTF-8 (that I know of). Lynx 2.6 predates it by quite a bit,
> IIRC.
>
> OTOH, always spitting out UTF-8 is not a good thing to do. At some
> future point, the Validator may start paying attention to the
> Accept-Charset header the browser sends, and return a document in one
> of the encodings requested by the browser. But given the small
> magnitude of the problem and the effort involved in achieving the
> feature, it probably won't be any time very soon.

Terje,

Very many thanks for replying with a detailed response. I shall read up
on this area ("How-To define character encodings") and then re-read
your reply.

I was aware that UTF-8 keeps the 7-bit US-ASCII characters, which the
ISO-8859 sets share, in the same positions. And yes, I had wondered
whether there was a way of testing for a browser's support of UTF-8
and, if not supported, to return the same content declared as
ISO-8859.

I fully take your point about the increasing internationalisation of
the Internet and the questionable wisdom of perpetuating Latin/ISO
defaults.

On this X server, all of the X11 PCF fonts I use are ISO-8859-1. I have
yet to find replacements for some, let alone all, of the X fonts I
have. Likely, I never will; I will always have to live with fonts that
use the older, deprecated method of supporting symbol sets rather than
the exhaustive method employed by UTF.

Validator's use of UTF-8 is laudable, but I wonder whether one should
only really do so if one also makes a concession to the past,
tradition, and historicity by being able to return an ISO-encoded
document.

I'll stop opining now, ahead of learning rather more about the
«subject». ;-)

Thank you for taking the time to explain; I'm grateful.

Best regards,
MJ.
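P.S. The Accept-Charset test we both mention might look something like
this minimal Python sketch -- an illustration of the idea only, with
invented names, not anything the Validator actually does:

    def negotiate(page, accept_charset):
        """Pick an encoding the browser says it accepts via Accept-Charset."""
        tokens = [t.strip().split(";")[0].lower()
                  for t in (accept_charset or "").split(",") if t.strip()]
        # No header, UTF-8 listed, or a wildcard: send UTF-8 as now.
        if not tokens or "utf-8" in tokens or "*" in tokens:
            return page.encode("utf-8"), "text/html; charset=utf-8"
        # Older browsers: fall back to Latin-1, substituting any
        # character that Latin-1 cannot represent.
        return (page.encode("iso-8859-1", errors="replace"),
                "text/html; charset=iso-8859-1")

    # e.g. an old Lynx sending "Accept-Charset: iso-8859-1, us-ascii":
    body, ctype = negotiate("<p>café</p>", "iso-8859-1, us-ascii")
    # -> body == b"<p>caf\xe9</p>", ctype == "text/html; charset=iso-8859-1"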
Received on Monday, 21 January 2002 17:48:51 UTC