
Re: http://validator.w3.org/feedback.html

From: Michael Jackson <michael.jackson@bcs.org.uk>
Date: Mon, 21 Jan 2002 22:58:47 +0000 (GMT)
To: Terje Bless <link@pobox.com>
cc: Sascha Claus <SC_LE@gmx.de>, Liam Quinn <liam@htmlhelp.com>, www-validator@w3.org
Message-ID: <Pine.GSU.4.03.10201212233440.12476-100000@angel2.cityscape.co.uk>
On Mon, 21 Jan 2002, Terje Bless wrote:

> Michael Jackson <michael.jackson@bcs.org.uk> wrote:
> 
> > So Columbia place a 
> >
> ><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
> >
> > declaration in the html head section for the page and have
> > the server simply return a
> >
> >Content-Type: text/html
> >
> > Whereas, Validator places no form of charset declaration
> > in the html head section for the page but instead has the
> > server return with
> >
> >Content-Type: text/html; charset=utf-8
> >
> > Presumably these are different ways of achieving the same
> > thing.
> >
> > The problem for me is not understanding the relationship
> > (if any) between what `lynx-dump -head` shows for Content-
> > Type, and what is coded in the page's html head section.
> >
> > ...and therefore, why one would want to do it (it being I
> > presume achieving the same thing) the way that Validator
> > does it, rather than the way that Columbia does it even
> > though `the Columbia way' works on old browsers.
> 
> No, the whole point is that the Columbia way does _not_ work. If it had
> worked, your Lynx would have ceased to "work". See? :-)
> 
> There are a couple of things going on here. One is that both of these
> documents are encoded in "UTF-8" -- a Character Encoding your version of
> Lynx does not support -- but which is somewhat backwards compatible with
> ISO-Latin-1 encoding (the previous de facto standard). Both servers tell
> you that the documents they return are encoded in UTF-8, but only the
> Validator does so in the proper way that Lynx recognizes.
> 
> So. When Lynx fetches the Columbia document, it finds no "charset"
> parameter in the HTTP response, and so it guesses ISO-Latin-1 (which /is/ a
> supported encoding). This works due to the somewhat backward-compatible
> nature of UTF-8; only the "strange" characters used are garbled, the common
> alphabetic characters survive.
> 
> When Lynx goes to fetch the Validator document, it finds the encoding
> "UTF-8" in the HTTP response. Since this is not a supported encoding, it
> promptly punts and refuses to show the document at all.
> 
> 
> The reason for doing it the "Validator way" instead of the "Columbia way"
> is the same as for bringing your keys with you instead of locking them
> inside the car! :-)
> 
> The Character Encoding tells the receiving end -- in this case Lynx -- how
> to interpret the raw bytes coming over the wire and how to turn them into
> "characters" that can then be parsed for HTML markup and content. The HTTP
> protocol has "ISO-Latin-1" pre-defined as the Character Encoding, but the
> contents of an HTML document can be encoded in any one of a number of weird
> and wonderful ways. When you put the information about which encoding was
> used into the HTTP response -- as the Validator does -- it can be
> unambiguously detected by browsers.
> 
> But when you put it /inside/ the very data whose encoding you need to know,
> you end up with a nasty little Catch-22. The browser can't, strictly, find
> the bit that says "<meta ...;charset=UTF-8>" because that bit of text
> exists only as uninterpreted raw bytes. To find it, you need to engage in
> some pretty hairy guesswork.
> 
> The reason this appears to "work" most of the time is that US-ASCII is a
> proper subset of both ISO-Latin-1 and UTF-8; and in the western world and
> historically on the Internet, these encodings cover all the characters
> needed. In a truly international Internet, with more and more Chinese,
> Thai, Vietnamese, Russian, Latin American, etc. people coming online,
> these assumptions are no longer valid.
> 
> 
> Netscape 4.x is the most recently released browser to have problems with
> UTF-8 (that I know of). Lynx 2.6 predates it by quite a bit IIRC.
> 
> OTOH, always spitting out UTF-8 is not a good thing to do. At some future
> point, the Validator may start paying attention to the Accept-Charset
> header the browser sends, and return a document in one of the encodings
> requested by the browser. But given the small magnitude of the problem and
> the effort involved in achieving the feature, it probably won't be any
> time very soon.
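
[Editor's sketch, not part of the original exchange: the partial
compatibility Terje describes is easy to demonstrate. In this Python
fragment (the sample string is illustrative), the ASCII portion survives
a wrong Latin-1 guess while the one non-ASCII character is garbled:]

```python
text = "café"                  # 'é' lies outside US-ASCII
raw = text.encode("utf-8")     # the bytes on the wire: b'caf\xc3\xa9'

# A UTF-8-unaware browser that guesses ISO-Latin-1 maps each byte
# to exactly one character:
guessed = raw.decode("iso-8859-1")

print(guessed)                 # 'cafÃ©' -- ASCII letters intact, 'é' mangled
assert guessed[:3] == "caf"    # the common alphabetic characters survive
assert guessed != text         # only the "strange" character is garbled
```

[This is why the Columbia page merely looks odd in an old Lynx instead
of failing outright.]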


	Terje,

	Very many thanks for replying with a detailed response.  I shall
	read up on this area ("How-To define character encodings") and
	then re-read your reply.
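
	[Editor's sketch: the Catch-22 Terje describes -- needing a
	decoded document before one can read its inline charset
	declaration -- looks roughly like this in Python. The document
	and the two-pass approach are illustrative, not what any
	particular browser actually does:]

```python
import re

# Hypothetical response body whose only charset declaration is the
# inline <meta> element, as on the Columbia page:
body = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=utf-8"></head>'
        '<body>café</body></html>').encode("utf-8")

# Pass 1 (the guesswork): to find the <meta> element at all, the raw
# bytes must first be decoded with some ASCII-compatible guess.
first_pass = body.decode("iso-8859-1")
match = re.search(r'charset=([-\w]+)', first_pass)
charset = match.group(1) if match else "iso-8859-1"   # HTTP's default

# Pass 2: re-decode the very same bytes with the declared encoding.
text = body.decode(charset)

assert charset == "utf-8"
assert "café" in text
```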

	I was aware that UTF-8 keeps the US-ASCII characters in the
	same positions as that 7-bit set.  And yes, I had wondered
	whether there was a way of testing for a browser's support of
	UTF-8 and, if not supported, to return the same content declared
	as ISO-8859.  I fully take your point about increasing
	internationalisation and the questionable wisdom of perpetuating
	Latin/ISO defaults.
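
	[Editor's sketch: the Accept-Charset negotiation Terje mentions
	could, in outline, be as simple as the following Python
	fragment. The function name is invented, and q-values and the
	full HTTP rules are deliberately ignored:]

```python
def negotiate_charset(accept_charset, supported=("utf-8", "iso-8859-1")):
    """Pick the first charset the client lists that the server supports.

    A much-simplified reading of the Accept-Charset header:
    q-values and wildcards are ignored.
    """
    for token in accept_charset.split(","):
        charset = token.split(";")[0].strip().lower()
        if charset in supported:
            return charset
    return "utf-8"   # fall back to the server's native encoding

# An old Lynx might advertise only the encodings it can display:
assert negotiate_charset("iso-8859-1, us-ascii") == "iso-8859-1"
assert negotiate_charset("utf-8, iso-8859-1") == "utf-8"
```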

	On this X server, all of the X11 PCF fonts I use are ISO-8859-1.
	I have yet to find replacements for some, let alone all, of the
	X fonts I have.  Likely, I never will; I will always have to
	live with fonts that use the older, deprecated method of
	supporting symbol sets rather than the exhaustive method employed
	by UTF.  The Validator's use of UTF-8 is laudable, but I wonder
	whether one should really only do so if one also makes a
	concession to the past, tradition, and historicity, by being
	able to return an ISO-encoded document.
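
	[Editor's sketch: such a concession would amount to transcoding
	on the way out, and in Python the degradation is a one-liner.
	The sample string is illustrative:]

```python
text = "café \u263a"   # U+263A (a smiley) has no ISO-8859-1 equivalent

# Characters with Latin-1 code points survive; everything else degrades
# to a replacement character:
legacy_bytes = text.encode("iso-8859-1", errors="replace")

assert legacy_bytes == b"caf\xe9 ?"   # 'é' kept as 0xE9, the smiley lost
```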

	I'll stop opining now ahead of learning rather more about the
	«subject». ;-)

	Thank you for taking the time to explain; I'm grateful.

	Best regards,
	MJ.
Received on Monday, 21 January 2002 17:48:51 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Wednesday, 25 April 2012 12:14:00 GMT