Re: http://validator.w3.org/feedback.html

From: Terje Bless <link@pobox.com>
Date: Mon, 21 Jan 2002 18:20:19 +0100
To: Michael Jackson <michael.jackson@bcs.org.uk>
cc: Sascha Claus <SC_LE@gmx.de>, Liam Quinn <liam@htmlhelp.com>, www-validator@w3.org
Message-ID: <20020121182020-d01050007-652a923b-1012-010c@>
Michael Jackson <michael.jackson@bcs.org.uk> <fi57@bcs.org.uk> wrote:

> So Columbia place a 
><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
> declaration in the html head section for the page and have
> the server simply return a
>Content-Type: text/html
> Whereas, Validator places no form of charset declaration
> in the html head section for the page but instead has the
> server return with
>Content-Type: text/html; charset=utf-8
> Presumably these are different ways of achieving the same
> thing.
> The problem for me is not understanding the relationship
> (if any) between what `lynx-dump -head` shows for Content-
> Type, and what is coded in the page's html head section.
> ...and therefore, why one would want to do it (it being I
> presume achieving the same thing) the way that Validator
> does it, rather than the way that Columbia does it even
> though `the Columbia way' works on old browsers.

No, the whole point is that the Columbia way does _not_ work. If it had
worked, your Lynx would have ceased to "work". See? :-)

There are a couple of things going on here. One is that both of these
documents are encoded in "UTF-8" -- a Character Encoding your version of
Lynx does not support -- but which is somewhat backwards compatible with
ISO-Latin-1 encoding (the previous de facto standard). Both servers tell
you that the documents they return are encoded in UTF-8, but only the
Validator does so in the proper way that Lynx recognizes.

So. When Lynx fetches the Columbia document, it finds no "charset"
parameter in the HTTP response, and so it guesses ISO-Latin-1 (which /is/ a
supported encoding). This works due to the somewhat backward compatible
nature of UTF-8; only the "strange" characters used are garbled, the common
alphabetic characters survive.
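
To make that concrete, here is a minimal Python sketch of what happens
when UTF-8 bytes are read as ISO-Latin-1. The function name and the sample
strings are made up for illustration; this is not how Lynx is implemented.

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as if they were ISO-8859-1."""
    raw = text.encode("utf-8")
    # ISO-8859-1 maps every byte to exactly one character, so this never
    # fails; it just produces the wrong characters for non-ASCII input.
    return raw.decode("iso-8859-1")


# Plain ASCII text comes through unchanged.
print(misread_as_latin1("Validator"))

# "caf\u00e9" (e with acute accent) becomes two garbage characters at the
# end, because the single accented letter is two bytes in UTF-8.
print(misread_as_latin1("caf\u00e9"))
```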

When Lynx goes to fetch the Validator document, it finds the encoding
"UTF-8" in the HTTP response. Since this is not a supported encoding, it
promptly punts and refuses to show the document at all.

The reason for doing it the "Validator way" instead of the "Columbia way"
is the same as for bringing your keys with you instead of locking them
inside the car! :-)

The Character Encoding tells the receiving end -- in this case Lynx -- how
to interpret the raw bytes coming over the wire and how to turn them into
"characters" that can then be parsed for HTML markup and content. The HTTP
protocol has "ISO-Latin-1" pre-defined as the default Character Encoding,
but the contents of an HTML document can be encoded in any number of weird
and wonderful ways. When you put the information about which encoding was
used into the HTTP response -- as the Validator does -- it can be
unambiguously detected by browsers.
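
As a rough sketch of how a client can read that information, the
following Python pulls the "charset" parameter out of a Content-Type
header value and falls back to ISO-Latin-1 when none is given. The
function name is hypothetical and the parsing is deliberately simplified;
a real client would follow the full HTTP header grammar.

```python
def charset_from_content_type(header_value: str,
                              default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type value, or the default."""
    # Everything after the first ';' is a list of name=value parameters.
    for part in header_value.split(";")[1:]:
        name, sep, value = part.partition("=")
        if sep and name.strip().lower() == "charset":
            return value.strip().strip('"').lower()
    return default


# The two responses discussed above:
print(charset_from_content_type("text/html; charset=utf-8"))  # Validator
print(charset_from_content_type("text/html"))                 # Columbia
```

With the Validator's header the client knows the encoding before it reads
a single byte of the document; with Columbia's it is left with the default.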

But when you put it /inside/ the very data whose encoding you need to know,
you end up with a nasty little Catch-22. The browser can't, strictly, find
the bit that says "<meta ...;charset=UTF-8>" because that bit of text
exists only as uninterpreted raw bytes. To find it, you need to engage in
some pretty hairy guesswork.
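
One form that guesswork can take is sketched below in Python: scan the
first chunk of raw bytes for something that looks like a META charset
declaration, on the assumption that the markup is stored in ASCII-
compatible bytes. The function name, the regular expression, and the 1024
byte limit are all my own assumptions for illustration, not what any
particular browser does.

```python
import re

# Looks for "<meta ... charset=NAME" directly in the undecoded bytes.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def sniff_meta_charset(raw: bytes, limit: int = 1024):
    """Guess the charset from a META tag in raw bytes; None if not found."""
    match = META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None


html = '<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">'

# Works when the bytes happen to be ASCII-compatible...
print(sniff_meta_charset(html.encode("utf-8")))

# ...and fails when they are not: in UTF-16 every character takes two
# bytes, so the pattern never matches and the declaration is invisible.
print(sniff_meta_charset(html.encode("utf-16")))
```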

The reason this appears to "work" most of the time is that US-ASCII is a
subset of both ISO-Latin-1 and UTF-8, so all three agree on the bytes used
for markup and plain English text; and in the western world and
historically on the Internet, these encodings cover all the characters
needed. In a truly international Internet, with more and more Chinese,
Thai, Vietnamese, Russian, Latin American, etc. people coming online, these
assumptions are no longer valid.

Netscape 4.x is the most recently released browser to have problems with
UTF-8 (that I know of). Lynx 2.6 predates it by quite a bit IIRC.

OTOH, always spitting out UTF-8 is not a good thing to do. At some future
point, the Validator may start paying attention to the Accept-Charset
header the browser sends, and return a document in one of the encodings
requested by the browser. But given the small magnitude of the problem and
the effort involved in implementing the feature, it probably won't be any
time very soon.
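
For the curious, the core of such a feature could look roughly like the
Python below. This is only a sketch under my own assumptions: the function
name and the list of available encodings are hypothetical, the parsing is
simplified, and it ignores the special treatment HTTP/1.1 gives to
ISO-8859-1 when it is not mentioned in the header. It is not how the
Validator works today.

```python
def choose_charset(accept_charset: str,
                   available=("utf-8", "iso-8859-1")):
    """Pick the available charset the client prefers most, or None."""
    prefs = {}
    for item in accept_charset.split(","):
        name, _, param = item.partition(";")
        name = name.strip().lower()
        if not name:
            continue
        quality = 1.0  # no q parameter means full preference
        key, _, value = param.partition("=")
        if key.strip().lower() == "q":
            try:
                quality = float(value.strip())
            except ValueError:
                quality = 0.0  # treat a malformed q value as "not wanted"
        prefs[name] = quality

    best, best_q = None, 0.0
    for candidate in available:
        # An explicit entry wins over the "*" wildcard.
        q = prefs.get(candidate, prefs.get("*", 0.0))
        if q > best_q:
            best, best_q = candidate, q
    return best


print(choose_charset("iso-8859-1, utf-8;q=0.5"))
print(choose_charset("koi8-r"))
```

When nothing acceptable is available the function returns None, and the
server would have to decide whether to send its default anyway or refuse.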
