- From: Alan J. Flavell <flavell@a5.ph.gla.ac.uk>
- Date: Wed, 18 Nov 1998 19:06:10 +0000 (GMT)
- To: Charles McCathieNevile <charlesn@srl.rmit.EDU.AU>
- cc: "'GL - WAI Guidelines WG'" <w3c-wai-gl@w3.org>
On Thu, 19 Nov 1998, Charles McCathieNevile wrote:

> On Wed, 18 Nov 1998, Alan J. Flavell wrote:
>
> > HTML only has one document character set: Unicode.
>
> CMcCN::
> Well, maybe. But most documents have a variety of character sets -
> ISO-8859-1, or Shift-JIS, or Windows-1252, or ISO-8859-5, or EUC-2022-KR
> or whatever.

I'm sorry, the point I was trying to make is that these are encodings, in the language of HTML4.0.

I'm sorry if this appears to be unreasonably pedantic, but there seems to me to be much confusion about this area of i18n, and I think it's worthwhile to strive for clarity when discussing it.

5.2 in the HTML4.0 spec has some useful remarks:

http://www.w3.org/TR/REC-html40/charset.html#h-5.2

Please excuse me if this is thought excessive, but I think it may be useful to quote a paragraph from 5.2.1, as follows

--quote begins--
Authoring tools (e.g., text editors) may encode HTML documents in the character encoding of their choice, and the choice largely depends on the conventions used by the system software. These tools may employ any convenient encoding that covers most of the characters contained in the document, provided the encoding is correctly labeled. Occasional characters that fall outside this encoding may still be represented by character references. These always refer to the document character set, not the character encoding.
--quote ends--

In simple cases it may be that the document doesn't utilise any characters that are outside of the repertoire of the encoding ("charset") that it uses: but it's perfectly valid for the document to contain some &entity; or &#bignumber; representations that lie outside of the repertoire that's defined by the document's encoding.

To take a simple example, a document that's in a Cyrillic encoding, let's say koi8-r, can still validly include French or German employing &eacute; &uuml; and so forth, while a document that's in iso-8859-1 can validly contain &#bignumber; references that represent Cyrillic characters.

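As a rough illustration of the koi8-r / iso-8859-1 example (a minimal sketch only, using Python's standard codecs; the sample text and the function names are made up for the purpose, and a real document would of course carry markup and a correct charset label as well): characters outside the chosen encoding are written as numeric character references, and those references are resolved against the document character set whichever encoding was used.

```python
import html

# Hypothetical sample text mixing Cyrillic with French and German letters.
text = "Привет, café und Tür"


def encode_document(text: str, encoding: str) -> bytes:
    """Encode text, writing any character the encoding cannot
    represent as a numeric character reference (&#NNN;)."""
    return text.encode(encoding, errors="xmlcharrefreplace")


def decode_document(data: bytes, encoding: str) -> str:
    """Decode the bytes, then resolve character references.
    The references are code points in the document character set
    (Unicode), independent of the byte encoding."""
    return html.unescape(data.decode(encoding))


# Cyrillic stored directly, the accented Latin letters as references.
koi8 = encode_document(text, "koi8_r")

# Accented Latin letters stored directly, Cyrillic as references.
latin1 = encode_document(text, "iso-8859-1")

# Both byte streams represent the same document.
same = decode_document(koi8, "koi8_r") == decode_document(latin1, "iso-8859-1")
```

The two byte streams differ completely, yet a client that resolves references against the document character set recovers the same text from either.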
Which of the two representations to choose for, say, a bi-lingual document would be dictated by practical convenience: either is a valid document according to RFC2070 or HTML4.0. Provided that the reader is using a client agent that supports RFC2070 to this extent, the document will be displayed correctly.

> AJF::
> > Anyway, a solution for a site which has been constructed without
> > explicit content language specifications would seem straightforward:
> > simply arrange for the server to send out an HTTP content-language
> > header. It needs no editing of the web pages themselves

> CMcCN::
> My big complaint is that most authors do not have the ability to
> set up how their server deals with language negotiation, but they do have
> the ability through a combination of META HTTP-EQUIV elements (I have
> written on this topic here a couple of months ago) and LANG="xx"
> statements, to make it explicit in their pages.

I'm sorry for not making my reasoning clear. I was referring implicitly to an argument elsewhere on this thread, that where some large site had already been created without language attributes in its markup, then it might be impractical to correct that. Well, processing a whole collection of files to do nothing more than change <HTML> into <HTML LANG="value"> for some fixed "value" is hardly rocket science, but I was suggesting an alternative solution that could be applied without editing the files, if that were preferred.

> It is important that it
> be explicit, and either we try changing the way ISPs work (which seems
> unlikely and not the most efficient place to deal with it anyway -

I find this very sad: the HTTP protocol has many valuable features, it's a tragedy that it's being crippled in this way. And the most popular server, Apache, has no difficulty putting these matters into the hands of the document owners via their .htaccess files. But you could well be right that it's impractical to expect this part of the WWW to work as designed.

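For what it's worth, the batch edit mentioned above (changing <HTML> into <HTML LANG="value">) might be sketched along the following lines. This is only an illustration: the function name add_lang is invented here, the regular expression assumes a plain opening <HTML> tag with no ">" inside an attribute value, and a real conversion would want a proper SGML/HTML parser and a backup of the files.

```python
import re

# Matches the opening <HTML ...> tag, capturing the tag name as written
# and any attributes that follow it.
_HTML_TAG = re.compile(r"<(html)\b([^>]*)>", re.IGNORECASE)

# Matches a LANG attribute, but not one that is merely the tail of a
# longer name such as xml:lang or data-lang.
_LANG_ATTR = re.compile(r"(?<![\w:-])lang\s*=", re.IGNORECASE)


def add_lang(markup: str, lang: str) -> str:
    """Return markup with LANG="lang" added to the first <HTML> tag,
    unless that tag already carries a LANG attribute."""

    def _patch(match: re.Match) -> str:
        name, attrs = match.group(1), match.group(2)
        if _LANG_ATTR.search(attrs):
            return match.group(0)  # already labelled; leave untouched
        return f'<{name} LANG="{lang}"{attrs}>'

    # count=1: only the first <HTML> tag in the document is considered.
    return _HTML_TAG.sub(_patch, markup, count=1)


patched = add_lang("<HTML><HEAD><TITLE>Exemple</TITLE></HEAD>", "fr")
```

Applying it to a whole site is then a matter of reading each file, calling add_lang with the fixed value, and writing the result back.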
> authors know better what language they write)

Yes, that much is true enough, I have no dispute with that.

I'm sorry, I'm rather conscious that this has addressed issues that are relevant to i18n in general, and not particularly specific to accessibility. However, they are issues that can have much more critical consequences in an accessibility context, so I thought it was worth trying to clarify the issues.

all the best
Received on Wednesday, 18 November 1998 14:07:00 UTC