Re: Priority for Techniques Dealing with Foreign Language Markup from Alan J. Flavell on 1998-11-18 (w3c-wai-gl@w3.org from October to December 1998)

From: Alan J. Flavell <flavell@a5.ph.gla.ac.uk>
Date: Wed, 18 Nov 1998 19:06:10 +0000 (GMT)
To: Charles McCathieNevile <charlesn@srl.rmit.EDU.AU>
cc: "'GL - WAI Guidelines WG'" <w3c-wai-gl@w3.org>
Message-ID: <Pine.OSF.3.96.981118183614.14808G-100000@a5.ph.gla.ac.uk>

On Thu, 19 Nov 1998, Charles McCathieNevile wrote:

> On Wed, 18 Nov 1998, Alan J. Flavell wrote:
> 
> > HTML only has one document character set: Unicode.
> 
> CMcCN::
> Well, maybe. But most documents have a variety of character sets - 
> ISO-8859-1, or Shift-JIS, or Windows-1252, or ISO-8859-5, or EUC-2022-KR 
> or whatever. 

I'm sorry, the point I was trying to make is that these are encodings,
in the language of HTML4.0.  I'm sorry if this appears to be
unreasonably pedantic, but there seems to me to be much confusion
about this area of i18n, and I think it's worthwhile to strive for
clarity when discussing it.  5.2 in the HTML4.0 spec has some useful
remarks: http://www.w3.org/TR/REC-html40/charset.html#h-5.2

Please excuse me if this is thought excessive, but I think it may be
useful to quote a paragraph from 5.2.1, as follows:

--quote begins--

 Authoring tools (e.g., text editors) may encode HTML documents in
 the character encoding of their choice, and the choice largely
 depends on the conventions used by the system software. These tools
 may employ any convenient encoding that covers most of the
 characters contained in the document, provided the encoding is
 correctly labeled. Occasional characters that fall outside this
 encoding may still be represented by character references. These
 always refer to the document character set, not the character
 encoding.

--quote ends--

In simple cases it may be that the document doesn't utilise any
characters that are outside of the repertoire of the encoding
("charset") that it uses:  but it's perfectly valid for the document
to contain some &entity; or &#bignumber; representations that lie
outside of the repertoire that's defined by the document's encoding. 
To take a simple example, a document that's in a Cyrillic encoding,
let's say koi8-r, can still validly include French or German employing
&eacute;  &uuml; and so forth, while a document that's in iso-8859-1
can validly contain &#bignumber; references that represent Cyrillic
characters.  Which of the two representations to choose for, say, a 
bi-lingual document would be dictated by practical convenience: either
is a valid document according to RFC2070 or HTML4.0.

Provided that the reader is using a client agent that supports RFC2070
to this extent, the document will be displayed correctly. 
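
To make the koi8-r / iso-8859-1 example concrete, here is a minimal
sketch in Python. The sample text, the variable names and the helper
decode_document() are my own illustrative assumptions and not anything
taken from RFC2070 or HTML4.0. It stores the same bilingual text in two
different encodings and shows that, once the bytes are decoded and the
character references resolved against the document character set, both
yield the same characters.

```python
import html

TEXT = "Привет, café"  # Cyrillic plus one accented Latin letter

# Stored in koi8-r: Cyrillic is encoded directly, while the accented
# Latin letter (absent from koi8-r) is written as a named entity.
koi8_doc = "Привет, caf&eacute;".encode("koi8-r")

# Stored in iso-8859-1: the accented letter is encoded directly, and each
# Cyrillic letter becomes a numeric reference to its Unicode code point.
latin1_doc = TEXT.encode("iso-8859-1", errors="xmlcharrefreplace")


def decode_document(raw: bytes, charset: str) -> str:
    """Decode the bytes per the labelled encoding, then resolve references.

    Character references are resolved against Unicode regardless of
    which encoding the bytes arrived in.
    """
    return html.unescape(raw.decode(charset))


if __name__ == "__main__":
    print(latin1_doc)
    print(decode_document(koi8_doc, "koi8-r"))
    print(decode_document(latin1_doc, "iso-8859-1"))
```

The point of the sketch is only that the numeric references index the
document character set, so they mean the same thing whichever encoding
carries them.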

> AJF::
> > Anyway, a solution for a site which has been constructed without
> > explicit content language specifications would seem straightforward:
> > simply arrange for the server to send out an HTTP content-language
> > header. It needs no editing of the web pages themselves 

> CMcCN::
> My big complaint is that most authors do not have the ability to 
> set up how their server deals with language negotiation, but they do have 
> the ability through a combination of META HTTP-EQUIV elements (I have 
> written on this topic here a couple of months ago) and LANG="xx" 
> statements, to make it explicit in their pages.

I'm sorry for not making my reasoning clear.  I was referring
implicitly to an argument elsewhere on this thread, that where some
large site had already been created without language attributes in its
markup, it might be impractical to correct that.

Well, processing a whole collection of files to do nothing more than
change <HTML> into <HTML LANG="value"> for some fixed "value" is
hardly rocket science, but I was suggesting an alternative solution
that could be applied without editing the files, if that were
preferred. 
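
As a rough illustration of the file-editing route, a sketch along the
following lines would do. The function name add_lang() and the commented
usage loop are hypothetical, and a regular expression is only adequate
here on the assumption that each file has a single, ordinary <HTML>
start tag.

```python
import re

# Matches the <HTML ...> start tag in any letter case.
_HTML_TAG = re.compile(r"<html\b([^>]*)>", re.IGNORECASE)


def add_lang(markup: str, lang: str) -> str:
    """Add LANG="lang" to the first <HTML> tag unless it already has one."""

    def repl(match: re.Match) -> str:
        attrs = match.group(1)
        if re.search(r"\blang\s*=", attrs, re.IGNORECASE):
            return match.group(0)  # leave an existing LANG untouched
        tag_name = match.group(0)[1:5]  # keep the author's letter case
        return f'<{tag_name} LANG="{lang}"{attrs}>'

    return _HTML_TAG.sub(repl, markup, count=1)


# Hypothetical usage over a directory tree (the encoding is an assumption):
#
#   from pathlib import Path
#   for path in Path("htdocs").rglob("*.html"):
#       text = path.read_text(encoding="iso-8859-1")
#       path.write_text(add_lang(text, "en"), encoding="iso-8859-1")
```

Files that already carry a LANG attribute on <HTML> are left unchanged,
so the script can be rerun safely.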

> It is important that it 
> be explicit, and either we try changing the way ISPs work (which seems 
> unlikely and not the most efficient place to deal with it anyway - 

I find this very sad: the HTTP protocol has many valuable features,
and it's a tragedy that it's being crippled in this way.  And the most
popular server, Apache, has no difficulty putting these matters into
the hands of the document owners via their .htaccess files.  But you
could well be right that it's impractical to expect this part of the
WWW to work as designed. 
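
For what it's worth, the .htaccess approach might look something like
the fragment below. This is a sketch only: it assumes a present-day
Apache with mod_mime enabled, that the server administrator permits
such overrides (AllowOverride FileInfo), and that everything under the
directory is in one language. The tag "en" is just an example.

```apache
# .htaccess placed in the directory holding the documents.
# Assumes mod_mime is loaded and AllowOverride FileInfo is in effect.

# Send "Content-Language: en" for files in this directory tree that
# have no more specific language mapping of their own.
DefaultLanguage en
```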

> authors know better what language they write)

Yes, that much is true enough, I have no dispute with that.

I'm sorry, I'm rather conscious that this has addressed issues that
are relevant to i18n in general, and not particularly specific to
accessibility.  However, they are issues that can have much more
critical consequences in an accessibility context, so I thought it
was worth trying to clarify the issues.

all the best
Received on Wednesday, 18 November 1998 14:07:00 UTC