Re: Priority for Techniques Dealing with Foreign Language Markup

Comments interspersed - look for CMcCN:: or AJF::

On Wed, 18 Nov 1998, Alan J. Flavell wrote:

> HTML only has one document character set: Unicode.

CMcCN::
Well, maybe. But most documents have a variety of character sets - 
ISO-8859-1, or Shift-JIS, or Windows-1252, or ISO-8859-5, or ISO-2022-KR 
or whatever. Having these marked up would be helpful, but is not 
necessarily sufficient to guess the language.

AJF::
> Well, alright, there are some misguided pseudo-HTML documents that 
> are made that way (I'm thinking most particularly of documents that
> go FONT FACE="Symbol" and then expect their Roman letters to be
> displayed as Greek, but the same has been seen with other alphabets).
> But these are not well-formed WWW documents; surely the WAI does not
> have to devise ways of displaying them?
> 
> HTML documents use quite a number of different encodings (designated
> by that confusingly-named "charset" parameter on the content-type
> header), but every properly-transmitted document has this "charset"
> explicitly stated (except for pre-HTML4.0 documents in iso-8859-1,
> where the charset parameter is optional). 
> 
> Now, I have to admit I am entirely unfamiliar with how a screen reader
> would deal with this, but I firmly feel that whatever it does, it has
> to be based on a proper recognition of the interworking protocols.
> 
> > With a hint from the origin of the document, it gets a little easier.
> 
> Technically, the language of the content and its character encoding
> are two unrelated issues.  (Even if that seems unrealistic and
> impractical, I'd say that trying to take any other view leads to far
> too many anomalies). 
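
To make that separation concrete, here is a minimal sketch using only the
Python standard library. The header values are invented for illustration,
and charset_of is a hypothetical helper name. It reads the "charset"
parameter from a Content-Type value and nothing else; the result names an
encoding and says nothing about the language of the text.

```python
from email.message import Message
from typing import Optional


def charset_of(content_type: str) -> Optional[str]:
    """Return the lower-cased charset parameter of a Content-Type value,
    or None when the parameter is absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


# Hypothetical header values, for illustration only.
print(charset_of("text/html; charset=ISO-8859-1"))
print(charset_of("text/html"))
```

An iso-8859-1 result here would be equally consistent with Icelandic,
Gaelic or Portuguese content, so the language has to come from somewhere
else.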

CMcCN::
(what I said above...)

AJF::
> 
> And knowing that a document is in iso-8859-1 does not help to know how
> to pronounce the document if one does not know whether it is
> Icelandic, Gaelic, Portuguese...
> 
> Nor would it be more than an unpleasant kludge to take a stab at the
> language based on the DNS name.
> 

CMcCN::
Agreed. But it can be a hint, and is important information. More below...

AJF::
> Anyway, a solution for a site which has been constructed without
> explicit content language specifications would seem straightforward:
> simply arrange for the server to send out an HTTP content-language
> header. It needs no editing of the web pages themselves (if documents
> are available in various languages, some action may be needed to
> identify them, e.g. by appropriate choice of filename - see Apache's
> MultiViews for ideas).  The meaning of the HTTP content-language
> header is subtly different from language specifications within an HTML
> document, it's true, but I'd argue that either solution would be
> serviceable, for individual documents that are in a single language.
> 
> I suppose I'll get told that the existing client agents don't do
> anything with this HTTP header, so it doesn't help them with their
> rendering. That would be unfortunate, as this is a bona fide part of
> the protocol.
> 
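
As a rough sketch of what a client could do with such a header, assuming
the value is a comma-separated list of language tags as HTTP defines it:
content_languages and single_language are hypothetical helper names, and
the sample values are made up. The second helper only answers when exactly
one language is declared, which is the single-language case described
above.

```python
from typing import List, Optional


def content_languages(header_value: str) -> List[str]:
    """Split a Content-Language value into lower-cased language tags."""
    return [tag.strip().lower() for tag in header_value.split(",") if tag.strip()]


def single_language(header_value: str) -> Optional[str]:
    """Return the language tag only if exactly one is declared."""
    tags = content_languages(header_value)
    return tags[0] if len(tags) == 1 else None


# Hypothetical header values, for illustration only.
print(content_languages("fr, en"))
print(single_language("is"))
print(single_language("fr, en"))
```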

CMcCN::
No, that's not my big complaint here (although it is in the User Agent 
group). My big complaint is that most authors cannot configure how their 
server handles language negotiation, but they can make the language 
explicit in their pages, through a combination of META HTTP-EQUIV 
elements (I wrote on this topic here a couple of months ago) and 
LANG="xx" attributes. It is important that it be explicit, and either we 
try changing the way ISPs work (which seems unlikely, and not the most 
efficient place to deal with it anyway, since authors know better what 
language they write in) or we change the way authors write, by asking 
them to mark up their language explicitly.
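
A minimal sketch of how a tool might pick up that in-page markup, using
Python's standard html.parser. The sample page is invented, and
LanguageHints and language_hints are hypothetical names; it looks only at
the LANG attribute on the HTML element and at a META HTTP-EQUIV
Content-Language element, and ignores LANG on nested elements.

```python
from html.parser import HTMLParser
from typing import Optional, Tuple


class LanguageHints(HTMLParser):
    """Collect author-supplied language hints from an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.html_lang: Optional[str] = None
        self.meta_language: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        attributes = dict(attrs)
        if tag == "html" and attributes.get("lang"):
            self.html_lang = attributes["lang"]
        elif tag == "meta":
            http_equiv = (attributes.get("http-equiv") or "").lower()
            if http_equiv == "content-language":
                self.meta_language = attributes.get("content")


def language_hints(markup: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (LANG on the HTML element, META Content-Language value)."""
    parser = LanguageHints()
    parser.feed(markup)
    parser.close()
    return parser.html_lang, parser.meta_language


# Hypothetical page, for illustration only.
page = (
    '<HTML LANG="fr"><HEAD>'
    '<META HTTP-EQUIV="Content-Language" CONTENT="fr">'
    "<TITLE>Bonjour</TITLE></HEAD><BODY><P>Bonjour.</P></BODY></HTML>"
)
print(language_hints(page))
```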

Cheers

Charles McCathieNevile

Received on Wednesday, 18 November 1998 13:31:42 UTC