Re: Priority for Techniques Dealing with Foreign Language Markup

On Wed, 18 Nov 1998, Charles McCathieNevile wrote:

> Maybe the problem should be re-expressed into marking up the Character 
> set.

HTML only has one document character set: Unicode.

> I can read a bit of Japanese and Greek, but if the character set is 
> not marked, and the language is not marked, then I simply have to guess 
> at all the character sets I can think of. Then if I make a lucky guess, I 
> have to work out a font.

Excuse me, but this seems to be confusing HTML with some proprietary
word-processing format.

Well, alright, there are some misguided pseudo-HTML documents that 
are made that way (I'm thinking most particularly of documents that
go FONT FACE="Symbol" and then expect their Roman letters to be
displayed as Greek, but the same has been seen with other alphabets).
But these are not well-formed WWW documents; surely the WAI does not
have to devise ways of displaying them?

HTML documents use quite a number of different encodings (designated
by that confusingly-named "charset" parameter on the content-type
header), but every properly-transmitted document has this "charset"
explicitly stated (except for pre-HTML4.0 documents in iso-8859-1,
where the charset parameter is optional).
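
To illustrate the "charset" parameter described above, here is a rough
Python sketch of how a client might read it from a Content-Type header
value and decode the body accordingly. The helper name charset_of is
made up for this example, and falling back to iso-8859-1 when the
parameter is absent is an assumption that mirrors the exception noted
above; this is not a description of what any particular browser or
screen reader does.

```python
from email.message import Message


def charset_of(content_type, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type header value.

    Hypothetical helper: falls back to `default` (assumed iso-8859-1)
    when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


# The same three bytes are Greek letters only if the declared encoding says so.
body = b"\xe1\xe2\xe3"
greek = body.decode(charset_of("text/html; charset=ISO-8859-7"))
latin = body.decode(charset_of("text/html"))
```

The point is that the client never has to guess: the bytes are
interpreted according to the declared encoding, and no font trickery
is involved.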

Now, I have to admit I am entirely unfamiliar with how a screen reader
would deal with this, but I firmly feel that whatever it does, it has
to be based on a proper recognition of the interworking protocols.

> With a hint from the origin of the document, it gets a little easier.

Technically, the language of the content and its character encoding
are two unrelated issues.  (Even if that seems unrealistic and
impractical, I'd say that trying to take any other view leads to far
too many anomalies). 

And knowing that a document is in iso-8859-1 does not help to know how
to pronounce the document if one does not know whether it is
Icelandic, Gaelic, Portuguese...
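
As a small illustration of that point, the Python sketch below encodes
short phrases in three languages as iso-8859-1. The phrases and the
language tags are only examples chosen for this sketch. All three
round-trip through the same encoding, so the encoding label alone
cannot tell a speech synthesiser which pronunciation rules to apply.

```python
# Illustrative phrases only; each uses characters available in iso-8859-1.
samples = {
    "is": "Það er gott",     # Icelandic
    "ga": "Tá sé go maith",  # Irish Gaelic
    "pt": "Não é difícil",   # Portuguese
}


def roundtrips_in_latin1(text):
    """True if the text survives encoding to and decoding from iso-8859-1."""
    return text.encode("iso-8859-1").decode("iso-8859-1") == text


# Every sample fits the one encoding, whatever its language.
results = {lang: roundtrips_in_latin1(text) for lang, text in samples.items()}
```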

Nor would it be more than an unpleasant kludge to take a stab at the
language based on the DNS name.

Anyway, a solution for a site which has been constructed without
explicit content language specifications would seem straightforward:
simply arrange for the server to send out an HTTP content-language
header. It needs no editing of the web pages themselves (if documents
are available in various languages, some action may be needed to
identify them, e.g. by appropriate choice of filename - see Apache's
MultiViews for ideas).  The meaning of the HTTP content-language
header is subtly different from language specifications within an HTML
document, it's true, but I'd argue that either solution would be
serviceable, for individual documents that are in a single language.
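
The server-side idea above might be sketched in Python as follows.
This is only an illustration under stated assumptions: the function
names, the small set of language tags, and the convention of putting
the tag in a filename extension (index.html.fr, loosely in the spirit
of Apache's MultiViews) are all invented for the example, and it is
not how Apache itself is implemented.

```python
import os

# Assumed, deliberately small set of language tags used as filename extensions.
KNOWN_LANGUAGES = {"en", "fr", "de", "el", "ja", "is", "pt", "ga"}


def content_language_for(filename, default=None):
    """Guess the language from a filename extension such as index.html.fr."""
    parts = os.path.basename(filename).lower().split(".")[1:]
    for part in parts:
        if part in KNOWN_LANGUAGES:
            return part
    return default


def response_headers(filename, charset="iso-8859-1", default_language=None):
    """Build HTTP response headers without touching the page itself."""
    headers = {"Content-Type": f"text/html; charset={charset}"}
    language = content_language_for(filename, default_language)
    if language is not None:
        headers["Content-Language"] = language
    return headers
```

A site whose pages are all in one language could skip the filename
convention and simply pass a fixed default_language for every
response.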

I suppose I'll get told that the existing client agents don't do
anything with this HTTP header, so it doesn't help them with their
rendering. That would be unfortunate, as this is a bona fide part of
the protocol.

best regards

Received on Wednesday, 18 November 1998 08:47:57 UTC