Re: [WebAIM] Lang attribute and "old" latin from Jukka K. Korpela on 2008-04-25 (w3c-wai-ig@w3.org from April to June 2008)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Fri, 25 Apr 2008 09:11:52 +0300
To: "WebAIM Discussion List" <webaim-forum@list.webaim.org>, <gawds_discuss@yahoogroups.com>, <w3c-wai-ig@w3.org>
Message-ID: <007701c8a69b$49112060$0500000a@DOCENDO>
John Foliot - Stanford Online Accessibility Program wrote:

> As far as I know, current screen reading technology only supports a
> limited number of languages.

Rather limited, I'm afraid. Moreover, support for language switching on 
the basis of language markup (lang or xml:lang attributes) is even more 
limited.

In practical terms, using language markup at the top level (<html> or 
<body> element) is a good move: it takes a very small effort, and it 
helps some people. (But then it should be _correct_. It often isn't, so 
e.g. Google does not use the information.)
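
To illustrate the top-level case, here is a minimal sketch of a page that declares its main language once, on the <html> element. It assumes the page is written in English; the title and the paragraph text are placeholders of mine, not taken from any real page.

```html
<!DOCTYPE html>
<!-- The lang value must match the language the text is actually written in. -->
<html lang="en">
<head>
<meta charset="utf-8">
<title>Example page</title>
</head>
<body>
<p>The main text of this page is in English.</p>
</body>
</html>
```

In a document served as XML (XHTML), the xml:lang attribute would carry the same value.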

Using language markup at other markup levels, e.g. for individual 
paragraphs or even words, is rather pointless, sad to say. There isn't 
much support worth mentioning. (I use it, but mostly as a matter of 
principle, or habit, and not very consistently. Many W3C pages, 
including pages that declare that it should be used, don't use it. Most 
web pages don't even try, so what motivation is there for 
software developers to support it?)

That's the big picture. In details, there's a lot that could be said, 
especially about the problems, but this doesn't seem to be an 
interesting topic to most people. However, mostly for "academic" 
interest, I'll comment on your specific issues:

> I am in the process of reviewing a number of web documents that
> feature, in part, a fair bit of "old Latin" (circa 13th century -
> it's a cool academic project).

I took "old" Latin as referring to pre-classic Latin... Anyway, there's 
no useful standardized way to distinguish between different forms of 
Latin in language codes. You could use country codes, e.g. "la-GB" to 
refer to Latin as used in the United Kingdom, but this would be 
anachronistic for 13th century language and also useless.

> At any rate, W3C guidance states
> "Clearly identify changes in the natural language of a document's
> text and any text equivalents (e.g., captions)."

I'm afraid nobody, including the W3C, takes that seriously. It's just 
too much trouble with little if any tangible benefit. It's based on 
theoretical ideas (largely poorly analyzed ones) about the 
_possible_ usefulness of language markup, rather than on actual experience.

> *AND* the ISO code
> for Latin is either "LA" (ISO 639-1) or "LAT" (ISO 639-2) so clearly
> this *CAN* be done.

The technically correct language code for use in markup is "la", with 
lowercase as the recommended spelling. HTML and XML specifications refer 
to specifications that mandate the use of two-letter codes for languages 
that have one.
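
As a rough sketch of what such markup would look like, assuming an English page with a short Latin passage embedded in it (the Latin words below are placeholder sample text of mine, not from the documents under review):

```html
<!-- An inline Latin phrase inside English text: two-letter code, lowercase. -->
<p>The deed begins with the words
<span lang="la">in nomine Domini</span>
and then lists the witnesses.</p>

<!-- A longer Latin passage can carry the attribute on the block element. -->
<blockquote lang="la">
<p>Sciant presentes et futuri.</p>
</blockquote>
```

Everything inside an element with a lang attribute inherits that language unless an inner element declares another one.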

> As well, wikipedia suggests that "Screen readers without Unicode
> support will read a character outside Latin-1 as a question mark,

Character support is a different issue and should not depend on language 
markup, and mostly doesn't.

Generally, in special software like screen readers or specialized 
browsers, we should expect character support to be more restricted than 
in common modern browsers. Even Latin-1 isn't as safe as in "normal" 
browsing. For example, what would a screen reader do upon encountering a 
special character like "¶"? Would it recognize it as having a special 
meaning (paragraph separator) and make a pause? Hardly. It probably 
spells it out. This might mean saying "pilcrow sign", perhaps 
independently of the language being used (since character names aren't 
widely localized - most characters don't even _have_ a name in most 
languages), which might be complete gibberish even to people who 
understand normal English.

> The question is, is there any real advantage gained by adding this
> information (lang="lat") to the content?

Very little, if any. But if used, it should be lang="la".

> I am at a loss to explain any real value
> in doing it to the client as at the end of the day I cannot myself
> find a "real justification" that would improve the accessibility of
> the document.

The best explanation that I could use (if someone offered to pay me for 
adding such markup and I needed to soup up "internal" and "moral" 
motivation) is the following (and it's lame, so this tells a lot):

If a user opens your HTML page in a word processor like Microsoft Word, 
it will use the language markup, and this can be relevant when spelling 
checks are "on", i.e. words classified as misspelled are highlighted. 
Declaring Latin words as Latin prevents the program from applying 
English spelling rules to them. (The copy of Word I just tested seems to 
be Latin-ignorant. That is, it recognizes the words as being in Latin but 
does not flag anything as misspelled and does not even hyphenate Latin 
words. But even this is probably better than treating them as English or 
some other language.)

On some browsers, like Firefox, the user can right-click on a word and 
get information about its language. Sometimes it is useful to know that 
a word is Latin. (But what are the odds that a user knows about such 
functionality?)

Style sheets, either page or user style sheets, could be used to style 
words in a particular language as different from others, using a 
selector like [lang="la"] or :lang(la). However, this does not work e.g. 
on IE 6, which does not recognize such selectors.
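
The two selectors are not equivalent, which a small sketch can show. It assumes you simply want Latin words set in italics; the styling and the sample sentence are illustrative choices of mine.

```html
<style>
/* Attribute selector: matches only elements that themselves carry
   the attribute with exactly the value "la". */
[lang="la"] { font-style: italic; }

/* Language pseudo-class: matches any element whose language is Latin,
   including descendants that inherit it and subtags such as "la-GB". */
:lang(la) { font-style: italic; }
</style>

<p>English text with a <span lang="la">verbum <b>Latinum</b></span> in it.</p>
```

Here the <b> element matches :lang(la), because it inherits the language from the <span>, but it does not match [lang="la"], because it has no lang attribute of its own. In practice one of the two rules would be enough; they are shown together only for comparison.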

Moreover, some day some browsers or other software could make real use 
of the markup.

Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/
Received on Friday, 25 April 2008 06:12:44 UTC