Re: [Techniques] Draft General Technique for GL 3.1 L2 SC3 from Michele Diodati on 2004-12-30 (w3c-wai-gl@w3.org from October to December 2004)

From: Michele Diodati <michele.diodati@gmail.com>
Date: Thu, 30 Dec 2004 18:04:45 +0100
To: w3c-wai-gl@w3.org
Message-ID: <2e1e87c041230090411d12a83@mail.gmail.com>
Hi Gregg.

> 1)  I agree there are pronunciation problems with AT. If we can find good
> practical methods for addressing this we will. It currently isn't required
> at level 2. if you have ideas how to do this in a practical way - we would
> love to hear them.

I think the main thing you as WAI should do is to press the producers
of assistive technologies hard until they develop ever more
intelligent applications. Unfortunately web developers are in most
cases only... web developers. They are simply and totally unaware of
the many lexical and syntactic ambiguities in the texts they publish
on the Web every day. However willing they may be to respect the WCAG
requirements about changes in natural language, they don't master
this subject well enough to disambiguate all the numerous instances
that can occur, for example, in Italian technical or advertising
documents. Moreover, they frequently don't have enough time to read
carefully, and consider from a linguistic viewpoint, the content they
are assembling for the Web.

Having said that, the requirement "The meanings and pronunciations of
all words in the content can be programmatically located" [1] appears
really inapplicable, at least as regards pronunciations. Italian web
sites (I'm sorry if I refer all the time to the Italian situation, but
my experience is about that) are literally replete with English words
and phrases, scattered all over the textual content. You can find in
them an _endless_ series of "home page" (instead of "prima pagina" or
"pagina principale"), "download" (instead of "scarica"), "account",
"login", "best viewed", "compliance", "effort", "password", "trial",
"trailer", "bottom up", "top page", "signature", "conference call",
"call center", "brainstorming", "briefing", "jogging", "fitness",
"future", and hundreds and hundreds of other English words and
phrases, ordinarily used in web pages that have Italian as their main
language.

Many of these thousands of English words are "standard extensions" of
the Italian language: you can find them in the latest Italian
dictionaries. Many more of them are not. Almost all of them are
mispronounced when read by assistive technologies (for example, Jaws
4.50 reads "download" inside Italian web pages in a quite
incomprehensible manner). Do you really think Italian web developers,
while trying to comply with the WCAG 2.0 L2 SCs for GL 3.1 (or with
WCAG 1.0 checkpoint 4.1), will open _millions_ of existing web pages,
searching for _billions_ of instances of English words and phrases, to
make their pronunciation accessible? It is a dramatically
unmanageable burden!

This is a cultural issue before it is an accessibility issue. Italian
authors' habit of using a great many English words and phrases within
texts mainly written in Italian is a sort of fashion, probably due to
a certain cultural subjection to everything that comes from the US.
This habit is very harmful from the accessibility viewpoint: this
sometimes ridiculous soup of English/Italian words lowers the
comprehensibility of content for people with low levels of schooling,
and makes it needlessly difficult for people using a speech
synthesizer to catch the right word spoken by the machine, especially
when English words are inserted right in the middle of Italian
sentences, in places where a listener would rather expect to find
Italian words.

Nevertheless such a habit exists, and is widespread. A realistic
approach to the changes-of-language issue within WCAG has to take
this reality into account. In a context such as the Italian web
community, it isn't realistic to claim that web developers have
enough time and ability to carry out, for each document published on
the Web, the gruelling work of meticulously reading all the content
and then recognizing and appropriately marking all the foreign words
and phrases used in the text.
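To give a sense of what that marking involves, here is a minimal
sketch in Python. It assumes the markup in question is the HTML
"lang" attribute on a span element, and it relies on a hypothetical,
hand-maintained list of English phrases (ENGLISH_PHRASES); neither the
list nor the function name mark_english comes from WCAG or from any
existing tool.

```python
import html
import re

# Hypothetical list of English phrases that an author would have to
# compile and maintain by hand for every site.
ENGLISH_PHRASES = ["home page", "download", "call center", "password"]


def mark_english(text, phrases=ENGLISH_PHRASES):
    """Escape text for HTML and wrap each listed phrase in <span lang="en">."""
    escaped = html.escape(text, quote=False)
    # Longest phrases first, so multi-word phrases win over their parts.
    ordered = sorted(phrases, key=len, reverse=True)
    alternatives = "|".join(re.escape(p) for p in ordered)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(
        lambda m: '<span lang="en">' + m.group(1) + "</span>", escaped
    )


print(mark_english("Torna alla home page"))
```

Even this toy version only finds phrases that somebody has already
listed; any English word missing from the list stays unmarked, which
is exactly the manual effort described above.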

I think it would be much more useful if WCAG 2.0, instead of requiring
that web developers appropriately mark _each_ foreign word or passage
inserted in the content, simply advised authors to cut down on
foreign words and phrases when a satisfactory and valid alternative
exists: one they would find if only they condescended to use the main
language of the document properly. By the way, for all the English
words I cited above, normally used in Italian web pages, there are
plenty of valid Italian alternatives...

>> 2. The present separation in L2 SC3, between words included and not included
>> in dictionaries, does not give a valid solution for a lot of situations
>> arising from intrinsic ambiguity and complexity of natural languages.
> 
> GV  Not sure I follow. The rule in #1 is that a dictionary be attached for
> all words in the content. And definitions be created for custom words that
> are not in any dictionary.

Dictionaries can address the issue of the meaning of the words used in
the content only once the user has understood the pronunciation of
those words. They can't, however, address the issue of recognizing
single words, for example when they are mispronounced by assistive
technologies. If a user isn't able to understand a word, either
because the word is mispronounced or because he isn't able to catch
its correct foreign pronunciation, there remains a hole, for the
user, in the speech he is listening to and trying to understand. What
good is the dictionary when I don't know what I am looking for?
Sometimes you can listen to the same sentence many times without
being able to understand what the hell that single word in the middle
of the sentence is, even though it is fundamental for comprehension.
The more foreign words are inserted into sentences written in the main
language of the document, the greater the risk that the listener will
not be able to understand all the words in the content, and probably
the whole meaning of the text.

>> 4. The identification of the natural language of each block of text in a web
>> page should be delegated to user agents.
> 
> GV  For larger blocks of text I think this may be workable. Little phrases
> could be harder. I would like to see this handled by User agents.  If they
> can (without markup) then the language of the words would be
> "programmatically determined" without needing markup. 

I don't see a difference in principle between the ability to recognize
a single word as English in the middle of a sentence written in French
or Italian, and the ability to determine that a whole sentence is
written in a given language, whatever it is. If an assistive
technology has a series of built-in dictionaries and speech engines
large enough to understand that "home page" is English and "prima
pagina" is Italian, it has everything necessary to pronounce those
words according to the phonetic rules of the respective languages, and
it doesn't matter whether these words are isolated within a text
written in a different language or are part of a block of text
written entirely in the same language.
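The dictionary lookup described here can be sketched in a few lines of
Python. This is only an illustration under strong assumptions: the
LEXICONS word lists are hypothetical and tiny, standing in for the
built-in dictionaries of an assistive technology, and the function
name tag_words is invented for the example.

```python
# Hypothetical toy word lists; real lexicons would hold many thousands
# of entries per language.
LEXICONS = {
    "en": {"home", "page", "download", "password", "login", "account"},
    "it": {"torna", "alla", "prima", "pagina", "principale", "scarica"},
}


def tag_words(text, main_lang):
    """Return (word, language) pairs for a whitespace-separated text.

    A word found in the main language's lexicon keeps the main language.
    Otherwise the first other lexicon containing it is used, and unknown
    words fall back to the main language.
    """
    tagged = []
    for word in text.split():
        key = word.strip(".,;:!?\"'()").lower()
        lang = main_lang
        if key not in LEXICONS.get(main_lang, set()):
            for code, words in LEXICONS.items():
                if code != main_lang and key in words:
                    lang = code
                    break
        tagged.append((word, lang))
    return tagged


print(tag_words("Torna alla home page", "it"))
```

The same lookup works whether the English words are isolated inside
an Italian sentence or make up a whole block of text, since each word
is looked up on its own.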

For example, a UTF-8 compliant browser can show side by side in the
same web page sentences written in many different languages, different
alphabets, and different text directions without any specific markup
differentiating one run of text from all the others [2]. Working on
the basis of the universality granted by Unicode, assistive
technologies could easily develop the ability to determine which
language a run of text is in, and pronounce it according to the
phonetic rules of that language. In the case of homographs shared by
more than one natural language, they could develop suitable algorithms
to choose from the context the most likely of all the possible
pronunciations.
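One very simple form such a context-based choice could take is a vote
among the neighbouring words. The Python sketch below is an assumption
made for illustration, not a description of any existing screen
reader: the LEXICONS word lists are hypothetical, "sale" is listed in
both languages on purpose (Italian "salt", English "sale"), and the
names languages_of and guess_language are invented.

```python
from collections import Counter

# Hypothetical toy lexicons; "sale" is a homograph present in both.
LEXICONS = {
    "it": {"il", "sale", "è", "sulla", "tavola"},
    "en": {"the", "sale", "summer", "ends", "on", "friday"},
}


def languages_of(word):
    """List every language whose lexicon contains the word."""
    key = word.strip(".,;:!?\"'()").lower()
    return [code for code, words in LEXICONS.items() if key in words]


def guess_language(words, index, main_lang, window=2):
    """Guess the language of words[index] from unambiguous neighbours."""
    candidates = languages_of(words[index])
    if len(candidates) == 1:
        return candidates[0]
    if not candidates:
        return main_lang  # unknown word: keep the document language
    start = max(0, index - window)
    neighbours = words[start:index] + words[index + 1:index + 1 + window]
    votes = Counter()
    for neighbour in neighbours:
        langs = languages_of(neighbour)
        if len(langs) == 1 and langs[0] in candidates:
            votes[langs[0]] += 1
    if not votes:
        return main_lang
    top = max(votes.values())
    winners = [lang for lang, count in votes.items() if count == top]
    # A tie gives no evidence either way, so keep the document language.
    return winners[0] if len(winners) == 1 else main_lang


print(guess_language("Il sale è sulla tavola".split(), 1, "it"))
print(guess_language("The summer sale ends on Friday".split(), 2, "it"))
```

A real system would need far richer evidence than a two-word window,
but the sketch shows that the decision can be driven by context rather
than by markup.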

> Do you know of tools
> that can do this?  Esp if they are publicly available.

At the moment I don't know of any tool capable of such an advanced
level of automatic linguistic recognition. Anyway, in my opinion the
present inadequacy of technologies isn't a valid reason to put on web
developers a burden totally disproportionate to the actual
competence of the vast majority of them.

Best regards,
Michele Diodati
--
http://www.diodati.org

[1] <http://www.w3.org/TR/WCAG20/#meaning-prog-located>.

[2] Here is a working example: <http://www.columbia.edu/kermit/utf8.html>.
Received on Thursday, 30 December 2004 17:05:17 UTC