
Re: Usefulness of language annotations

From: Richard Ishida <ishida@w3.org>
Date: Fri, 29 Aug 2014 11:39:32 +0100
Message-ID: <54005864.4000904@w3.org>
To: "Jens O. Meiert" <jens@meiert.com>, W3C WAI GL <w3c-wai-gl@w3.org>
Hello Jens,

Here are some comments from me on your blog post to reply to the 
question "Hence, what data and evidence did I miss? And what else could, 
or should, we do?".

You say:

"Still, the W3C I18N Activity advises against using HTTP headers[1], at 
least alone: “Use language attributes rather than HTTP to declare the 
default language for text processing.” (There seem to be no strong 
reasons given, then, as the language declarations document referenced is 
rather neutral about HTTP headers.)"

[1] http://www.w3.org/TR/i18n-html-tech-lang/#overall

[2] http://www.w3.org/International/questions/qa-http-and-lang#http

Actually the reasons are given if you follow the link to more detail 
from [1], and from there the link to 
http://www.w3.org/International/questions/qa-http-and-lang. One major 
issue is that the http information is not available if the page is saved 
to disk, read from a CD, processed by XSLT, AJAX, etc. So you really 
need to use @lang in anticipation of those situations - but if you're 
using @lang as a backup for those situations, then also declaring the 
language on the server certainly reduces efficiency. Another is 
that many authors can't access server settings easily, or at all. 
Another is that there is a potential for confusion because the 
Content-Language header allows multiple language values (and browsers do 
not handle those interoperably) - this is valid for metadata 
about a resource, but not helpful for language processing. And of 
course, Content-Language doesn't set more than the overall default 
language for a page - when you need to be more specific about language 
change you'll need to use @lang. Bear in mind, also, that @lang on the 
html tag overrides the Content-Language information for all browsers, so 
ignoring @lang when it's provided automatically by an editor, say, could 
undo the value of an HTTP declaration. Also, there are tools that can be 
deployed on content while it is being written, such as spellcheckers and 
grammar checkers - I write plenty of multilingual documents where an 
authoring tool that recognises language changes saves a lot of 
frustrating false positives. However, the language information needs to 
be in the document rather than on the server for that to work. Similar 
situations apply when working with language-specific styling in an 
editor that provides a wysiwyg interface.
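
To make the precedence described above concrete, here is a minimal sketch 
in Python. It is an illustration only, not how any particular browser is 
implemented. The function name default_language is hypothetical, and 
treating a multi-valued Content-Language as "no usable default" is an 
assumption made for the sketch, since browser behaviour there is not 
interoperable.

```python
from html.parser import HTMLParser
from typing import Optional


class _HtmlLangFinder(HTMLParser):
    """Records the lang attribute of the first <html> start tag, if any."""

    def __init__(self) -> None:
        super().__init__()
        self.lang: Optional[str] = None
        self._seen_html = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag == "html" and not self._seen_html:
            self._seen_html = True
            self.lang = dict(attrs).get("lang") or None


def default_language(markup: str,
                     content_language: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: pick the default text-processing language.

    - lang on the html tag wins whenever it is present.
    - Otherwise a single-valued Content-Language is used as a fallback.
    - A multi-valued Content-Language is ignored (assumption of this
      sketch: it describes the audience, not the text).
    """
    finder = _HtmlLangFinder()
    finder.feed(markup)
    finder.close()
    if finder.lang:
        return finder.lang
    if content_language:
        values = [v.strip() for v in content_language.split(",") if v.strip()]
        if len(values) == 1:
            return values[0]
    return None
```

Calling default_language(page) with no header value models the page saved 
to disk or read from a CD: only the in-document declaration survives.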

Using @lang on the html or other elements, however, resolves all these 
issues, and in addition provides consistency between the way you mark up 
the default language of the page and language changes in the document.

"Next, and here it gets more interesting, it is completely unclear what 
tools actually use the information of inner-document language changes. 
Granted, this may be a knowledge gap on my end—being corrected is one 
reason why I write all of this down—, but from what I’ve seen so far, 
what I specifically understand some services like Google not to be 
doing, and even from my fading memories testing assistive tools, there’s 
not a great value in marking up changes in language."

I and others on this list already sent answers to this question. See 
http://lists.w3.org/Archives/Public/w3c-wai-gl/2014JulSep/0137.html and 
following emails in that thread. See also 

"On my mind—and matching Google doctrine—, language detection should be 
automated. It should be a software responsibility."

There may be places where auto-detection of language helps, but it 
becomes more and more difficult to do the smaller the sample available. 
There are also possibilities of ambiguity in phrases such as "The French 
for bread is pain.", which auto-detection is not at all likely to 
detect, but where you may still want as an author to prevent spell-check 
errors or incorrect voice browser renderings. It is also unlikely that 
you'll see ubiquitous deployment of those services for all the places 
HTML is used, and it's certainly not currently ubiquitously available, 
so abandoning @lang now is rather premature.
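
As a rough sketch of what explicit markup gives a tool such as a 
spellchecker or voice browser, the Python below splits a fragment into 
text runs labelled with their effective language, following lang 
inheritance. The names LangRuns and language_runs are hypothetical, the 
void-element list is abbreviated, and the code assumes reasonably 
well-formed markup. It is not a full HTML language-resolution algorithm.

```python
from html.parser import HTMLParser

# Elements with no end tag; abbreviated list, enough for the sketch.
VOID = {"br", "hr", "img", "input", "link", "meta", "wbr"}


class LangRuns(HTMLParser):
    """Collects (language, text) pairs, inheriting lang from ancestors."""

    def __init__(self):
        super().__init__()
        self._stack = [None]  # effective language of each open element
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attributes = dict(attrs)
        if "lang" in attributes:
            # An empty lang value means "unknown", represented as None.
            self._stack.append(attributes["lang"] or None)
        else:
            self._stack.append(self._stack[-1])

    def handle_endtag(self, tag):
        if tag not in VOID and len(self._stack) > 1:
            self._stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.runs.append((self._stack[-1], data))


def language_runs(markup):
    """Hypothetical helper: return the labelled text runs for a fragment."""
    parser = LangRuns()
    parser.feed(markup)
    parser.close()
    return parser.runs


sample = ('<html lang="en"><body><p>The French for bread is '
          '<span lang="fr">pain</span>.</p></body></html>')
for lang, text in language_runs(sample):
    print(lang, repr(text))
```

With the span in place, the word "pain" is labelled fr regardless of how 
short the sample is, which is the case where automatic detection is least 
likely to succeed.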

I think every page should specify a default language (in the html tag) 
if only because it future-proofs the page for new language-specific 
technologies which are currently on the horizon but which will soon be 
commonplace (many such are on the way with CSS3, others will come with 
language technology developments, etc...).  On the other hand, 
identification of language changes should be based on the author's 
expectation of usefulness. For example, will it help with spell-checking, 
with styling, with identification of fragments, with text-to-speech, 
etc.? There's certainly no need to mark up words like 'status quo'. If 
you don't want to mark up every change, that may be ok, but note that I 
think it's far easier to mark up all significant language changes than 
to each time debate with yourself where an automated approach will or 
won't succeed.

I have to say that I don't agree with your starting point that use of 
@lang poses significant problems of efficiency for content developers. I 
don't think it's problematic any more than a bunch of other attributes 
you would add as a matter of course to provide useful semantics in your 
markup. In fact, it's pretty easy to add, as markup goes.

One last thought: I think it would help to separate the discussion 
around the ideas of default language for the page (html @lang or 
Content-Language), and markup of language changes, since different 
criteria apply on the whole*.

Hope that helps,

* It's a slightly less clear distinction in the case of a multilingual 
document that uses different languages for large sections of a document, 
such as a French Canadian page that puts English on the left and French 
on the right, but it holds generally.

On 25/08/2014 16:42, Jens O. Meiert wrote:
> For who’s is interested in the topic, I’ve presented my view again in
> a different form and in more detail under
> http://meiert.com/en/blog/20140825/html-and-language/.
> TL;DR: Compared to current practices, it seems Content-Language could
> be preferred over @lang to denote document language, and—more
> importantly—detecting changes in language should probably be made a
> software responsibility.
> To avoid parallel list discussions (my bad) it may be more convenient
> to collect counter-arguments on the post.
Received on Friday, 29 August 2014 10:40:01 UTC
