Re: meta content-language from Martin Duerst on 2008-08-27 (www-international@w3.org from July to September 2008)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Wed, 27 Aug 2008 12:12:50 +0900
To: "Mark Davis" <mark.davis@icu-project.org>
Cc: "Julian Reschke" <julian.reschke@gmx.de>, "Leif Halvard Silli" <lhs@malform.no>, "Ian Hickson" <ian@hixie.ch>, "HTML WG" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>
Message-Id: <6.0.0.20.2.20080826104251.081d1608@localhost>
Hello Mark,

At 01:02 08/08/26, Mark Davis wrote:

>Mark
>
>
>On Mon, Aug 25, 2008 at 1:49 AM, Martin Duerst <duerst@it.aoyama.ac.jp> wrote:
>>At 23:04 08/08/22, Mark Davis wrote:

>>>1. Distinction in Language. Should there be a distinction in interpretation between the language set via lang attribute and meta content?
>>>
>>><html lang="foo">
>>>and
>>><meta http-equiv="Content-Language" content="foo"/>
>>>
>>>My take is that any such distinction would be a departure from current practice, and too fine a distinction for the vast majority of people to be able to follow.
>>
>>Such a distinction IS current practice. The former can only
>>contain one language, the latter can contain a priority list.
>
>True, there is that difference (as I noted below). And it is an unfortunate one.

Please avoid such judgement-only statements.


>But when there is a single language in both cases, the question is whether there is an established difference in semantics.

This is totally the wrong question. If there is only a single
language, everything falls together. Distinctions can be made,
but they are more theoretical than practical.

According to the priorities already mentioned, HTTP Content-Language
serves as a fallback for language information inside the document,
and so a single setting of HTTP Content-Language is supposed to
be sufficient for the case of monolingual documents.
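
As a rough illustration of the fallback order described above (not taken
from any spec text), the lookup could be sketched in Python as follows.
The function name effective_language and its parameters are hypothetical,
and treating a multi-valued Content-Language as "no single language" is
an assumption made only for this sketch.

```python
from typing import Optional, Sequence


def effective_language(lang_chain: Sequence[Optional[str]],
                       content_language: Optional[str] = None) -> Optional[str]:
    """Return the language that applies to one element, or None if unknown.

    lang_chain holds the lang/xml:lang values from the element itself
    outward to the root element (None where the attribute is absent).
    content_language is the HTTP Content-Language value, or its meta
    equivalent, if any.
    """
    # Language information inside the document takes priority:
    # the nearest lang/xml:lang declaration wins.
    for value in lang_chain:
        if value is not None:
            return value
    # Fallback: the Content-Language value, used here only when it
    # names exactly one language (assumption of this sketch).
    if content_language:
        tags = [t.strip() for t in content_language.split(",") if t.strip()]
        if len(tags) == 1:
            return tags[0]
    return None
```

For a monolingual document with no lang attributes at all,
effective_language([None, None], "ja") falls back to "ja", which is the
case where a single header setting is enough.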


>>Also, the former is used on the browser side or by editing tools,
>>whereas the latter is used by the server side (see e.g. the
>>examples that Roy gave).
>
>Are you really sure that the latter is only used by the server side, never by browsers? It would be interesting to see the evidence you base this on.

Where did I say "never by browsers"? Richard's tests show that there
is a certain level of usage on the browser side, although less than
for lang/xml:lang.


>>As for "too fine a distinction for the vast majority of people
>>to be able to follow", the people that we need to follow this
>>distinction are Web page/contents creators for pages with
>>multilingual content.
>>
>>The distinction is clearly given at
>>http://www.w3.org/International/tutorials/language-decl/#Slide0060.
>>If you think this is too difficult, and can be improved upon,
>>please tell us why/how.
>
>First, are you saying that this semantic distinction is present in the base standards defining those fields, and not just in these tutorials (to use your terms, "ab initio")? If so, I've missed it.

Where did you look? The first sentence of section 14.12 of RFC 2616,
the current HTTP spec (http://www.ietf.org/rfc/rfc2616.txt), says:

   The Content-Language entity-header field describes the natural
   language(s) of the intended audience for the enclosed entity. Note
   that this might not be equivalent to all the languages used within
   the entity-body.

Its predecessor, RFC 2068, now over 10 years old, says exactly
the same in Section 14.13, although using examples rather than
a definition-like statement.
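
To make the "intended audience" reading concrete, here is a small Python
sketch. It assumes the field value is a plain comma-separated list of
language tags; the helper names parse_content_language and
languages_outside_audience are invented for illustration and belong to
no existing library.

```python
from typing import List, Set


def parse_content_language(value: str) -> List[str]:
    """Split a Content-Language field value into its language tags."""
    return [tag.strip() for tag in value.split(",") if tag.strip()]


def languages_outside_audience(header_value: str,
                               body_languages: Set[str]) -> Set[str]:
    """Return languages used in the body but not listed as audience.

    A non-empty result is not an error: the header describes the
    intended audience, which need not cover every language that
    appears in the entity-body.
    """
    audience = {tag.lower() for tag in parse_content_language(header_value)}
    return {lang for lang in body_languages if lang.lower() not in audience}
```

For instance, a page served with "Content-Language: en" whose body has
runs tagged en and fr yields {"fr"} from languages_outside_audience,
which is exactly the situation the RFC's note allows for.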

HTML4 (http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.1)
mentions many usages related to language-specific rendering and
processing. Much of it goes back to http://www.ietf.org/rfc/rfc2070.txt
(see section 3), although that didn't mention spell checking.
[XML is more vague given the more abstract target and writing
style of that spec.]

While all of the usages mentioned in HTML4 and elsewhere are
valid, it is important to understand that the decisive driver
for including lang into HTML4, and for including xml:lang into
XML, was the perceived (by some quarters) inadequacy of Unicode
with respect to language-specific rendering (e.g. differences
between representative glyphs for some Unified Han Ideographs
with respect to Chinese/Japanese/..., or some typical rendering
differences for some characters between e.g. Serbian and Russian,...).


>Or are you saying that this is established practice, in which case I'd like to see the evidence for that.

Roy has mentioned how Content-Language is used for intended audience
via language negotiation. The W3C Web site definitely has examples
of this, as has the Apache Web site. Especially at W3C, there are
quite a few examples of pages that are language-negotiated
(intended audience) but contain snippets in languages not
included in the Content-Language header (or possibly its meta
equivalent).


>I think it is a hopeless task to distinguish between these semantics. It is hard enough to get people to say that a document is in Japanese (correctly), let alone to get them to make the very fine point that this document is in Japanese, but intended for French readers.

Is this a hypothetical example (in which case, please bring up a
better one) or an actual example (in which case, please point to
the actual document)? In general, in order for a document to be
targeted at readers of a certain language, a significant portion
of that document will be in that particular language. A typical
example would be the various language variants served at/accessible
from
http://www.w3.org/International/articles/inline-bidi-markup/Overview.php.

I think that authors of multilingual documents are well aware,
probably more implicitly than even explicitly, of the function
of each of their text pieces, and therefore of things such as
intended audience and processing language. The problem isn't
that such authors can't make this distinction, the problem is
much more that the available tools don't make it easy enough
to provide the information. On the HTTP (intended audience)
side, we have the well-known problems of metadata server setup;
on the @lang/@xml:lang side, we have the problem of editing
tools not providing easy enough ways to specify language for
pieces of text, and of not providing enough feedback and benefits
to make it worthwhile and eliminate most mistakes (the oft-cited
MS Word spell-checking being the laudable exception here).


>That is far, far too fine a distinction for people to reliably make, even such knowledgeable people as "Web page/contents creators for pages with multilingual content".

I think I have to strongly disagree here. As said above,
authors of multilingual content are implicitly aware of the
distinction. They just have to be prompted better, and
outreach work such as Richard's article is one way to start.


>The article has the feeling of trying to retrofit an existing situation in making a distinction

Again, wrong, as I showed above by citing the HTTP spec.



>that frankly, is an exceedingly small percentage of even multilingual contents,

I guess you are wrong here, too. Multilingual content ranges from
a basically monolingual page with just a few snippets in (a) different
language(s) to parallel texts. For purely parallel texts, the
distinction can be said to be irrelevant, but in my understanding,
purely parallel texts form a clear minority of multilingual content.
That would mean that the distinction is very relevant for the
majority of multilingual content.


>and will (I predict) never be followed with any reliability.

My guess is that it will be followed with about as much reliability
as language tagging itself. Which I guess we both agree isn't very
high, unfortunately.


>>i.e.
>>we can say <p lang='en'>He said "<span lang='fr'>Oui</span>"</p>.
>
>While we "can" say that, the question is how many do? What percentage of multilingual documents actually go to the trouble of marking each and every language run? Take a wild guess, and we'll see how accurate you are.

Again, like everything in language tagging, the answer is "not very many".
For "each and every run" the rate is clearly going to be extremely low,
because it's closer to asking "How many large collections of documents
have each and every document language-tagged correctly" than to asking
about the correct language tagging rate for single documents.
A more equivalent question might be: What percentage (in terms of
character count) of multilingual documents are tagged correctly?
Given that a lot of multilingual documents are documents in a single
language with very little additional content in another language,
my guess is that this percentage is not very far away from the
same rate for monolingual documents.
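
A character-count measure of this kind could be approximated along the
following lines. This is only a sketch using Python's standard
html.parser module; it assumes reasonably well-formed markup (every
non-void element has an end tag), counts non-whitespace characters, and
the names LangCharCounter and chars_by_language are made up for the
example.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not be pushed.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


class LangCharCounter(HTMLParser):
    """Count non-whitespace characters per effective lang value."""

    def __init__(self):
        super().__init__()
        self.stack = [None]      # effective language per open element
        self.counts = Counter()  # key None means "no language tagged"

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        lang = dict(attrs).get("lang")
        # An element without lang inherits from its parent.
        self.stack.append(lang if lang is not None else self.stack[-1])

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        n = sum(1 for ch in data if not ch.isspace())
        if n:
            self.counts[self.stack[-1]] += n


def chars_by_language(html):
    """Return a dict mapping each language tag to its character count."""
    parser = LangCharCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.counts)
```

Comparing such counts against a manual judgement of each run's language
would give the percentage asked about; the sketch only measures what is
tagged, not whether the tags are correct.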


>What I was saying is that where these tags are used, some information is better than none.

Here we get back to the "it depends on the application". For some
applications, "some information" is clearly better than nothing,
for other applications, a choice of languages is just as bad as
no language information.
As an example, if I know that some text is either French or English,
how should I spellcheck it? (it's possible to imagine spell-checkers
that take advantage of that information, but I don't know any that
would actually do so).

Also, the fact that many multilingual documents only contain small
snippets in other languages means that merging all appearing languages
at a point high up in the document structure may mean that there is
actually less (correct) information than if just the main language
were tagged.


>Given that people won't be tagging each and every run by the language, it is better to provide the information at the top that there is substantial content in X languages.

There is a high correlation between "target audience language"
and "substantial content language". So overall, the situation
isn't as bad as it may look.



>>And given that most XML applications (e.g. XSLT) have difficulty
>>handling even simple language information correctly, it doesn't
>>seem a good idea to bother applications with something more
>>complicated.

This is a point where I still have to see some response from you.
Do you disagree? Do you think it's irrelevant? Or what?

Regards,    Martin.



#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp      mailto:duerst@it.aoyama.ac.jp
Received on Wednesday, 27 August 2008 03:17:20 UTC