RE: meta content-language

(laughing) The “text processing language” has very little effect on how browsers process the text, so there is little reward or punishment for using or abusing language tagging. Hence all the “Twi” on the Web: the tag ‘tw’ actually identifies the Twi language, but the authors who use it obviously intend it to mean “Taiwan”…

So I don’t disagree with your analysis and I would be extremely surprised if matters improved very much any time soon. Sturgeon’s Law will still apply to Web content (and especially language tags) when we’re done with this work.

The main beneficiaries of good markup in the short term would be the page authors themselves, since authoring tools interact with features such as spell- and grammar-checking that are strongly language-dependent. And tool makers might very well provide better automagic tagging if they could improve their own features based on embedded tags.

There will also be improvements in accuracy if we can provide better support for using the language markup in other web technologies (such as recent work in CSS, RDF, etc.). While language markup is not as critical to the user’s page experience as character encoding, better support would give experienced, careful page authors features that produce better (and more consistent) results and thus drive an incremental improvement of the Web at large. This may still not help Google ;-), but providing a consistent, useful, well-documented feature would still be a benefit to those who use it wisely.
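
To illustrate the sort of payoff I mean, CSS can already select on the
declared language via the :lang() pseudo-class. A minimal, hypothetical
sketch (not from any real page):

  <html lang="en">
    <body>
      <p>She said <q lang="fr">bonjour</q> and left.</p>
    </body>
  </html>

  /* Quotation marks follow the declared language: */
  q:lang(en) { quotes: '“' '”'; }
  q:lang(fr) { quotes: '« ' ' »'; }

An author who tags the span accurately gets French guillemets for free;
one who doesn't, doesn't.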

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.

From: mark.edward.davis@gmail.com [mailto:mark.edward.davis@gmail.com] On Behalf Of Mark Davis
Sent: Friday, August 15, 2008 11:14 AM
To: Phillips, Addison
Cc: Henri Sivonen; Richard Ishida; Ian Hickson; HTML WG; public-i18n-core@w3.org
Subject: Re: meta content-language

Unfortunately, at least in our experience at Google, the language tags are very inaccurately applied, whether as representing the target audience or the language of individual segments of text. Even more often they are simply missing. We also see many cases of contradictions between the tags supplied via different mechanisms (HTTP vs. <meta>). The situation is similar to that of encoding tags, which are also wrong often enough that they cannot be relied on.
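
To give a flavor of the contradictions (a made-up but representative
example):

  HTTP/1.1 200 OK
  Content-Type: text/html; charset=ISO-8859-1
  Content-Language: fr

  <html lang="de">
  <head>
    <meta http-equiv="Content-Language" content="en">
    ...

Three mechanisms, three different answers, and a consumer of the data
has no principled way to pick one.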

There are important differences, however. The first is that encoding declarations are much more commonly used: a reasonably high percentage of pages have encoding tags, while a rather small percentage have language tags.

Secondly, if the encoding is wrong, users (especially users of languages whose pages are often in encodings other than UTF-8 or Latin-1) are used to changing the encoding via a View > Text Encoding menu or its equivalent. (Sadly, search engines don't have a horde of Mechanical Turks to do this step; they have to do better than the combination of browser + user action, which is a pretty high bar!) Compare this with language tags, where there is typically no user menu for "changing" the language of the document being viewed, let alone a menu for changing the language of a selection (whether whole sentences or sentence fragments).

Another difference, and perhaps the most important one, is that web page authors tend to fix only those issues that result in noticeable, testable problems. The value of accurate language tagging is invisible for the vast majority of web pages, since it makes no immediately noticeable difference for users.

Probably the most noticeable effect, and an indirect one, is that accurate language tagging could theoretically make a difference in placement in search engines. This is, however, definitely a chicken-and-egg problem. Because language tagging is so inaccurate and so often missing, the search engines need to do mechanical language detection anyway. And because search engines do mechanical language detection anyway, there isn't much need to tag language accurately, since doing so has no noticeable effect!

The only place where it would really make a difference in practice is where mechanical detection has difficulties: the few cases where two languages are very close in terms of the n-gram statistics and other characteristics commonly used for detection, such as Danish and Norwegian.

So while I am all for clarifying standards, I'm not sure that in this area it will have very much practical import.

Mark

On Fri, Aug 15, 2008 at 8:26 AM, Phillips, Addison <addison@amazon.com> wrote:
>
> The spec could make multiple language tags in Content-Language non-
> conforming and could make processing pick the first language tag.
In addition to the compatibility problem this would create with existing Web content, I really don't see why we need to change the Content-Language meta tag from indicating the target audience to indicating the processing language. Since browsers don't use this information today for processing the text, we'd be better off formalizing existing practice than changing the semantics.
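
For reference, the construct under discussion looks like this (values
made up):

  <meta http-equiv="Content-Language" content="en, fr, de">

Under the proposal this would be non-conforming, and a processor would
pick "en" as the text-processing language; under the audience reading it
simply says the page is intended for readers of any of the three.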

>
> > 2. the meta approach is really not used by anything according to
> > the tests I did
>
> Given that people do put and have put language declarations there,
> is it good to keep ignoring that data?
We don't have to ignore it. We can use that data for its most useful purpose: as metadata about the author's intentions (much like "keywords" was supposed to work).

>
> Of course, if the data is *wrong* significantly more often than
> lang='' (assuming that the correctness level of lang='' establishes an
> implicit data quality baseline), it would be good to ignore it. My
> guess is that HTTP-level Content-Language is more likely to be wrong
> (it sure is less obvious to diagnose) than any HTML-level declaration.
You could insert the never-ending saga of <meta> charset vs. HTTP charset here for comparison purposes :-).

>
> > 3. the question of inheritance is unclear when using the meta
> > statement for declaring the text-processing language
>
> The spec now makes it clear.
... and Richard and I are trying to get you to clarify it in a different way here.

I would add: having an over-arching "default text processing language" above the <html> element would probably create additional problems for implementations of the CSS :lang() pseudo-class and other mechanisms that do language selection in documents, because something outside the parse tree would then affect the value of the (implied) xml:lang/HTML lang.
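
To make that concrete (a hypothetical case): given the stylesheet rule

  :lang(fr) { font-style: italic; }

and a document containing no lang attributes at all but served with
"Content-Language: fr", a default text-processing language established
outside the tree would suddenly make that selector match every element,
even though nothing in the markup says the document is French.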

>
> > If the meta statement continues to be allowed, I suggest that it is
> > used in the same way as a Content-Language declaration in the HTTP
> > header, ie. as metadata about the document as a whole, but that such
> > usage is kept separate from use for defining the language of a range
> > of content. As far as I can tell, although Frontpage uses it and
> > people on the Web recommend its use, it has no effect at all on
> > content, and wouldn't be missed if it were dropped.
>
> What purpose does metadata serve if it isn't actionable?
>
There are many uses for knowing the author's intended audience. A document might be mostly in Japanese although it serves an English-speaking audience: for example, samples of Japanese writing with short descriptions in English. Other documents might be side-by-side (parallel) translations. In these cases the text-processing language follows specific spans of text, while the audience might not correspond to any single one of those streams.
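
A hypothetical sketch of the first case:

  <html lang="en">
  <head>
    <!-- audience metadata: the page is aimed at English readers -->
    <meta http-equiv="Content-Language" content="en">
  </head>
  <body>
    <p>The polite copula is written <span lang="ja">です</span> (desu).</p>
  </body>
  </html>

Inline lang attributes carry the text-processing language (Japanese
where it matters), while the <meta> records the audience.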

Another use would be language negotiation. Here the text-processing language isn't as interesting as the author's intended audience. A server might implement BCP 47's Lookup or Filtering algorithms against a user's Accept-Language header to select content, and having the author's intended audience(s) in a Content-Language <meta> tag would make that easier.
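
A hypothetical exchange, assuming the server keeps English and Japanese
variants of a page:

  GET /guide HTTP/1.1
  Accept-Language: ja, en;q=0.5

  HTTP/1.1 200 OK
  Content-Language: ja

With BCP 47 Lookup, the server tries "ja" first, finds a Japanese
variant, and serves it; the Content-Language it returns (or the
equivalent <meta> in the page) records the audience of what was chosen.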

Anyway, that's my €0,02.

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.

Received on Friday, 15 August 2008 18:36:52 UTC