- From: Mark Davis <mark.davis@icu-project.org>
- Date: Fri, 15 Aug 2008 11:14:20 -0700
- To: "Phillips, Addison" <addison@amazon.com>
- Cc: "Henri Sivonen" <hsivonen@iki.fi>, "Richard Ishida" <ishida@w3.org>, "Ian Hickson" <ian@hixie.ch>, "HTML WG" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
- Message-ID: <30b660a20808151114y8293b36xc67f1a8f966e789b@mail.gmail.com>
Unfortunately, at least in our experience at Google, language tags are very inaccurately applied, whether they are meant to represent the target audience or the language of individual segments of text. Even more often they are simply missing. We also see many contradictions between the tags supplied through different mechanisms (HTTP vs. meta).

The situation is similar to that of encoding tags, which are also wrong often enough that they cannot be relied on. There are important differences, however. The first is that encoding tags are much more commonly used; a reasonably high percentage of pages have encoding tags, while a rather small percentage have language tags.

Secondly, if the encoding is wrong, users (especially of languages often encoded in something other than UTF-8 or Latin-1) are used to changing the encoding via a View>Text Encoding menu or equivalent. (Sadly, search engines don't have a horde of Mechanical Turks to do this step; they have to do better than the combination of browser + user action, which is a pretty high bar!) Compare this with language tags, where there is typically no menu at all for "changing" the language of the document being viewed, let alone a menu to change the language of a selection (either sentence(s) or sentence fragment(s)).

Another difference, and perhaps the most important one, is that web page authors tend to fix only issues that result in noticeable, testable problems. The value of tagging language accurately is not visible for the huge majority of web pages, since in the vast majority of cases there is no immediate noticeable difference for users.

Probably the most noticeable effect, and an indirect one, is that accurate language tagging could theoretically make a difference for placement in search engines. This is, however, definitely a chicken-and-egg problem. Because language tagging is so inaccurate and so often missing, the search engines need to do mechanical language detection anyway.
Because search engines do mechanical language detection anyway, there isn't much of a need to do accurate language tagging because it has no noticeable effects! The only place where it really would make a difference in practice is where mechanical detection has difficulties: in the few cases where there are languages that are quite close in terms of n-gram pairs and other characteristics commonly used for detection, such as Danish and Norwegian.

So while I am all for clarifying standards, I'm not sure that in this area it will have very much practical import.

Mark

On Fri, Aug 15, 2008 at 8:26 AM, Phillips, Addison <addison@amazon.com> wrote:

> > The spec could make multiple language tags in Content-Language
> > non-conforming and could make processing pick the first language tag.
>
> In addition to being incompatible with existing Web content, I really don't
> see why we need to change the Content-Language meta tag from indicating the
> target audience to indicating the processing language. Since browsers don't
> make use of this information today for processing the text, we'd be better
> to make existing practice formalized than to change semantics.
>
> > > 2. the meta approach is really not used by anything according to
> > > the tests I did
> >
> > Given that people do put and have put language declarations there,
> > is it good to keep ignoring that data?
>
> We don't have to ignore it. We can use that data for its most useful
> purpose, which is as metadata about the author's intentions (much like
> "keyword" was supposed to work).
>
> > Of course, if the data is *wrong* significantly more often than
> > lang='' (assuming that the correctness level of lang='' establishes
> > an implicit data quality baseline), it would be good to ignore it. My
> > guess is that HTTP-level Content-Language is more likely to be wrong
> > (it sure is less obvious to diagnose) than any HTML-level
> > declaration.
>
> You could insert the never-ending saga of <meta> charset vs. HTTP charset
> here for comparison purposes :-).
>
> > > 3. the question of inheritance is unclear when using the meta
> > > statement for declaring the text-processing language
> >
> > The spec now makes it clear.
>
> ... and Richard and I are trying to get you to make a different bit of
> clarity here.
>
> I would add: having an over-arching "default text processing language" above
> the <html> element would probably create additional problems for
> implementation of CSS :lang pseudo-attribute, etc., that do language
> selection in documents by having something outside the parse tree affect the
> value of the (implied) xml:lang/html lang.
>
> > > If the meta statement continues to be allowed, I suggest that it is
> > > used in the same way as a Content-Language declaration in the HTTP
> > > header, i.e. as metadata about the document as a whole, but that such
> > > usage is kept separate from use for defining the language of a range
> > > of content. As far as I can tell, although Frontpage uses it and
> > > people on the Web recommend its use, it has no effect at all on
> > > content, and wouldn't be missed if it were dropped.
> >
> > What purpose does metadata serve if it isn't actionable?
>
> There are many uses for finding out the author's intended audience. A
> document, for example, might be mostly in Japanese although it serves an
> English-speaking audience. For example, it might be examples of Japanese
> writing with short descriptions in English. Other documents might be
> side-by-side (parallel) translations. The text processing language in these
> cases will follow specific spans of text; the audience, however, might not
> be one of the two streams of text.
>
> Another use would be with language negotiation. The text processing
> language isn't as interesting as the author's intended audience in this
> case. A server might implement BCP 47's Lookup or Filtering algorithms
> against a user's Accept-Language to select content. Having the author's
> intended audience(s) in a Content-Language <meta> tag would facilitate that
> more readily.
>
> Anyway, that's my €0,02.
>
> Addison
>
> Addison Phillips
> Globalization Architect -- Lab126
>
> Internationalization is not a feature.
> It is an architecture.
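
For readers unfamiliar with the language negotiation mentioned in the quoted message, the following is a minimal sketch of the Lookup scheme from RFC 4647 (the matching half of BCP 47), applied to an Accept-Language value. It is an illustration only, not code from anyone in this thread: the function names parse_accept_language and lookup are hypothetical, and the list of available tags is assumed to come from whatever audience metadata (for example a Content-Language declaration) the server has for each variant of a document.

```python
def parse_accept_language(header):
    """Return the language ranges in an Accept-Language value, best first."""
    ranges = []
    for position, item in enumerate(header.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag:
            continue
        quality = 1.0  # a range without a q parameter has quality 1
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # unparsable quality: treat as unacceptable
        if quality > 0:
            ranges.append((-quality, position, tag))
    # Sort by descending quality, keeping header order for ties.
    return [tag for _, _, tag in sorted(ranges)]


def lookup(priority_list, available, default=None):
    """RFC 4647 Lookup: truncate each range from the end until a tag matches."""
    tags = {tag.lower(): tag for tag in available}
    for language_range in priority_list:
        if language_range == "*":
            continue  # the wildcard range is skipped during Lookup
        subtags = language_range.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in tags:
                return tags[candidate]
            subtags.pop()
            # A single-character subtag (such as "x") is removed together
            # with the subtag that followed it.
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default  # no range matched; the caller picks a fallback


# Hypothetical usage: variants declared for "ja", "en" and "de-CH" audiences.
preferred = parse_accept_language("ja-JP, en;q=0.5")
chosen = lookup(preferred, ["en", "ja", "de-CH"], default="en")
```

Lookup returns at most one tag, which suits picking a single variant to serve; the Filtering scheme from the same RFC would instead return every available tag matching a range.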
Received on Friday, 15 August 2008 18:14:59 UTC