- From: Richard Ishida <ishida@w3.org>
- Date: Fri, 15 Aug 2008 21:23:15 +0100
- To: "'Mark Davis'" <mark.davis@icu-project.org>, "'Phillips, Addison'" <addison@amazon.com>
- Cc: "'Henri Sivonen'" <hsivonen@iki.fi>, "'Ian Hickson'" <ian@hixie.ch>, "'HTML WG'" <public-html@w3.org>, <www-international@w3.org>
Hi Mark,

Some comments from a different perspective below...

> From: mark.edward.davis@gmail.com [mailto:mark.edward.davis@gmail.com] On
> Behalf Of Mark Davis
> Sent: 15 August 2008 19:14
> To: Phillips, Addison ...
>
> Unfortunately, at least in our experience at Google, the language tags are very
> inaccurately applied, either as representing the target audience or the language
> of individual segments of text. And even more often they are just simply missing.
> We also see many cases of contradictions between the tags used with different
> mechanisms (http vs meta).

Note that contradiction is not always wrong. I imagine there are many
cases where the initial text processing language may be different from
the intended audience of the page served. The question, really, is how
many cases are problematic.

> The situation is similar to that of encoding tags, which
> are also wrong enough that they cannot be relied on.
>
> There are important differences, however. The first is that the encoding is much
> more commonly used; a reasonably high percentage of pages have encoding tags,
> while a rather small percentage have language tags.
>
> Secondly, if the encoding is wrong, users (especially of languages often encoded
> not in UTF-8 or Latin-1) are used to changing the encoding via a View>Text
> Encoding menu or equivalent. (Sadly, search engines don't have a horde of
> Mechanical Turks to do this step-- they have to do better than the combination of
> browser + user action, which is a pretty high bar!) Compare this with language tags,
> where there is typically no possible user menu to "change" the language of the
> document that s/he is viewing, let alone a menu to change the language of a
> selection (either sentence(s) or sentence fragment(s)).
>
> Another difference, and perhaps the most important one, is that web page
> authors only tend to fix issues that result in noticeable, testable problems. The
> value of tagging language accurately is not visible for the huge majority of web
> pages, since in the vast majority of cases there is no immediate noticeable
> difference for users.

Well, that depends on who you are and what you're trying to do. For some
of the things that I've tried to do, there *are* immediate noticeable
differences and I find language declaration very important at those
times.

> Probably the most noticeable effect, and an indirect one, is that accurate language
> tagging could theoretically make a difference for placement in search engines.
> This is, however, definitely a chicken-and-egg problem. Because language tagging
> is so inaccurate and so often missing, the search engines need to do mechanical
> language detection anyway. Because search engines do mechanical language
> detection anyway, there isn't much of a need to do accurate language tagging
> because it has no noticeable effects!
>
> The only place where it really would make a difference in practice is where
> mechanical detection has difficulties: in the few cases where there are languages
> that are quite close in terms of n-gram pairs and other characteristics commonly
> used for detection, such as Danish and Norwegian.
>
> So while I am all for clarifying standards, I'm not sure that in this area it will have
> very much practical import.

Hold on a second, Mark. There may not be much practical import for *you*
since what you're interested in appears to be primarily bulk-mode search
engine processing, but please don't discard the baby with the bath
water. ;-)

First, some of our discussion in this thread is aiming to clarify the
process of declaring language so that perhaps people will get it right
more often in future. We have also not yet tried to evangelise the
authoring tool developers.

But more importantly, there are many other potential uses for language
information (and I contend that there will be more in the future as user
agents deal better with advanced script requirements) that informed and
motivated people *can* use to good effect. Let's not just let the
language thing fall by the wayside because searching doesn't currently
use the information provided. It's not only about search results.

Btw, I also found your talk "Unicode at Google"[1] particularly
interesting since it shows a significant increase in the use of language
tagging in the 5 years up to 2006. You show 14.4% of all pages using
html lang - up from 1.88% in 2001. Meta language is used for 7.75% of
pages, rising only a relatively small amount from 4.4% in 2001. HTTP
Content-Language pops up on 6.01% of pages, up from 0.41% in 2001.

I don't know how much overlap there is between meta and html lang
labeling, though I'd be surprised if it was a lot. I would guess that
we're looking at around 20% of pages having some form of language
declaration in 2006, up from around 5% or so 5 years earlier. If that
trend continues, that's a pretty good number of pages.

Cheers,
RI

[1] http://www.macchiato.com/slides/unicode_at_google.ppt

>
> Mark

============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)

http://www.w3.org/International/
http://rishida.net/
Received on Friday, 15 August 2008 20:23:59 UTC