- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Wed, 27 Aug 2008 12:12:50 +0900
- To: "Mark Davis" <mark.davis@icu-project.org>
- Cc: "Julian Reschke" <julian.reschke@gmx.de>, "Leif Halvard Silli" <lhs@malform.no>, "Ian Hickson" <ian@hixie.ch>, "HTML WG" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>
Hello Mark,

At 01:02 08/08/26, Mark Davis wrote:
>Mark
>
>
>On Mon, Aug 25, 2008 at 1:49 AM, Martin Duerst <duerst@it.aoyama.ac.jp> wrote:
>>At 23:04 08/08/22, Mark Davis wrote:
>>>1. Distinction in Language. Should there be a distinction in interpretation between the language set via lang attribute and meta content?
>>>
>>><html lang="foo">
>>>and
>>><meta http-equiv="Content-Language" content="foo"/>
>>>
>>>My take is that any such distinction would be a departure from current practice, and too fine a distinction for the vast majority of people to be able to follow.
>>
>>Such a distinction IS current practice. The former can only
>>contain one language, the latter can contain a priority list.
>
>True, there is that difference (as I noted below). And it is an unfortunate one.

Please avoid such judgement-only statements.

>But when there is a single language in both cases, the question is whether there is an established difference in semantics.

This is totally the wrong question. If there is only a single language, everything falls together. Distinctions can be made, but they are more theoretical than practical. According to the priorities already mentioned, HTTP Content-Language serves as a fallback for language information inside the document, and so a single setting of HTTP Content-Language is supposed to be sufficient for the case of monolingual documents.

>>Also, the former is used on the browser side or by editing tools,
>>whereas the latter is used by the server side (see e.g. the
>>examples that Roy gave).
>
>Are you really sure that the latter is only used by the server side, never by browsers? It would be interesting to see the evidence you base this on.

Where did I say "never by browsers"? Richard's tests show that there is a certain level of usage on the browser side, although less than for lang/xml:lang.
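
To make the fallback relationship concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, and the rule that a Content-Language value naming several languages yields no text-processing language is an assumed policy, not something RFC 2616 or HTML4 prescribes.

```python
from typing import List, Optional


def parse_content_language(value: str) -> List[str]:
    """Split a Content-Language value (HTTP header or meta equivalent) into tags."""
    return [tag.strip() for tag in value.split(",") if tag.strip()]


def text_processing_language(lang_attr: Optional[str],
                             content_language: Optional[str]) -> Optional[str]:
    """Pick the language to use for rendering or spell-checking a piece of text.

    Assumed policy (illustrative only): an explicit lang/xml:lang value wins;
    Content-Language is a fallback, and only when it names exactly one language.
    """
    if lang_attr and lang_attr.strip():
        return lang_attr.strip()
    audience = parse_content_language(content_language or "")
    if len(audience) == 1:
        return audience[0]
    # Several audience languages (or none): the text language stays unknown.
    return None
```

With this sketch, a monolingual page served with only `Content-Language: ja` resolves to `ja`, an explicit `lang="fr"` overrides any header, and a header such as `en, ja` with no lang attribute leaves the text language undetermined.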

>>As for "too fine a distinction for the vast majority of people
>>to be able to follow", the people that we need to follow this
>>distinction are Web page/contents creators for pages with
>>multilingual content.
>>
>>The distinction is clearly given at
>>http://www.w3.org/International/tutorials/language-decl/#Slide0060.
>>If you think this is too difficult, and can be improved upon,
>>please tell us why/how.
>
>First, are you saying that this semantic distinction is present in the base standards defining those fields, and not just in these tutorials (to use your terms, "ab initio")? If so, I've missed it.

Where did you look? The first sentence of section 14.12 of RFC 2616, the current HTTP spec (http://www.ietf.org/rfc/rfc2616.txt), says:

   The Content-Language entity-header field describes the natural
   language(s) of the intended audience for the enclosed entity. Note
   that this might not be equivalent to all the languages used within
   the entity-body.

Its predecessor, RFC 2068, now over 10 years old, says exactly the same in Section 14.13, although using examples rather than a definition-like statement.

HTML4 (http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.1) mentions many usages related to language-specific rendering and processing. Much of it goes back to http://www.ietf.org/rfc/rfc2070.txt (see section 3), although that didn't mention spell checking. [XML is more vague given the more abstract target and writing style of that spec.]

While all of the usages mentioned in HTML4 and elsewhere are valid, it is important to understand that the decisive driver for including lang into HTML4, and for including xml:lang into XML, was the perceived (by some quarters) inadequacy of Unicode with respect to language-specific rendering (e.g. differences between representative glyphs for some Unified Han Ideographs with respect to Chinese/Japanese/..., or some typical rendering differences for some characters between e.g. Serbian and Russian,...).

>Or are you saying that this is established practice, in which case I'd like to see the evidence for that.

Roy has mentioned how Content-Language is used for intended audience via language negotiation. The W3C Web site definitely has examples of this, as has the Apache Web site. Especially at W3C, there are quite a few examples of pages that are language-negotiated (intended audience) but contain snippets in languages not included in the Content-Language header (or possibly its meta equivalent).

>I think it is a hopeless task to distinguish between these semantics. It is hard enough to get people to say that a document is in Japanese (correctly), let alone to get them to make the very fine point that this document is in Japanese, but intended for French readers.

Is this a hypothetical example (in which case, please bring up a better one) or an actual example (in which case, please point to the actual document)?

In general, in order for a document to be targeted at readers of a certain language, a significant portion of that document will be in that particular language. A typical example would be the various language variants served at/accessible from http://www.w3.org/International/articles/inline-bidi-markup/Overview.php.

I think that authors of multilingual documents are well aware, probably more implicitly than explicitly, of the function of each of their text pieces, and therefore of things such as intended audience and processing language. The problem isn't that such authors can't make this distinction; the problem is much more that the available tools don't make it easy enough to provide the information. On the HTTP (intended audience) side, we have the well-known problems of metadata server setup; on the @lang/@xml:lang side, we have the problem of editing tools not providing easy enough ways to specify language for pieces of text, and of not providing enough feedback and benefits to make it worthwhile and eliminate most mistakes (the oft-cited MS Word spell-checking being the laudable exception here).

>That is far, far too fine a distinction for people to reliably make, even such knowledgeable people as "Web page/contents creators for pages with multilingual content".

I think I have to strongly disagree here. As said above, authors of multilingual content are implicitly aware of the distinction. They just have to be prompted better, and outreach work such as Richard's article is one way to start.

>The article has the feeling of trying to retrofit an existing situation in making a distinction

Again, wrong, as I showed above by citing the HTTP spec.

>that frankly, is an exceedingly small percentage of even multilingual contents,

I guess you are wrong here, too. Multilingual content ranges from a basically monolingual page with just a few snippets in (a) different language(s) to parallel texts. For purely parallel texts, the distinction can be said to be irrelevant, but in my understanding, purely parallel texts form a clear minority of multilingual content. That would mean that the distinction is very relevant for the majority of multilingual content.

>and will (I predict) never be followed with any reliability.

My guess is that it will be followed about as reliably as language tagging itself. Which I guess we both agree isn't very high, unfortunately.

>>i.e.
>>we can say <p lang='en'>He said "<span lang='fr'>Oui</span>"</p>.
>
>While we "can" say that, the question is how many do? What percentage of multilingual documents actually go to the trouble of marking each and every language run? Take a wild guess, and we'll see how accurate you are.

Again, like everything in language tagging, the answer is "not very many". For "each and every run" the rate is clearly going to be extremely low, because it's closer to asking "How many large collections of documents have each and every document language-tagged correctly" than to asking about the correct language tagging rate for single documents.

A more equivalent question might be: What percentage (in terms of character count) of multilingual documents is tagged correctly? Given that a lot of multilingual documents are documents in a single language with very little additional content in another language, my guess is that this percentage is not very far away from the same rate for monolingual documents.

>What I was saying is that where these tags are used, some information is better than none.

Here we get back to the "it depends on the application". For some applications, "some information" is clearly better than nothing; for other applications, a choice of languages is just as bad as no language information. As an example, if I know that some text is either French or English, how should I spellcheck it? (It's possible to imagine spell-checkers that take advantage of that information, but I don't know any that would actually do so.)

Also, the fact that many multilingual documents only contain small snippets in other languages means that merging all appearing languages at a point high up in the document structure may mean that there is actually less (correct) information than if just the main language were tagged.

>Given that people won't be tagging each and every run by the language, it is better to provide the information at the top that there is substantial content in X languages.

There is a high correlation between "target audience language" and "substantial content language". So overall, the situation isn't as bad as it may look.

>>And given that most XML applications (e.g. XSLT) have difficulties
>>to handle even simple language information correctly, it doesn't
>>seem a good idea to bother applications with something more
>>complicated.

This is a point where I still have to see some response from you. Do you disagree? Do you think it's irrelevant? Or what?

Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp

Received on Wednesday, 27 August 2008 03:17:20 UTC