Re: meta content-language

From: Mark Davis <mark.davis@icu-project.org>
Date: Mon, 25 Aug 2008 09:02:32 -0700
Message-ID: <30b660a20808250902w209480a6pf07db76ed55439e@mail.gmail.com>
To: "Martin Duerst" <duerst@it.aoyama.ac.jp>
Cc: "Julian Reschke" <julian.reschke@gmx.de>, "Leif Halvard Silli" <lhs@malform.no>, "Ian Hickson" <ian@hixie.ch>, "HTML WG" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>
Mark


On Mon, Aug 25, 2008 at 1:49 AM, Martin Duerst <duerst@it.aoyama.ac.jp> wrote:

> At 23:04 08/08/22, Mark Davis wrote:
> >I'm kinda lost in this thread so far.
>
> This may be due to the fact that you don't seem to be too
> familiar with existing practice and history.
>
>
> >It seems to me the questions at hand are:
> >
> >1. Distinction in Language. Should there be a distinction in
> >interpretation between the language set via lang attribute and meta content?
> >
> ><html lang="foo">
> >and
> ><meta http-equiv="Content-Language" content="foo"/>
> >
> >My take is that any such distinction would be a departure from current
> >practice, and too fine a distinction for the vast majority of people to be
> >able to follow.
>
> Such a distinction IS current practice. The former can only
> contain one language, the latter can contain a priority list.


True, there is that difference (as I noted below). And it is an unfortunate
one. But when there is a single language in both cases, the question is
whether there is an established difference in semantics.


> Also, the former is used on the browser side or by editing tools,
> whereas the latter is used by the server side (see e.g. the
> examples that Roy gave).


Are you really sure that the latter is only used by the server side, never
by browsers? It would be interesting to see the evidence you base this on.


>
> As for "too fine a distinction for the vast majority of people
> to be able to follow", the people that we need to follow this
> distinction are Web page/contents creators for pages with
> multilingual content.
>
> The distinction is clearly given at
> http://www.w3.org/International/tutorials/language-decl/#Slide0060.
> If you think this is too difficult, and can be improved upon,
> please tell us why/how.


First, are you saying that this semantic distinction is present in the base
standards defining those fields, and not just in these tutorials (to use
your terms, "ab initio")? If so, I've missed it. Or are you saying that this
is established practice, in which case I'd like to see the evidence for
that.

I think it is a hopeless task to distinguish between these semantics. It is
hard enough to get people to say that a document is in Japanese (correctly),
let alone to get them to make the very fine point that this document is in
Japanese, but intended for French readers. That is far, far too fine a
distinction for people to reliably make, even such knowledgeable people as
"Web page/contents creators for pages with multilingual content". The
article has the feeling of trying to retrofit onto an existing situation a
distinction that, frankly, applies to an exceedingly small percentage of
even multilingual content, and will (I predict) never be followed with any
reliability.

> >2. Language Inheritance. If there are conflicting languages, what should
> >win? (or in other words, what's the inheritance?)
> >
> >(HTTP) Content-Language: lang1
> ><meta http-equiv="Content-Language" content="lang2"/>
> ><html lang="lang4" xml:lang="lang3">
> ><p lang="lang5">
>
>
>
> [please note that <meta> comes after <html> in an HTML document]
>
>
> >My take is that HTML5 has it right, that the winner/inheritance should be
> >in the above order: lang5 wins over lang4 over lang3 over lang2 over lang1.
>
> What HTML5 currently says may make some sense if argued ab initio.
> Based on existing standards and practice, ignoring lang2 for
> language-oriented purposes is well justified because it is wide practice.


Arguing on the basis that a field is ignored in wide practice is a bit
dangerous, since it is wide practice by some major vendors to ignore all of
these fields ;-)
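
To make the ordering in question 2 concrete, here is a rough sketch in
Python of the precedence I read HTML5 as proposing. The function and
parameter names are my own invention for illustration, not taken from any
spec, and I am assuming that an explicitly empty value counts as a
declaration of "und" rather than as "not declared".

```python
from typing import Optional


def effective_language(
    http_header: Optional[str] = None,    # lang1: HTTP Content-Language
    meta: Optional[str] = None,           # lang2: <meta http-equiv="Content-Language">
    root_xml_lang: Optional[str] = None,  # lang3: xml:lang on <html>
    root_lang: Optional[str] = None,      # lang4: lang on <html>
    element_lang: Optional[str] = None,   # lang5: lang on the element itself
) -> str:
    """Return the winning language declaration for one element.

    Hypothetical helper: None means "this source was not declared at all".
    """
    # Walk from the most specific source to the least specific one;
    # the first source that was declared wins, as in lang5 > ... > lang1.
    for value in (element_lang, root_lang, root_xml_lang, meta, http_header):
        if value is not None:
            # Assumption: an empty declaration means "unknown" (und).
            return value.strip() or "und"
    # Nothing declared anywhere: treat the language as unknown.
    return "und"
```

With all five sources set, effective_language("lang1", "lang2", "lang3",
"lang4", "lang5") picks "lang5"; with only the HTTP header and the meta
element set, the meta value wins.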


>
>
> >3. Language Values. Should the value of any of these fields be a single
> >language tag or also allow a priority list (both as defined by BCP47)?
> >
> >Note that it can be zero (""), which is equivalent to "und" (Unknown
> >language) in BCP 47.
> >
> >Here I think we'd be somewhat better off if the value could be a priority
> >list, eg "de, fr, en". For example, if the html lang value were "de, fr,
> >en", that would mean that there wasn't any substantial amount of linguistic
> >content other than these three, and that the relationship was de >= fr >=
> >en. Due to the ordering, if you had software that could only handle a single
> >language, then de would be that value.
> >
> >Documents may contain a mixture of languages, and allowing them to be
> >tagged at a high level with a priority list would allow people to reflect
> >that reality without having to tag each and every element with the right
> >language. Software can make use of that information, for example, in ranking
> >the document with respect to the language of search queries. With a search
> >query in "fr", a document with html lang of "de, fr" could be treated
> >differently than if it just had "de".
> >
> >However, that may be too big a departure from current practice.
>
> As you say in a followup post, HTTP Content-Language and <meta
> (because it is equivalent to HTTP Content-Language) take a language
> priority list, but the lang and xml:lang attributes don't.
>
> My take is that this is as it should be: Documents are often enough
> multilingual that it would be a bad idea to ignore this case.
>
> On the other hand, individual document pieces can at some level be
> identified as being in one (or no) language. Allowing multiple
> languages for document pieces would only bring very, very limited
> benefits at significantly higher costs (even if we could design
> HTML and XML anew and would not have to consider the existing base).
>
> There are multiple possible semantics for multiple languages
> (I'm using the attribute name multilang to not confuse people):
> - Alternative, unclear (e.g. <span multilang='en, fr'>cat</span>)
> - Alternative, both (e.g. <span multilang='en, fr'>excellent</span>;
>  sure there are better examples)
> - Summary (e.g. <p multilang='en, fr'>He said "Oui"</p>)
>
> Obviously, having all of these doesn't help much for applications,
> and having only one of these eliminates the others. Probably the
> last one is what most people might expect, but it isn't really
> necessary assuming that the markup is reasonably designed, i.e.
> we can say <p lang='en'>He said "<span lang='fr'>Oui</span>"</p>.


While we "can" say that, the question is how many do? What percentage of
multilingual documents actually go to the trouble of marking each and every
language run? Take a wild guess, and we'll see how accurate you are.

What I was saying is that where these tags are used, some information is
better than none. Given that people won't be tagging each and every run by
the language, it is better to provide the information at the top that there
is substantial content in X languages.
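
As a rough illustration of how software could use such a top-level list,
here is a small Python sketch. The function names are hypothetical, and the
matching is a deliberately crude prefix comparison rather than a full
BCP 47 matching implementation; treat it as a sketch under those
assumptions, not as what any existing consumer does.

```python
from typing import List, Optional


def parse_priority_list(value: str) -> List[str]:
    """Split a value such as "de, fr, en" into an ordered list of tags."""
    # Language tags are case-insensitive, so normalise to lower case.
    tags = [tag.strip().lower() for tag in value.split(",")]
    tags = [tag for tag in tags if tag]
    # An empty value is treated like "und" (unknown language).
    return tags or ["und"]


def primary_language(value: str) -> str:
    """For software that can only handle one language: take the first tag."""
    return parse_priority_list(value)[0]


def language_rank(value: str, query_lang: str) -> Optional[int]:
    """Position of the query language in the list, or None if it is absent.

    Simplification: "fr" matches "fr" and any tag starting with "fr-".
    """
    query = query_lang.strip().lower()
    for position, tag in enumerate(parse_priority_list(value)):
        if tag == query or tag.startswith(query + "-"):
            return position
    return None
```

A search engine handling a query in "fr" could then treat a document
declared as "de, fr" (rank 1) differently from one declared as just "de"
(no rank at all), while single-language software would simply use
primary_language and get "de" in both cases.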


> And given that most XML applications (e.g. XSLT) have difficulties
> to handle even simple language information correctly, it doesn't
> seem a good idea to bother applications with something more
> complicated.



>
> Regards,    Martin.
>
>
> #-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
> #-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
>
>
Received on Monday, 25 August 2008 16:03:19 GMT
