RE: meta content-language from Ian Hickson on 2008-08-21 (public-html@w3.org from August 2008)

From: Ian Hickson <ian@hixie.ch>
Date: Thu, 21 Aug 2008 01:36:34 +0000 (UTC)
Cc: 'HTML WG' <public-html@w3.org>
Message-ID: <Pine.LNX.4.62.0808210046270.14795@hixie.dreamhostps.com>
(To cut down on cross-posting, I haven't cc'ed public-i18n-core@w3.org and 
www-international@w3.org, to which some of the messages included in this 
reply were sent.)

On Thu, 14 Aug 2008, Richard Ishida wrote:
>
> In the I18n Activity we considered the alternative ways of declaring 
> language for a long time, and the result of our thinking is summed up at
> 
> http://www.w3.org/International/questions/qa-http-and-lang
> 
> http://www.w3.org/TR/i18n-html-tech-lang/#ri20050208.095812479

Cool, thanks for those links. They are quite thorough.

The conclusion seems to be that there's not really any reason not to 
support Content-Language, though it may not be the best thing for authors 
to do.


> I would recommend that we keep the language attributes for declaring the 
> default language of the content (the text-processing language) and not 
> muddy the waters by using meta Content-Language declarations to fulfill a 
> similar role, because:
>
> 1. the acceptable values are different and the meta approach is 
> incompatible with declaring the text-processing language

We can change what the acceptable values are, so that's not a big deal. 
Indeed HTML5 says that it must just be a language code, same as lang="".


> 2. the meta approach is really not used by anything according to the 
> tests I did

The reason we added it at all was that a lot of people seem to be using 
it; it was one of the most common conformance errors according to the 
HTML5 validator logs.


> 3. the question of inheritance is unclear when using the meta statement 
> for declaring the text-processing language

How do you mean?


> If the meta statement continues to be allowed, I suggest that it is used 
> in the same way as a Content-Language declaration in the HTTP header, 
> ie. as metadata about the document as a whole, but that such usage is 
> kept separate from use for defining the language of a range of content.

I don't understand what you mean here. If it's used as metadata for the 
language of the document, how can it not affect the language of the 
document?

Right now, as specified, the Content-Language pragma in HTML5 sets the 
default language, much like setting lang="" on the <html> element.


> As far as I can tell, although Frontpage uses it and people on the Web 
> recommend its use, it has no effect at all on content, and wouldn't be 
> missed if it were dropped.

It appears to be used by at least some browsers for the processing of the 
:lang() pseudo-class. I don't see why we would want to stop that.


> I also think that we should avoid introducing the Content-Language 
> pragma as yet another way of declaring the default text-processing 
> language of the document since [a] it's already complicated enough to 
> explain to authors how to set up language information, [b] Google 
> surveys show that over recent years people have begun to use <html 
> lang=... for this (as we've been recommending), and [c] it's unnecessary 
> duplication.

It doesn't have to be duplication. People need only use one.

The alternative as I see it is to disallow it. The problem there is that 
people will get error messages in conformance checkers, and those error 
messages aren't very useful.


> Also, the Content language selection algorithm in 4.2.5.3 makes no 
> mention of <html lang=.. as a way of identifying the default language, 
> which it actually does if it is present, since it has higher priority 
> than HTTP metadata.

<html lang=""> only sets the language for the content of the <html> 
element, it doesn't set the language for, e.g., comment nodes outside the 
<html> element. See the definition of lang="" in HTML5 for details.


On Thu, 14 Aug 2008, Phillips, Addison wrote:
> 
> I concur with Richard's analysis. I further think that, while it is a 
> little complicated to explain the difference between content metadata 
> and text processing language, we should be very careful to avoid 
> implying that <meta> or a pragma at all affect how a document is 
> processed. I'd even go further and suggest that it should not affect the 
> text processing language. The text processing language is a job for 
> lang/xml:lang.

So you would disallow the Content-Language pragma?


> Frontpage's checking of the format of the content-language header is a 
> Good Thing but shouldn't imply anything about how that document later is 
> processed.

So Frontpage is _wrong_ to use this value to control what spell-checker to 
use? If so, what use is the value at all?


> In looking at 4.2.5.3, I also notice some problems:
> 
> 1. The content language collecting algorithm doesn't deal with the fact 
> that there can be multiple languages in the <meta> tag. It is written in 
> a way that jams them together. Thus the value "fr, de" becomes the 
> language "frde", which is not the intention.

No, it treats "fr,de" as "fr" and ignores the "de". See the definition of 
"Collect a sequence of characters" for more precise details.
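
As a rough illustration of that behaviour, here is a minimal Python sketch. 
The function name and the exact set of whitespace characters are 
assumptions made for the example, not text from the spec; the point is 
only that collection stops at the first comma or whitespace character, so 
nothing gets jammed together.

```python
# Whitespace set assumed for this sketch (modelled on HTML "space characters").
SPACE_CHARACTERS = " \t\n\f\r"


def first_language(content: str) -> str:
    """Return the first language tag found in a pragma's content value."""
    position = 0
    # Skip any leading whitespace.
    while position < len(content) and content[position] in SPACE_CHARACTERS:
        position += 1
    start = position
    # Collect characters until whitespace, a comma, or the end of the input.
    while (
        position < len(content)
        and content[position] not in SPACE_CHARACTERS
        and content[position] != ","
    ):
        position += 1
    return content[start:position]


print(first_language("fr,de"))   # fr
print(first_language("fr, de"))  # fr
```

With this reading, "fr, de" yields "fr" and the "de" is simply never 
looked at.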


> Since <meta> can and should be permitted to contain a list of languages, 
> an algorithm that sets the content language to only the first language 
> is inappropriate.

Right now, the HTML5 spec says that only one language is allowed. What 
should the processing be if multiple languages are provided? It doesn't 
appear that people use it to provide multiple languages, they only provide one.


> 2. The reference is to "RFC 3066", which is obsolete. Please reference 
> BCP 47. I know you mean to permit 
> formerly-illegal-but-syntactically-permitted tags from the 3066 era, but 
> BCP 47 provides a production for that purpose (obs-language). And 
> illegal tags are, well, illegal tags. We shouldn't encourage them.

Actually the only reason we reference 3066 and not BCP47 is that I haven't 
gone through and updated the references yet. BCP47 didn't exist when we 
started working on HTML5. I'm doing the reference updates in about a year.


On Sat, 16 Aug 2008, Leif Halvard Silli wrote:
> 
> There is currently nothing, I think, which prevents authors from making 
> the META content-language element appear several times in the same 
> <head> element. Did the algorithm take this into account?

Only the first is used. The others are non-conforming.


> Take for example Firefox. If you add these two elements,
> 
> <meta http-equiv=Content-Language content="en,ru,uk,nn,nb,sv,el" >
> <meta http-equiv=Content-Language content="de">
> 
> then Firefox, in its page-info function, will list both all the 
> languages of the first meta element, as well as the 'de' in the second 
> element as content languages. Whereas if you try to apply both 
> :lang(en) and :lang(de), then only :lang(de) will work. The same goes 
> for WebKit.

Per the HTML5 spec, only :lang(en) should work.
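
To make that concrete, here is a small Python sketch using the standard 
library's html.parser. The class and function names are hypothetical, and 
the cut-off at the first comma or whitespace character is an assumption 
based on the "fr,de" case discussed earlier in this message. It keeps only 
the first Content-Language pragma it sees and then only the first language 
in that pragma's value.

```python
import re
from html.parser import HTMLParser


class FirstContentLanguage(HTMLParser):
    """Remember the content of the first Content-Language <meta> pragma."""

    def __init__(self):
        super().__init__()
        self.content = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.content is not None:
            return  # not a <meta>, or a later pragma that is ignored
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() == "content-language":
            self.content = attributes.get("content") or ""


def pragma_language(markup):
    """Return the language set by the pragma, or None if there is no pragma."""
    parser = FirstContentLanguage()
    parser.feed(markup)
    parser.close()
    if parser.content is None:
        return None
    # Assumption: skip leading whitespace, then stop at a comma or whitespace.
    return re.match(r"[ \t\n\f\r]*([^ \t\n\f\r,]*)", parser.content).group(1)


markup = (
    '<meta http-equiv=Content-Language content="en,ru,uk,nn,nb,sv,el" >\n'
    '<meta http-equiv=Content-Language content="de">'
)
print(pragma_language(markup))  # en
```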

Note that WebKit will, for this:

   <meta http-equiv=Content-Language content="de,pl">

...only make this match:

   :lang(de\,pl) { ... }

...which is also wrong according to HTML5.

In Firefox, given this:

   <meta http-equiv=Content-Language content="en">
   <meta http-equiv=Content-Language content="de,pl">

...the following rules will all match:

   :lang(de) { ... }
   :lang(pl) { ... }
   :not(:lang(en)) { ... }
   :lang(de):lang(pl) { ... }

...which implies that Firefox is checking to see if nodes match the 
language rather than whether the language matches the selector.
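
The difference between the two approaches can be sketched in Python. All 
of the names below are made up for illustration, and the second function 
is only a guess at a model that is consistent with the results above, not 
Firefox's actual code.

```python
def lang_matches(selector_lang, element_lang):
    """Prefix matching as used by :lang(): 'de' matches 'de' and 'de-AT'."""
    selector_lang = selector_lang.lower()
    element_lang = element_lang.lower()
    return element_lang == selector_lang or element_lang.startswith(
        selector_lang + "-"
    )


def matches_single_language(selector_lang, element_lang):
    # The element has exactly one language; test it against the selector.
    return lang_matches(selector_lang, element_lang)


def matches_any_listed(selector_lang, pragma_content):
    # Hypothetical model: test the selector against every listed tag.
    tags = [tag.strip() for tag in pragma_content.split(",") if tag.strip()]
    return any(lang_matches(selector_lang, tag) for tag in tags)


# One language per element ("en", from the first pragma):
print(matches_single_language("en", "en"))  # True
print(matches_single_language("de", "en"))  # False

# Every tag listed in "de,pl" is tested against the selector:
print(matches_any_listed("de", "de,pl"))  # True
print(matches_any_listed("pl", "de,pl"))  # True
print(matches_any_listed("en", "de,pl"))  # False
```

Under the second model an element can satisfy :lang(de) and :lang(pl) at 
the same time, which is how :lang(de):lang(pl) ends up matching.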

These are very unusual cases and I don't propose to change the spec to 
match the browsers in this case. I think what the spec says is probably 
better. (Of course maybe we'll have to change it when we get to candidate 
recommendation and test this with a test suite, if the browsers refuse to 
change to match.)


On Fri, 15 Aug 2008, Henri Sivonen wrote:
> 
> Of course, if the data is *wrong* significantly more often than lang='' 
> (assuming that the correctness level of lang='' establishes an implicit 
> data quality baseline), it would be good to ignore it. My guess is that 
> HTTP-level Content-Language is more likely to be wrong (it sure is less 
> obvious to diagnose) than any HTML-level declaration. (Due to Ruby's 
> Postulate: http://intertwingly.net/slides/2004/devcon/68.html )

Note that none of this applies to the HTTP header, I'm only talking about 
the <meta> pragma here.


On Fri, 15 Aug 2008, Phillips, Addison wrote:
> > 
> > The spec could make multiple language tags in Content-Language non- 
> > conforming and could make processing pick the first language tag.
> 
> In addition to being incompatible with existing Web content

Which Web content? Could you show us pages that actually use multiple 
language tags in the Content-Language <meta> pragma?


> I really don't see why we need to change the Content-Language meta tag 
> from indicating the target audience to indicating the processing 
> language. Since browsers don't make use of this information today for 
> processing the text, we'd be better to make existing practice formalized 
> than to change semantics.

They do make use of it.


> We don't have to ignore it. We can use that data for its most useful 
> purpose, which is as metadata about the author's intentions [...].

How? I've no idea what this means.


> I would add: having an over-arching "default text processing language" 
> above the <html> element would probably create additional problems for 
> implementation of CSS :lang pseudo-attribute, etc., that do language 
> selection in documents by having something outside the parse tree affect 
> the value of the (implied) xml:lang/html lang.

Why? It seems that implementations don't have any major problems here.


> There are many uses for finding out the author's intended audience. A 
> document, for example, might be mostly in Japanese although it serves an 
> English-speaking audience. For example, it might be examples of Japanese 
> writing with short descriptions in English. Other documents might be 
> side-by-side (parallel) translations. The text processing language in 
> these cases will follow specific spans of text; the audience, however, 
> might not be one of the two streams of text.

How would such metadata be actually practically used?


> Another use would be with language negotiation. The text processing 
> language isn't as interesting as the author's intended audience in this 
> case. A server might implement BCP 47's Lookup or Filtering algorithms 
> against a user's Accept-Language to select content. Having the author's 
> intended audience(s) in a Content-Language <meta> tag would facilitate 
> that more readily.

That seems highly hypothetical.


On Fri, 15 Aug 2008, Richard Ishida wrote:
> 
> Note that multiple language tags in Content-Language in the HTTP header 
> are a perfectly fine way to say "This is a document for people who read 
> both English and French" for example, ie. meta information about the 
> document itself.

Yes, nothing here is attempting to change HTTP semantics. We're only 
talking about the <meta> pragma. In HTML5, the <meta http-equiv> values 
are totally decoupled from HTTP.


> The Content-Language in the meta tag is modeled on the HTTP header 
> information, so changing it would in my opinion cause authors to get 
> even more confused than they already are.

That's certainly possible. We could just make it non-conforming, which 
would side-step this, but then it would raise new errors in the validator 
and that would probably be as much pain as the confusion or making the 
header not do anything useful.


> As it stands, it could be used as in-document metadata for occasions 
> when the HTTP protocol is not available. I don't think we should close 
> the door on that possibility, especially if an alternative and more 
> effective and more widely used method of declaring the default language 
> for processing the document exists (namely the lang attribute on the 
> html tag).

I don't really buy that the HTTP header is solving a real problem, so I 
don't think we need to address it in HTML. So I don't really buy the 
argument above.


On Mon, 18 Aug 2008, Richard Ishida wrote:
> 
> All this contributes to my feeling that the Content-Language information 
> is a different kind of beast to the language attribute.  I haven't 
> objected to the default language being set to the value of the 
> Content-Language in the absence of a language attribute, but in some of 
> the above cases to do so may actually introduce errors.

I certainly agree that the original definition of Content-Language's HTTP 
header isn't compatible with what we've done here. The question is, is it 
compatible with how people are using it? That's what really matters.


On Tue, 19 Aug 2008, Phillips, Addison wrote:
> 
> In fact, I'm not aware of any software which uses <meta> for page 
> processing or content selection. Hence my general tendency not to want 
> to add it to the determination of page processing language in HTML5: it 
> has the potential to alter how existing pages are processed to little 
> benefit.

Well, the benefit is that it has the potential to make the pages work 
better when people are using it instead of lang="", which, as I understand 
it, it appears people are.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Thursday, 21 August 2008 01:37:00 UTC