
RE: ISSUE-88 / Re: what's the language of a document ?

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Fri, 12 Mar 2010 22:43:47 +0100
To: "Phillips, Addison" <addison@amazon.com>
Cc: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>, CE Whitehead <cewcathar@hotmail.com>, "www-international@w3.org" <www-international@w3.org>, "public-html@w3.org" <public-html@w3.org>, "ishida@w3.org" <ishida@w3.org>, "ian@hixie.ch" <ian@hixie.ch>
Message-ID: <20100312224347286845.c72d8548@xn--mlform-iua.no>
Phillips, Addison, Fri, 12 Mar 2010 13:44:00 -0500:
> (personal response)
>>> I like Leif's solution--to use the first language specified in http
>>> as the text processing language when none is specified in the html
>>> tag.
>> Perhaps this is something you could live with as well, Ian?
>> Otherwise, if at least one more person agrees, then I will formally
>> write a change proposal which permits 'http-equiv="Content-
>> Language"'
>> to contain more than one language, but only when the root element
>> uses the @lang attribute.
> I think that creating a reliance on the contents of the @lang 
> attribute in order to determine what is allowed in a <meta> tag is 
> misguided.

When we specify the encoding of a document with the <meta> element,
there is a relationship between what the <meta> element says and the
actual encoding of the document. Validator.nu rejects a document in
which the wrong encoding is specified. Whether you then change the
encoding or the <meta> element is of course up to you.

> Ideally *all* documents will populate the root element 
> with an appropriate @lang attribute (although, please note, that 
> there exist cases in which an empty attribute *is* the appropriate 
> value).

There is a difference between an empty attribute and no attribute. But 
only in XML: [1]

	]]XML also provides a means to prevent inheritance of language using 
the empty string, ie. xml:lang="". Essentially, this says: I do not 
want to associate any language with this information.[[

But not in HTML. If we wanted to have such a feature in HTML5, then, in 
order to work reliably, validators would have to warn against using 
'http-equiv="Content-Language"' at all, whenever the root element 
contained an empty lang="" attribute. Because, as it is, if we have a 
document like this:

<!DOCTYPE html><html lang=""><head>
<meta http-equiv="Content-Language" content="nn"/>
<body><p>This is English</p></html>

Then at least Firefox and Safari will treat the root element as if no
language has been declared, and thus apply the language inside <meta>
C-L as a fallback solution. I have a Live DOM Viewer test that you can
use to check this. [2] Internet Explorer 8, by contrast, will use the
XML behaviour.
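
A minimal sketch of the two behaviours described above, assuming a
single language tag inside <meta> C-L. The function name and the
`empty_means_unknown` flag are hypothetical, and the assumption that an
XML-style user agent still falls back to <meta> C-L when the attribute
is absent altogether is mine, not something the test shows:

```python
from typing import Optional


def fallback_language(root_lang: Optional[str],
                      meta_cl: Optional[str],
                      empty_means_unknown: bool) -> Optional[str]:
    """Return the language a user agent would apply to the root element.

    root_lang is None when the lang attribute is absent and "" when it
    is present but empty. empty_means_unknown=True models the XML-style
    behaviour, False models lang="" being treated like no attribute.
    """
    if root_lang:
        # A non-empty lang attribute always wins.
        return root_lang
    if root_lang == "" and empty_means_unknown:
        # XML-style: the author explicitly declared "no language".
        return None
    # Absent attribute (or empty attribute treated as absent):
    # fall back to the language given in <meta> C-L, if any.
    return meta_cl


# The example document above: lang="" plus <meta> C-L "nn".
print(fallback_language("", "nn", empty_means_unknown=False))
print(fallback_language("", "nn", empty_means_unknown=True))
```

Under these assumptions the first call models the Firefox/Safari result
and the second models the XML behaviour.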

> However, the presence, absence, or value of @lang really 
> should have no bearing on how many language tags the <meta> 
> Content-Language element can contain. Authors should be able to know 
> if the element is valid or not based simply on its own content. The 
> format and rules for <meta> C-L are well-established, clear, and not 
> difficult to understand.

It is not very difficult to understand the issues that I described 
above either. My proposal doesn't affect the format and rules for 
<meta> C-L. It only takes its side effects into account.

> RFC 2616 was published in 1999 and it is 
> extraordinarily clear in this regard. HTML, since at least HTML2, has 
> clearly relied on the definition of HTTP headers for http-equiv 
> <meta> elements (because that's what those various specs explicitly 
> say).

I do not argue against HTTP. Not even against <meta> C-L. I suggest a 
compromise based on how user agents actually behave when they see it.

As to the clearness: HTML4 is 100% clear that it is the HTTP header
that has highest priority - higher than any <meta> value. And this is
also how user agents behave when it comes to encoding: If the HTTP
header disagrees with the <meta> specified encoding, then the HTTP
header from the server "wins" over the encoding specified inside the
<meta> element.

But when it comes to language inheritance, and in that regard the use
of <meta> C-L as a back-up method for obtaining the language, all
User Agents that use this back-up method give priority to the <meta>
C-L over the Content-Language header that comes from the server. (If
they consider the header from the server at all.) That is completely
upside down - and against the HTTP spec, I presume. Yet I have not heard
anyone from the I18N WG describing this as a problem. For instance, it
is not commented on in the tests that I pointed to. [3]
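
To make the contrast concrete, here is a small sketch of the two
precedence orders as described above. Both function names are
hypothetical, and the code only models the described behaviour, it is
not taken from any browser:

```python
from typing import Optional


def pick_encoding(http_charset: Optional[str],
                  meta_charset: Optional[str]) -> Optional[str]:
    """Encoding: the HTTP header wins over the <meta> declaration."""
    return http_charset or meta_charset


def pick_fallback_language(http_cl: Optional[str],
                           meta_cl: Optional[str],
                           consults_http: bool = True) -> Optional[str]:
    """Language fallback: <meta> C-L wins over the HTTP header.

    consults_http=False models a user agent that never looks at the
    Content-Language header sent by the server.
    """
    if meta_cl:
        return meta_cl
    return http_cl if consults_http else None


# Server says one thing, the document says another.
print(pick_encoding("utf-8", "iso-8859-1"))
print(pick_fallback_language("en", "nn"))
```

With conflicting declarations, the first call returns the server's
value and the second returns the document's value, which is the
reversal described above.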

> I think that a definition of "compatibility" for HTML markup should 
> be based on the specified syntax and not just the interpretation of 
> that markup by a given implementation. HTML5 should provide (as it 
> does well in so many cases) for well-defined interpretation of both 
> valid and invalid markup. So...
> Since the purpose is to populate HTTP headers or serve content 
> management, server-side, or authoring needs; user-agents should 
> mostly, in my opinion, respect existing valid markup, mostly by 
> ignoring it. 

Mostly? There is not much they could use it for, other than what they 
do use it for.

> I think it would be appropriate to include text allowing the
> user-agent to infer the value of @lang from a properly formed <meta> 
> tag in cases in which @lang is not present or empty. In that case, I 
> would expect that the first language in the <meta> tag would be 
> assigned as an inferred value (assuming, for a moment, that this 
> value is "well-formed" according to BCP 47 or at least that the tags 
> meet the requirements of the "obs-language-tag" production in BCP 
> 47---otherwise the value should be ignored).

(I think we eventually should discuss how UAs should react to 
non-well-formed language tags etc in another thread.)

> Additional language tags 
> can be trimmed off using a well-placed call to strtok (for example). 
> This is what our working group's Change Proposal proposes.

That the language tags inside the <meta> C-L have to be given in a
particular order is not based on the HTTP specification from 1999.
Such a requirement is ultimately based on the User Agent behaviour that
you prefer/have observed. And hence, the justification is no different
from mine.
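
For reference, the "take the first tag, trim the rest" step from the
quoted proposal could be sketched like this. It is a rough Python
equivalent of the strtok idea, assuming a comma-separated value; the
function name is hypothetical and no BCP 47 well-formedness check is
attempted:

```python
from typing import Optional


def first_language_tag(content: str) -> Optional[str]:
    """Return the first non-empty tag of a Content-Language value."""
    for part in content.split(","):
        tag = part.strip()
        if tag:
            # Like strtok, skip empty tokens and stop at the first real one.
            return tag
    return None


print(first_language_tag("nn, en, de"))
```

Note that nothing in this step can tell whether the author really put
the document's language first.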

> I tend to agree with Hixie that there is not remarkable utility in 
> this element for the user-agent. This being metadata, the utility is 
> for external processes, as Richard, Roy, Martin, and others have also 
> said.


> Implementations exist that use and rely on the specific 
> multi-language syntax of that tag (even if they aren't browsers). The 
> fact that browser implementations handle this element in a shoddy 
> fashion is neither helpful nor very harmful but should not be the 
> guiding factor in deciding on the syntax of HTML5 documents. There is 
> no reason to break others just for bug compatibility with existing 
> browsers... especially since the overall effect of assigning @lang 
> from C-L or just plain ignoring <meta> C-L is almost non-existent.

Your proposal says that the first language tag inside <meta> C-L
should represent the language of the document. How should one test for
this - e.g. in a validator?

	Firstly: Whenever the root element does have the lang attribute, with a
valid language tag inside, then this rule would not matter. (Unless you
would suggest that validators should check that the first language tag
inside <meta> C-L corresponds with the language tag on the root
element - what for?)

	Secondly: If someone really makes sure - for themselves or by testing
in browsers which support this behavior (today there are none, I
believe) - that the language tag which represents the language of the
document appears first, then why couldn't he/she instead insert a
correct lang attribute?

There are numerous problems with multiple language tags inside <meta>
C-L when it comes to how user agents interpret them. In addition to
what is listed above: Firefox looks for *any* language tag inside. When
it comes to CSS, it is the last CSS rule - not the first [or last or
middle or whatever] language tag inside <meta> C-L - that counts.
So, a document can, as Firefox sees it, have multiple languages ...
(Could be a useful CSS hack, of course ...)

Btw, a question: HTML5 accepts multiple <meta> C-L elements in the same 
document. Are multiple <META> C-L elements meaningful from the HTTP 
specification’s point of view?

To clarify: I disagree with the current HTML5 spec which defines <meta> 
C-L as obsolete. 

[1] http://www.w3.org/International/articles/language-tags/#overview

[2] http://software.hixie.ch/utilities/js/live-dom-viewer/saved/400


leif halvard silli
Received on Friday, 12 March 2010 21:44:29 UTC
