RE: ISSUE-88 / Re: what's the language of a document ?

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Fri, 12 Mar 2010 22:43:47 +0100
To: "Phillips, Addison" <addison@amazon.com>
Cc: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>, CE Whitehead <cewcathar@hotmail.com>, "www-international@w3.org" <www-international@w3.org>, "public-html@w3.org" <public-html@w3.org>, "ishida@w3.org" <ishida@w3.org>, "ian@hixie.ch" <ian@hixie.ch>
Message-ID: <20100312224347286845.c72d8548@xn--mlform-iua.no>
Phillips, Addison, Fri, 12 Mar 2010 13:44:00 -0500:
> (personal response)
>> 
>>> I like Leif's solution--to use the first language specified in http
>>> as the text processing language when none is specified in the html
>>> tag.
>> 
>> Perhaps this is something you could live with as well, Ian?
>> 
>> Otherwise, if at least one more person agrees, then I will formally
>> write a change proposal which permits 'http-equiv="Content-
>> Language"'
>> to contain more than one language, but only when the root element
>> uses the @lang attribute.
> 
> I think that creating a reliance on the contents of the @lang 
> attribute in order to determine what is allowed in a <meta> tag is 
> misguided.

When we specify the encoding of a document with the <meta> element, 
there is a relationship between what the <meta> element says and the 
actual encoding of the document. Validator.nu does not accept a 
document whose declared encoding is wrong. Whether you then change the 
encoding or the <meta> element is of course up to you.

> Ideally *all* documents will populate the root element 
> with an appropriate @lang attribute (although, please note, that 
> there exist cases in which an empty attribute *is* the appropriate 
> value).

There is a difference between an empty attribute and no attribute. But 
only in XML: [1]

	]]XML also provides a means to prevent inheritance of language using 
the empty string, ie. xml:lang="". Essentially, this says: I do not 
want to associate any language with this information.[[

But not in HTML. If we wanted to have such a feature in HTML5, then, in 
order to work reliably, validators would have to warn against using 
'http-equiv="Content-Language"' at all, whenever the root element 
contained an empty lang="" attribute. Because, as it is, if we have a 
document like this:

<!DOCTYPE html><html lang=""><head>
<meta http-equiv="Content-Language" content="nn"/>
<style>p:lang(nn){background:red}</style>
<body><p>This is English</p></html>

Then at least Firefox and Safari will treat the root element as if 
nothing has been declared, and thus apply the language inside <meta> 
C-L as a fallback solution. I have a Live DOM Viewer test that you can 
use to check this. [2] Internet Explorer 8, by contrast, uses the XML 
behaviour.
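
To make the two behaviours concrete, here is a minimal Python sketch of 
the fallback as described above. The function name and the 
empty_means_unknown flag are hypothetical, chosen only for 
illustration; this is not code from any browser, and it assumes <meta> 
C-L holds a single language tag.

```python
from typing import Optional


def effective_language(root_lang: Optional[str],
                       meta_content_language: Optional[str],
                       empty_means_unknown: bool) -> str:
    """Return the language assigned to the root element ("" = none).

    root_lang: value of the lang attribute, or None if it is absent.
    meta_content_language: content of the <meta> C-L element, or None.
    empty_means_unknown: True models the XML-style reading, where
        lang="" stops any fallback; False models lang="" being treated
        like a missing attribute.
    """
    if root_lang:
        # A non-empty lang attribute always wins.
        return root_lang
    if root_lang == "" and empty_means_unknown:
        # XML-style: the author explicitly declared "no language".
        return ""
    # Attribute missing (or empty and treated as missing): fall back.
    return (meta_content_language or "").strip()
```

With the document above, effective_language("", "nn", False) gives 
"nn" (the paragraph turns red), while effective_language("", "nn", 
True) gives "" (it does not).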

> However, the presence, absence, or value of @lang really 
> should have no bearing on how many language tags the <meta> 
> Content-Language element can contain. Authors should be able to know 
> if the element is valid or not based simply on its own content. The 
> format and rules for <meta> C-L are well-established, clear, and not 
> difficult to understand.

It is not very difficult to understand the issues that I described 
above either. My proposal doesn't affect the format and rules for 
<meta> C-L. It only takes its side effects into account.

> RFC 2616 was published in 1999 and it is 
> extraordinarily clear in this regard. HTML, since at least HTML2, has 
> clearly relied on the definition of HTTP headers for http-equiv 
> <meta> elements (because that's what those various specs explicitly 
> say).

I do not argue against HTTP. Not even against <meta> C-L. I suggest a 
compromise based on how user agents actually behave when they see it.

As to the clarity: HTML4 is 100% clear that it is the HTTP header 
that has highest priority - higher than any <meta> value. And this is 
also how user agents behave when it comes to encoding: If the HTTP 
header disagrees with the <meta> specified encoding, then the HTTP 
header from the server "wins" over the encoding specified inside the 
document.

But when it comes to language inheritance, and in that regard the use 
of <meta> C-L as a back-up method for obtaining the language, all 
user agents that use this back-up method give priority to the <meta> 
C-L over the Content-Language header that comes from the server. (If 
they consider the header from the server at all.) That is completely 
upside down - and against the HTTP spec, I presume. Yet I have not 
heard anyone from the I18N WG describe this as a problem. For 
instance, it is not commented on in the tests that I pointed to. [3]

> I think that a definition of "compatibility" for HTML markup should 
> be based on the specified syntax and not just the interpretation of 
> that markup by a given implementation. HTML5 should provide (as it 
> does well in so many cases) for well-defined interpretation of both 
> valid and invalid markup. So...
> 
> Since the purpose is to populate HTTP headers or serve content 
> management, server-side, or authoring needs; user-agents should 
> mostly, in my opinion, respect existing valid markup, mostly by 
> ignoring it. 

Mostly? There is not much they could use it for, other than what they 
do use it for.
 
> I think it would be appropriate to include text allowing the 
> user-agent to infer the value of @lang from a properly formed <meta> 
> tag in cases in which @lang is not present or empty. In that case, I 
> would expect that the first language in the <meta> tag would be 
> assigned as an inferred value (assuming, for a moment, that this 
> value is "well-formed" according to BCP 47 or at least that the tags 
> meet the requirements of the "obs-language-tag" production in BCP 
> 47---otherwise the value should be ignored).

(I think we eventually should discuss how UAs should react to 
non-well-formed language tags etc in another thread.)

> Additional language tags 
> can be trimmed off using a well-placed call to strtok (for example). 
> This is what our working group's Change Proposal proposes.

The requirement that the language tags inside <meta> C-L be given in a 
particular order is not based on the HTTP specification from 1999. 
Such a requirement is ultimately based on the user agent behaviour that 
you prefer/have observed. And hence, the justification is no different 
from mine.
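
For reference, the trimming step suggested in the quoted text (take 
the first language tag, drop the rest) could look roughly like the 
Python sketch below. The function name is hypothetical, and the 
pattern is only a loose shape check, not a BCP 47 validator. Ignoring 
the whole value when the first tag is malformed is one possible reading 
of "otherwise the value should be ignored", and is an assumption here.

```python
import re
from typing import Optional

# Loose shape check only (assumption): hyphen-separated subtags of
# 1 to 8 alphanumeric characters, the first one alphabetic.
_TAG_SHAPE = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def first_language_tag(content: str) -> Optional[str]:
    """Return the first tag of a comma-separated Content-Language value.

    Returns None when there is no tag, or when the first tag does not
    pass the loose shape check.
    """
    for item in content.split(","):
        tag = item.strip()
        if not tag:
            continue  # skip empty list items such as ", en"
        return tag if _TAG_SHAPE.fullmatch(tag) else None
    return None
```

For example, first_language_tag("nn, en, de") returns "nn".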

> I tend to agree with Hixie that there is not remarkable utility in 
> this element for the user-agent. This being metadata, the utility is 
> for external processes, as Richard, Roy, Martin, and others have also 
> said.

Indeed.

> Implementations exist that use and rely on the specific 
> multi-language syntax of that tag (even if they aren't browsers). The 
> fact that browser implementations handle this element in a shoddy 
> fashion is neither helpful nor very harmful but should not be the 
> guiding factor in deciding on the syntax of HTML5 documents. There is 
> no reason to break others just for bug compatibility with existing 
> browsers... especially since the overall effect of assigning @lang 
> from C-L or just plain ignoring <meta> C-L is almost non-existent.

Your proposal says that the first language tag inside <meta> C-L 
should represent the language of the document. How should one test for 
this - e.g. in a validator? 

	Firstly: Whenever the root element does have the lang attribute, with 
a valid language tag inside, then this rule would not matter. (Unless 
you would suggest that validators should check that the first language 
tag inside <meta> C-L corresponds with the language tag on the root 
element - what for?)

	Secondly: If someone really makes sure - for themselves or by testing 
in browsers which support this behavior (today there are none, I 
believe) - that the language tag which represents the language of the 
document appears first, then why couldn't he/she instead insert a 
correct lang attribute?

There are numerous problems with multiple language tags inside <meta> 
C-L when it comes to how user agents interpret them. In addition to 
what is listed above: Firefox looks for *any* language tag inside. When 
it comes to CSS, the last matching CSS rule - not the first [or last 
or middle or whatever] language tag inside <meta> C-L - is what counts. 
So, a document can, as Firefox sees it, have multiple languages ... 
(Could be a useful CSS hack, of course ...)
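
The "any language tag" behaviour can be modelled with the small Python 
sketch below. It is only an illustration of the observation above, not 
Firefox code: the function name is hypothetical, and the prefix 
comparison is a simplified stand-in for how :lang() matches language 
ranges.

```python
def lang_selector_matches(content_language: str, selector_lang: str) -> bool:
    """Return True if ANY tag in the comma-separated value matches.

    A tag matches when it equals the selector language or starts with
    it followed by a hyphen (so "en-GB" matches "en").
    """
    wanted = selector_lang.strip().lower()
    if not wanted:
        return False
    for item in content_language.split(","):
        tag = item.strip().lower()
        if tag == wanted or tag.startswith(wanted + "-"):
            return True
    return False
```

Under this model, with content="nn, en" both p:lang(nn) and p:lang(en) 
match the same paragraph, so whichever rule comes last in the style 
sheet decides the outcome.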

Btw, a question: HTML5 accepts multiple <meta> C-L elements in the same 
document. Are multiple <META> C-L elements meaningful from the HTTP 
specification’s point of view?

To clarify: I disagree with the current HTML5 spec which defines <meta> 
C-L as obsolete. 

[1] http://www.w3.org/International/articles/language-tags/#overview

[2] http://software.hixie.ch/utilities/js/live-dom-viewer/saved/400

[3] 
http://www.w3.org/International/tests/tests-html-css/tests-language-declarations/results-language-declarations#results

-- 
leif halvard silli
Received on Friday, 12 March 2010 21:44:24 UTC
