RE: ISSUE-88 / Re: what's the language of a document ? from Phillips, Addison on 2010-03-12 (public-html@w3.org from March 2010)

From: Phillips, Addison <addison@amazon.com>
Date: Fri, 12 Mar 2010 13:44:00 -0500
To: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>, CE Whitehead <cewcathar@hotmail.com>
CC: "www-international@w3.org" <www-international@w3.org>, "public-html@w3.org" <public-html@w3.org>, "ishida@w3.org" <ishida@w3.org>, "ian@hixie.ch" <ian@hixie.ch>
Message-ID: <C7A5719F1E562149BA9171F58BEE2CA4129A74264A@EX-IAD6-B.ant.amazon.com>

(personal response)
> 
> > I like Leif's solution--to use the first language specified in http
> > as the text processing language when none is specified in the html
> > tag.
> 
> Perhaps this is something you could live with as well, Ian?
> 
> Otherwise, if at least one more person agrees, then I will formally
> write a change proposal which permits 'http-equiv="Content-Language"'
> to contain more than one language, but only when the root element
> uses the @lang attribute.

I think that making what is allowed in a <meta> tag depend on the contents of the @lang attribute is misguided. Ideally *all* documents will populate the root element with an appropriate @lang attribute (although, please note, there are cases in which an empty attribute *is* the appropriate value). However, the presence, absence, or value of @lang really should have no bearing on how many language tags the <meta> Content-Language element can contain. Authors should be able to know whether the element is valid based simply on its own content.

The format and rules for <meta> C-L are well-established, clear, and not difficult to understand. RFC 2616 was published in 1999 and it is extraordinarily clear in this regard. HTML, since at least HTML2, has clearly relied on the definition of HTTP headers for http-equiv <meta> elements (because that's what those various specs explicitly say).
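
To illustrate what "valid based simply on its own content" could look like, here is a minimal sketch in Python. The function name is hypothetical, and the tag pattern is my reading of the language-tag grammar in RFC 2616 section 3.10 together with the comma-separated "1#" list rule; treat it as an assumption, not as text from any spec or Change Proposal. Note that nothing about @lang is passed in.

```python
import re

# Assumed tag shape, after RFC 2616 section 3.10: 1-8 letters,
# followed by zero or more "-" plus 1-8 letter subtags.
_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*")


def content_language_is_valid(content: str) -> bool:
    """Check a <meta http-equiv="Content-Language"> content value on its own.

    Hypothetical helper: the root element's lang attribute is never
    consulted, only the string found in the content attribute.
    """
    # Split the comma-separated list and drop surrounding whitespace.
    items = [item.strip() for item in content.split(",")]
    # The HTTP list rule tolerates empty elements; they simply do not count.
    tags = [item for item in items if item]
    # At least one tag is required, and every tag must have the right shape.
    return bool(tags) and all(_TAG.fullmatch(tag) for tag in tags)
```

With this sketch, "en" and "en, fr, de" are both accepted, while an empty string or "en fr" (no comma) is rejected, whatever the root element says.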

I think that a definition of "compatibility" for HTML markup should be based on the specified syntax and not just the interpretation of that markup by a given implementation. HTML5 should provide (as it does well in so many cases) for well-defined interpretation of both valid and invalid markup. So...

Since the purpose of this element is to populate HTTP headers or to serve content-management, server-side, or authoring needs, user-agents should, in my opinion, respect existing valid markup, mostly by ignoring it.

I think it would be appropriate to include text allowing the user-agent to infer the value of @lang from a properly formed <meta> tag in cases in which @lang is not present or empty. In that case, I would expect that the first language in the <meta> tag would be assigned as an inferred value (assuming, for a moment, that this value is "well-formed" according to BCP 47 or at least that the tags meet the requirements of the "obs-language-tag" production in BCP 47---otherwise the value should be ignored). Additional language tags can be trimmed off using a well-placed call to strtok (for example). This is what our working group's Change Proposal proposes.
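
The inference described above can be sketched roughly as follows (Python here rather than C and strtok, purely for brevity). The function name and its parameters are hypothetical. The pattern approximates the "obs-language-tag" production (1-8 letters, then "-" plus 1-8 letters or digits, repeated). Checking only the first tag, and returning nothing when it fails, is one possible reading of "otherwise the value should be ignored".

```python
import re
from typing import Optional

# Approximation of the BCP 47 "obs-language-tag" production:
# primary subtag of 1-8 letters, then "-" + 1-8 alphanumerics, repeated.
_OBS_LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def infer_lang(root_lang: Optional[str], meta_content: Optional[str]) -> Optional[str]:
    """Return the language a user-agent might assume for the document.

    Hypothetical helper. root_lang is the root element's lang value
    (None when absent); meta_content is the content of
    <meta http-equiv="Content-Language"> (None when absent).
    """
    # A non-empty lang attribute always wins; <meta> is not consulted.
    if root_lang:
        return root_lang
    if not meta_content:
        return None
    # Keep the first list item and trim off any additional language tags.
    first = meta_content.split(",", 1)[0].strip()
    # Ignore the value entirely if the first tag is not well shaped.
    if _OBS_LANGUAGE_TAG.fullmatch(first):
        return first
    return None
```

For example, infer_lang(None, "de-CH, fr") yields "de-CH", while infer_lang("ja", "en") leaves the author's "ja" alone.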

I tend to agree with Hixie that this element has no remarkable utility for the user-agent. This being metadata, the utility is for external processes, as Richard, Roy, Martin, and others have also said. Implementations exist that use and rely on the specific multi-language syntax of that tag (even if they aren't browsers). The fact that browser implementations handle this element in a shoddy fashion is neither helpful nor very harmful, but it should not be the guiding factor in deciding on the syntax of HTML5 documents. There is no reason to break other implementations just for bug compatibility with existing browsers... especially since the overall effect of assigning @lang from C-L or just plain ignoring <meta> C-L is almost non-existent.

Addison

Addison Phillips
Globalization Architect -- Lab126
Chair -- W3C Internationalization WG

Internationalization is not a feature.
It is an architecture.

Received on Friday, 12 March 2010 18:44:37 UTC