RE: ISSUE-88 / Re: what's the language of a document ? from CE Whitehead on 2010-03-12 (public-html@w3.org from March 2010)

From: CE Whitehead <cewcathar@hotmail.com>
Date: Fri, 12 Mar 2010 15:18:01 -0500
To: <addison@amazon.com>, <xn--mlform-iua@xn--mlform-iua.no>
CC: <www-international@w3.org>, <public-html@w3.org>, <ishida@w3.org>, <ian@hixie.ch>
Message-ID: <SNT142-w631A7A3FCD5C4EC8971B8AB3310@phx.gbl>
Hi.
 
From: Phillips, Addison <addison@amazon.com> 
Date: Fri, 12 Mar 2010 13:44:00 -0500
> (personal response)
> 
>> > I like Leif's solution--to use the first language specified in http
>> > as the text processing language when none is specified in the html
>> > tag.
> 
>> Perhaps this is something you could live with as well, Ian?
> 
>> Otherwise, if at least one more person agrees, then I will formally
>> write a change proposal which permits 'http-equiv="Content-
>> Language"'
>> to contain more than one language, but only when the root element
>> uses the @lang attribute.
> I think that creating a reliance on the contents of the @lang attribute in order to determine what is allowed in a 
> <meta> tag is misguided. Ideally *all* documents will populate the root element with an appropriate @lang attribute
> (although, please note, that there exist cases in which an empty attribute *is* the appropriate value). However, the
> presence, absence, or value of @lang really should have no bearing on how many language tags the <meta> Content-Language element can contain. Authors should be able to know if the element is valid or not based simply on its own 
> content. The format and rules for <meta> C-L are well-established, clear, and not difficult to understand. RFC 2616 
> was published in 1999 and it is extraordinarily clear in this regard. HTML, since at least HTML2, has clearly relied on the definition of HTTP headers for http-equiv <meta> elements (because that's what those various specs explicitly say).
Agreed.
> I think that a definition of "compatibility" for HTML markup should be based on the specified syntax and not just the interpretation of that markup by a given implementation. HTML5 should provide (as it does well in so many cases) for well-defined interpretation of both valid and invalid markup. So...
> Since the purpose is to populate HTTP headers or serve content management, server-side, or authoring needs; user-
> agents should mostly, in my opinion, respect existing valid markup, mostly by ignoring it. 
> I think it would be appropriate to include text allowing the user-agent to infer the value of @lang from a properly 
> formed <meta> tag in cases in which @lang is not present or empty. In that case, I would expect that the first
> language in the <meta> tag would be assigned as an inferred value (assuming, for a moment, that this value is
> "well-formed" according to BCP 47 or at least that the tags meet the requirements of the "obs-language-tag" 
> production in BCP 47---otherwise the value should be ignored). Additional language tags can be trimmed off using a 
> well-placed call to strtok (for example). This is what our working group's Change Proposal proposes.
I thought this was Leif's proposal, though perhaps I am mistaken here. As I understood it, Leif meant
to read the first value of the HTTP header to infer the text-processing language when none is declared on the xml or html element,
and otherwise to consider two values o.k.
Thanks.
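
The inference discussed above (take the first language tag from Content-Language when @lang is absent or empty, and ignore it if it is not well-formed) could be sketched roughly as follows. This is only an illustration in Python, not text from any change proposal: the function name infer_lang is hypothetical, str.split stands in for the strtok call mentioned above, and the regular expression is a loose approximation of the "obs-language-tag" production, not a full BCP 47 validator.

```python
import re

# Loose approximation of BCP 47's "obs-language-tag":
# a 1-8 letter primary subtag, then optional 1-8 alphanumeric subtags.
# (Assumption: this is good enough for illustration; it is not a validator.)
_OBS_LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def infer_lang(html_lang, content_language):
    """Return the text-processing language for a document, or None.

    html_lang        -- value of @lang on the root element (None if absent)
    content_language -- value of the Content-Language <meta>/header
                        (None if absent); may list several languages
    """
    # An explicit, non-empty @lang always wins.
    if html_lang is not None and html_lang.strip():
        return html_lang.strip()

    if not content_language:
        return None

    # Keep only the first language tag; the rest are trimmed off.
    first = content_language.split(",")[0].strip()

    # Ignore the value entirely if the first tag is not well-formed.
    return first if _OBS_LANGUAGE_TAG.fullmatch(first) else None
```

With this sketch, a document with no @lang and "fr, en, de" in Content-Language would be processed as "fr", while the full three-language list remains valid and available as metadata to other tools.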
> I tend to agree with Hixie that there is not remarkable utility in this element for the user-agent.
Hmm
> This being metadata, the utility is for external processes, as Richard, Roy, Martin, and others have also said.
> Implementations exist that use and rely on the specific multi-language syntax of that tag (even if they aren't
> browsers). 
> The fact that browser implementations handle this element in a shoddy fashion is neither helpful nor very harmful but
>  should not be the guiding factor in deciding on the syntax of HTML5 documents. 
Agreed, we need not worry about how browser implementations handle it at present but only about setting standards that can be used now and in the future;
and yes this is meta-data for the most part.
> There is no reason to break others just for bug compatibility with existing browsers... especially since the overall effect of assigning @lang from C-L or just plain ignoring <meta> C-L is almost non-existent.
> Addison
> Addison Phillips
> Globalization Architect -- Lab126
> Chair -- W3C Internationalization WG
* * *
From: Richard Ishida <ishida@w3.org> 
Date: Fri, 12 Mar 2010 17:35:02 -0000
> Wrt the pragma, though, unless you have conclusive evidence that no-one ever
> has nor currently intends to use the pragma as a declaration of metadata
> about the document, then you cannot change how it works, because you'll
> break things. I think one could argue for deprecation of the use of the
> pragma going forward, if you were able to convince me that it is not useful
> for in-document metadata declarations, but we shouldn't just change the
> syntax in a way that would cause problems for people who may have been
> following the HTML4 spec in good faith.
The in-document meta element is the place to declare the languages when the document is multilingual. I can, and normally do, declare in the html tag, as the text-processing language, one of the languages I declare in the meta element, but only one of course (there are occasions, however, when it makes no sense to declare anything but the character set).
Of course, I used to note that Word, when used to create .html/.htm pages, automatically detected the text-processing language and then used the meta element to indicate it. But I would not let the fact that Word used the meta element automatically in this way cause me to make it illegal to list multiple languages in the meta element in all cases.
For one thing, I usually went into my source code and changed all the meta declarations to my liking, leaving only the author untouched.
I assume that some other authors changed the meta tags inserted by applications as well.

Best,
C. E. Whitehead
cewcathar@hotmail.com
Received on Friday, 12 March 2010 20:18:35 UTC