Re: ISSUE-88 - Change proposal (new update)

Lachlan Hunt, Fri, 07 May 2010 11:34:48 +0200:
> On 2010-05-05 20:22, Leif Halvard Silli wrote:
>> Let multiple language tags continue to be legal.
>> (http://www.w3.org/html/wg/wiki/ChangeProposals/ContentLanguages)
  ....
> the i18n WG.  Both proposals present similarly flawed arguments, 
> and so I will refute them  together.

The i18n WG does not pursue that proposal anymore.

> http://www.w3.org/International/wiki/Htmlissue88

>
> For this issue, we have 3 options presented:
>
> 1. Make Content-Language non-conforming.
> 2. Leave Content-Language as Obsolete but Conforming,  
>    permitting only a single language tag. (Current spec)

A fundamental flaw in both option 1 and option 2 is that they ride two 
horses: on the one hand, they seek to redefine the semantics of the 
Content-Language http-equiv - aligning it with the @lang attribute - 
while at the same time defining it as invalid. Do I really need to say 
more? 

> 3. Leave Content-Language as Obsolete but Conforming, 
>    permitting a  comma separated list of language tags.
  ....

> This [the pragma] use case differs significantly from the primary use
> case for the HTTP Content-Language header, which is to indicate the
> languages of the intended audience for the document.

Regardless: HTML5 defines the same behavior w.r.t. fallback, whether 
Content-Language comes from http-equiv or from http. 

> For the server, this use case makes some sense because it can be used 
> for content negotiation based on language.
  [...]
> However, this use case does not make any sense as in-document 
> metadata because once the user agent has the document, it's already 
> too late for such negotiation to occur.

Since, in an HTML5 parser, the fallback language *will* be affected 
regardless of where Content-Language comes from, it makes sense to allow 
Content-Language both inside a document and on the server, the same way 
that authors may set the encoding both on the server side as well as in 
the document itself.

> The reality is that the in-document Content-Language directive only 
> shares its name with the HTTP header field, while, in practice, its 
> functionality is closer to that of the lang attribute.  The solution 
> chosen for addressing this issue must take this into account.

That issue is taken into account, in all 3 options, by HTML5's 
fallback language algorithm. The issues which the *proposals* seek to 
resolve are the author conformance requirements and - ultimately - also 
the semantics.

> Although this is unnecessarily duplicated functionality, the problem 
> being solved is that it is already used on a relatively large number 
> of legacy pages and its use in this way is harmless.

None of the 3 proposals on the table consider this "harmless".

>  This means that 
> authors who are migrating existing pages to HTML5 do not have to be 
> too concerned about the presence of an innocuous element.  This is 
> why the spec currently makes it obsolete, but conforming when a 
> single value is used.

In an HTML5 parser, multiple tags would literally be harmless, as 
they would not have any fallback language effect at all.

> Summary from Leif:
>> == Summary ==
>> * Multiple language tags (a comma separated list) in @http-equiv
>>    Content-Language continues to be legal.
>
> Summary from I18N WG:
  [ snipped, since they don't support it ]

> Neither of these summaries describe any use case for why authors 
> would want to specify multiple languages in the meta element. 

Use cases are provided in the Rationale section - not in the Summary.

> The only reason given simply states that it should be allowed because 
> HTML 4.01 and XHTML 1.0 permitted it.  That in itself is not a valid 
> reason.

Such a stamp inside the document might e.g. be used for checking that 
the correct page is served to the correct audience.

Forget about the effect, focus on the semantics. Let us say that we 
could define the *semantics* of Content-Language freely. Then, of 
course, we could say that Content-Language has the same semantics as 
@lang. However, as long as the HTTP spec defines the semantics, we 
create less confusion by simply accepting the fact: that it is the HTTP 
spec's domain.

For example, if you live in Norway and visit the Web site of a Japanese 
computer producer, then the Web site may direct you to their pages for 
their Norwegian customers. Which, however, still typically may be in 
English. If the Content-Language of one such page says 
"Content-Language: no", and lacks any @lang attribute, then the UA will 
think that the language of the page is Norwegian ... Which would not be 
true.

> In fact, HTML4 did not say anything explicit about the use of 
> Content-Language in the meta element. [...]

Incorrect. HTML4 says that http-equiv is governed by RFC2616:

]]
The http-equiv attribute [...] Please see the HTTP specification 
([RFC2616]) for details on valid HTTP headers.
[[ http://www.w3.org/TR/html4/struct/global#h-7.4.4.2


  ...
> The spec does, however, imply some client side processing for the 
> Content-Language HTTP header [...] [5]:
  ...
>   * The HTTP "Content-Language" header (which may be configured in a
>     server). For example:

> But note that it makes no mention of the meta http-equiv, [...]

That reading is not without virtue. However, another reading is that 
the text merely makes the reader aware that Content-Language does not 
need to be defined in the document - it could be defined in the server. 
HTML4 does anyhow describe the order of significance, between HTTP and 
HTTP-EQUIV, when it comes to the Content-Style-Type header, so it is 
absolutely not foreign to HTML4 that http-equiv affects the UA 
directly. See: http://www.w3.org/TR/html4/present/styles#default-style


  ....
> We also have some observational evidence [6] that indicates that a 
> vast majority of authors only use a single value, 

If someone does use Content-Language according to the HTTP 
specification, then it would be correct to use "Content-Language: sv" 
for an English page aimed at a Swedish audience.  And then, according 
to HTML5, regardless of whether this value is found inside HTTP or 
HTTP-EQUIV, it *will* be used as the language of that document.

This is why the 'Let multiple language tags continue to be legal' 
proposal says that Content-Language should trigger a warning, every 
time the fallback language effect kicks in.

> Summary continued from Leif:
>> * Conformance checkers will emit a warning whenever  – and only if –
>>    the fallback language algorithm kicks in.
>> * The fallback warning will kick in regardless of whether the fallback
>>    comes from HTTP or Content-Language.
> 
> Summary continued from I18N WG:
  [ snipped again, since they don't pursue it ...]

> It's difficult to understand why you are arguing for multiple 
> languages to be considered conforming, while suggesting that the 
> defined implementation requirement is to ignore the value if multiple 
> languages are specified.  Your rationale in this case is 
> self-defeating.

a) The multilang proposal does not touch the HTML5 parsing.
b) The rationale is stated in the Rationale section.

[ snipped again reference to I18N WG proposal ]

> Rationale from Leif:
>> == Rationale ==
>> The problems with the current specification are
>> 
>> 1. That it prevents authors from legally using multiple values to
>>    replicate the language fallback effect of doing the same thing
>>    in a HTTP header.
> 
> The element language fallback behaviour when taken from an HTTP 
> Content-Language header containing multiple languages is to default 
> to unknown.  This is not useful behaviour for authors to explicitly 
> choose by using multiple languages in the meta element.  They get the 
> same result by omitting the Content-Language pragma from the document.

HTTP does also define what the lack of a Content-Language header means: 
It means that the document is for any audience.  So to not use 
Content-Language is not equivalent to using multiple language tags 
inside Content-Language.  The way to get authors to use 
Content-Language and @lang according to their semantics is to provide 
a warning whenever Content-Language's observable side effect kicks in.

>>    * That no language gets set, as HTML5 requires from multiple tags
>> whether they occur in HTTP or in @http-equiv, is still an effect. The
>> spec is therefore incorrect in claiming about the latter that “[for
>> instance it only supports one language]”.
> 
> Your claim here does not make sense.  The HTTP Content-Language 
> header does allow multiple language tags, whereas the current HTML5 
> spec only allows one.  So that claim quoted from the spec is indeed 
> correct, as it currently stands.

I use the word "support" not about conformance requirements, but about 
observable effects. If a web browser is programmed to have one 
behavior when Content-Language contains a single tag, and another 
behavior when it contains multiple tags, then clearly both single and 
multiple language tags are supported. (The current status is that only 
Gecko supports multiple language tags - although it supports them in a 
different way from what the HTML5 spec requires. The other UAs 
typically treat multiple tags as an (illegal) single language tag.)

The biggest problem with *both* the current spec and the proposal to 
make Content-Language non-conforming is that both of them consider 
Content-Language to be a way to define the language. Hence, both 
proposals suggest that the validator should say "Please use lang 
instead". And it also sends this warning *even when* the author has 
used @lang correctly. So it is just confusing. 

The 'multilang' proposal, OTOH, puts the burden on the *validators* to 
analyze both the HTTP header and HTTP-EQUIV, and compare them with the 
use of @lang: If the document has a lang attribute on the root element, 
then there is no reason to send any warning.  The 'multilang' proposal 
will therefore create a warning in validators for a smaller percentage 
of existing pages with a Content-Language header or pragma. The proposal 
quite precisely discerns between potentially harmful *(side) effects* of 
http and http-equiv.

>> 2. That it prevents @http-equiv from being used as a reference to what
>>    the HTTP Content-Language is/was meant to be.
>>    * Consider Firefox’ Page Info panel.
> 
> Firefox's Page Info panel is not a compelling use case for this 
> information.  It's just a diagnostic tool that outputs the specified 
> values.

It is what it is. 

>> Consider some CMSes.
> 
> CMSs use out of band information for determining the language of the 
> documents they send, if any.  This is more likely to come from 
> configuration settings, rather than the meta element specified 
> somewhere in the HTML, like in a page template.

CMSes would use Content-Language *not* for setting the language, but 
for setting the Content-Language.

>> Consider simply authors themselves.
> 
> What real benefit do authors themselves gain from using multiple 
> language values?

If you declare the page encoding as a HTTP header, then the <meta 
charset="*"> is also of no use.  However, as we know, pages are 
authored off-line, and then Content-Type and/or <meta charset="*"> 
matters. As an author, one may also consult the meta element to check 
what the encoding is. It is the same with Content-Language. It can be 
set on the server side - and it has many advantages to do so. But it 
may still be useful to also set it inside the document. The reason why 
the author does so, might be for the fallback effect. Or it might be 
for semantics, as defined in the HTTP spec.

>> 3. That it underlines the confusion that may exist today, about the
>>    nature of @lang versus Content-Language, by requiring:
>>    * different syntax rules for features that are expected to be
>>      identical (HTTP and @http-equiv )
> 
> In reality, as mentioned above, The HTTP header field and meta 
> element pragma directive only share a common name, while sharing very 
> little functionality.  And the little functionality that they do 
> share is as a secondary fallback for use in the absence of the lang 
> attribute.

This is to exaggerate: They share semantics, and they share the language 
fallback effect. Of course, the language negotiation effect is not 
shared. And neither can one use any of HTML5's global attributes on the 
server side, even though you can use them on the META element ...  So 
there are things that are naturally shared and things that are 
naturally different.

By the way: content negotiation may involve more than negotiation 
between different languages. In theory, it could also involve e.g. 
negotiation between different page encodings. And thus, once again, I 
want to emphasize how Content-Type and Content-Language are similar: 
one can set them either inside the document or on the server side. The 
semantics are the same. But the effects may not be the same.

>>    * similar syntax rules for features that are different
>>      (http-equiv and lang)
> 
> In practice, <meta http-equiv="Content-Language" content="en"> and 
> <html lang="en"> effectively share the same functionality.

No. They don't. Not unless you make a private decision to use 
Content-Language that way. And so one can never take for granted that 
it has been used that way. 

To create a specification which both says that it has the same 
semantics as @lang, and at the same time forbids it, is a very 
confusing way to change its semantics ...

>>    * a warning message which asks authors to “use @lang instead” – as if
>>      they were juxtaposable alternatives.
> 
> Use of the pragma directive is obsolete in HTML5.  Using a warning to 
> tell authors to use the better alternative is a good thing.

Incorrect. There is no warning about obsoleteness if one uses the 
Content-Type http-equiv.

Otherwise, you failed to take the point: If a page uses 
"Content-Language: no-no", because the page is for a Norwegian audience 
even if the text is in English, then it would be incorrect to tell 
authors to use @lang instead. That is: you cannot tell them to move the 
"no-no" tag to the lang attribute instead. And also, if the 
Content-Language contains multiple language tags, what do you say to 
them then? "Please use lang instead"? That is: move your multiple tags 
to @lang instead? And what is the logic of telling authors to use @lang 
instead of multiple languages inside Content-Language, when multiple 
languages inside Content-Language would have no language fallback 
effect? 

>> Conformance checking and warnings are in place, but should be about the
>> correct things.
>> 
>> 1. The current warning about using @lang instead of Content-Language
>> should be changed into a warning which informs that a fallback language
>> measure has kicked in, and recommend that authors create a language
>> declaration (via @lang) rather than relying on the fallback feature.
> 
> Looking at the cases where the fallback behaviour will or will not 
> kick in, we find the following:
> 
> Case 1:
> <html lang="en">
> <head>
>   <title>Example</title>
>   <meta http-equiv="Content-Language" content="en">
> </head>
> 
> The meta element here is completely useless.  The default language 
> for every element will be obtained from the lang attribute, either on the 
> html element or a nearer ancestor.  Warning about it being useless 
> seems completely reasonable.

Whether it is useless depends on the author/user/CMS/Firefox. The above 
says that the page is in English and for an English audience. Another 
page might be in English but for a Norwegian speaking audience.

But if, for a moment, we accept that it is useless, then there are many 
other useless things the validator could warn about. E.g. it is also 
useless to put lang="en" on all of the child elements of <html>, since, 
as you explain, they inherit it from <html lang="en">. Still, the 
validator does not warn about such use ... 

> Case 2:
> <html>
> <head>
>   <title>Example</title>
>   <meta http-equiv="Content-Language" content="en">
> </head>
> 
> Regardless of the presence of any other lang attributes anywhere else 
> in the document, the lack of the lang attribute on the html element 
> means that the fallback behaviour will kick in to determine the 
> language from the meta element.  I agree with that warning in this 
> case makes sense.

Cool! :-) It is only in Case 2 that I want any warning. In case 1, a 
warning is useless. Also, in case 1, it is likely that the author 
did understand the difference between Content-Language and lang.

>> This warning should be shown regardless of whether the fallback comes
>> from @http-equiv or from the higher level (HTTP). Justification: Since
>> it is a fallback feature, and with other semantics, there is no
>> guarantee that the author has used it for the language effect.
> 
>> 2. To hold the syntax rules of HTTP (which permits multiple language
>> tags) as the conforming ones (rather than those of @lang, which forbids
>> multiple languages), will have the effect of underlining that @lang and
>> Content-Language have different purposes.
> 
> Again, use of the Content-Language in the document has no other purpose.

You are allowed to have that opinion. But this is still not the 
semantics of Content-Language. Thus, as explained, one cannot take for 
granted that <html lang="*"> can replace Content-Language.

>> For instance, since the fallback algorithm doesn’t kick in whenever
>> multiple languages are used in the pragma or on the server, there
>> would not be any warning in these cases.
> 
> I do not understand what you are trying to say here.

The entire change proposal is strictly built around how Ian has defined 
that the pragma interacts with @lang. In that algorithm, if the pragma 
*or* the http header contains multiple languages, then no fallback 
language is defined. Thus, in an HTML5 parser, such Content-Language 
pragmas/headers will not affect the document in any way. And thus, there 
is no reason to warn authors that a fallback language measure has 
kicked in, since, in these cases, no fallback language measure will 
kick in.

This is a very important point: the multilang change proposal is not 
identical with the original proposal from the i18n wg. The multilang 
proposal says that there should be a warning when the *fallback* effect 
kicks in. What the correct lang attribute is, is up to the author to 
find out.  Hopefully the author understands that he used 
Content-Language for the wrong reason, and removes it. No need to tell 
him that it is useless. He understands.

>> == Details ==
>> Proposed spec changes, to section [4.2.5.3 Pragma directives]:
>> 
>> Replace the following text
>>    ]]  Conformance checkers will include a warning if this pragma is
>> used. Authors are encouraged to use the @lang attribute instead.[HTTP]
>> [[
>> 
>> with the following
>>    ]]  The semantics of this pragma, as well as of the HTTP
>> Content-Language header, are different from the semantics of the @lang
>> attribute. [HTTP] Thus, there is no guarantee that the author
>> consciously used either of them for setting the language. Therefore,
>> conformance checkers will include a warning, whenever HTML5’s fallback
>> language algorithm is activated, whether it is the higher protocol or
>> this pragma that kicks in. Authors are informed about which language
>> the document falls back to, and are encouraged to not rely on the
>> fallback feature but to instead explicitly use the @lang attribute on
>> the root element.  [[
> 
> It's not clear exactly what you're referring to as the "fallback 
> language algorithm", and what it means for it to be activated. But I 
> assume you are referring to the requirement that states:

Yes, that's about right. I probably should find another wording.

>   "If none of the node's ancestors, including the root element, have
>    either attribute set, but there is a pragma-set default language
>    set, then that is the language of the node. If there is no
>    pragma-set default language set, then language information from a
>    higher-level protocol (such as HTTP), if any, must be used as the
>    final fallback language instead. In the absence of any such language
>    information, and in cases where the higher-level protocol reports
>    multiple languages, the language of the node is unknown, and the
>    corresponding language tag is the empty string."
> 
> This effectively defines the following order of preference for 
> obtaining the language information:
> 
> 1. lang attribute on the element
> 2. lang attribute on ancestor element
> 3. pragma-set default language (<meta>)
> 4. HTTP Content-Language header field if only one language is specified
> 5. Unknown language, the corresponding language tag is the empty string.

For step 3, you forgot to say "if only one language is specified". That 
is: same behavior as for HTTP.
 
> Based on your above rationale, you seem to want the warning to apply 
> if #3 or #4 is used, even though that's in the middle of the 
> algorithm that you are referring to.

Steps 1 to 5 are optional steps. If there is a lang attribute on the 
element or on the parent, then there is no step 2, 3, 4 or 5. If there 
is no lang attribute, but there is a pragma-set default, then step 3 is 
the end of the algorithm. If there is no pragma-set default (that is: 
the pragma is empty or contains multiple languages or is simply 
lacking), then we jump to step 4, which is treated the same way as step 
3.
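
To make the order of the steps concrete, here is a minimal sketch in Python of how I read the fallback. It is only an illustration, not spec text: the function names and the simple comma splitting are my own assumptions, and real language tag validation is left out.

```python
from typing import Optional, Sequence


def single_tag(value: Optional[str]) -> Optional[str]:
    """Return the tag if the value holds exactly one language tag, else None.

    Assumption: tags are separated by commas only; no tag validation is done.
    """
    if value is None:
        return None
    tags = [t.strip() for t in value.split(",") if t.strip()]
    return tags[0] if len(tags) == 1 else None


def node_language(lang_chain: Sequence[Optional[str]],
                  pragma: Optional[str],
                  http_header: Optional[str]) -> str:
    """Hypothetical helper: resolve the language of a node.

    lang_chain lists the lang attribute values from the node itself up to
    the root element; None means the attribute is absent on that element.
    """
    # Steps 1 and 2: lang on the element itself or on the nearest ancestor.
    for lang in lang_chain:
        if lang is not None:
            return lang
    # Steps 3 and 4: pragma first, then the HTTP header, each only when it
    # contains a single language tag.
    for source in (pragma, http_header):
        tag = single_tag(source)
        if tag is not None:
            return tag
    # Step 5: unknown language, represented by the empty string.
    return ""
```

With these assumptions, a multi-tag pragma such as "en,ru" is skipped in step 3, and a single-tag HTTP header is then picked up in step 4.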

So I don't understand what you say about "middle of the algorithm". 

All that a validator needs to do is to check if the root element 
contains a lang attribute. If it doesn't, then, if either step 3 or 4 
results in a fallback language, then a warning should be shown.
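
As a rough sketch of that validator rule (Python; the function name, the message wording and the comma splitting are hypothetical, not taken from any existing validator):

```python
from typing import Optional


def _single_tag(value: Optional[str]) -> Optional[str]:
    # Assumption: a usable fallback needs exactly one comma-separated tag.
    if not value:
        return None
    tags = [t.strip() for t in value.split(",") if t.strip()]
    return tags[0] if len(tags) == 1 else None


def fallback_warning(root_lang: Optional[str],
                     pragma: Optional[str],
                     http_header: Optional[str]) -> Optional[str]:
    """Return a warning text when a fallback language kicks in, else None."""
    # A lang attribute on the root element means no fallback is needed.
    if root_lang is not None:
        return None
    # Check the pragma (step 3) before the HTTP header (step 4).
    for source_name, value in (("http-equiv pragma", pragma),
                               ("HTTP header", http_header)):
        tag = _single_tag(value)
        if tag is not None:
            return (f"The document language falls back to '{tag}' from the "
                    f"Content-Language {source_name}; consider setting lang "
                    f"on the root element instead of relying on the fallback.")
    # Empty, missing or multi-tag values give no fallback, hence no warning.
    return None
```

Note that the sketch stays silent both for Case 1 (root lang present) and for multiple tags, which is the behavior the proposal asks for.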

>  It's not clear why you want 
> the warning if HTTP Content-Language is used with no lang attribute.  

??? The Content-Language HTTP header cannot be empty.

> And based on the way you phrased the proposed requirement, the 
> algorithm will have "kicked in" before it gets to #5, but it 
> doesn't seem like you actually want a warning in that case.

The algorithm is defined not by me but by HTML5, and it only kicks 
in/affects the document if the root element doesn't have a lang 
attribute.  The only algorithm that *I* have tried to define, is the 
algorithm for when a validator should show a warning. See above.

> We can conclude from this that your proposed replacement text would 
> be inappropriate for use in the spec, even if the group decides to 
> permit multiple language values (despite the lack of convincing 
> rationale for doing so).

Fantasai has suggested some improvements, that I probably will 
incorporate.  But the most important thing is to agree on the direction 
... then we will find the words.

>> Delete the following text:
>>    ]]  This pragma is not exactly equivalent to the HTTP
>>  header, for instance it only supports one language.  [[
> 
> As explained above, this note is entirely accurate.  In practice, the 
> pragma directive in the meta element is in effect functionally the 
> same as the lang attribute, with little in common with the HTTP 
> header. Removing that note would not be useful.

The semantics of Content-Language are exactly the same both in 
http-equiv and in http. The differences in effect are only side effects 
of the format: http-equiv being an HTML element and http being an HTTP 
header. Thus we agree that "it is not exactly equivalent". However, not 
everything that is true is useful to express.  As for the specific 
thing that you focus on, the fallback language effect, you are 
plain wrong - as noted above, since 70 percent of browsers in use today 
do not treat multiple languages the way that HTML5 says they should. 
Hence legacy user agents do not support multiple languages, while HTML5 
user agents will support them.

Please note that Gecko's treatment of multiple language tags and 
HTML5's treatment of multiple language tags are just two different ways 
of "supporting" multiple language tags. 

>> == Impact ==
>> === Positive Effects ===
>> 1. More stable: same syntax as before continues to be permitted.
> 
> In the face of the above evidence against your proposal, it's not 
> clear that that is a positive effect.

I fail to see that you have provided evidence that I did not consider 
when writing that proposal.

>> 2. More permissive: authors, CMS-es and browsers can continue to take
>> advantage of @http-equiv ’s ability to reference what the HTTP header
>> is/was supposed to be, including replicating its fallback effect.
> 
> Given the practical effect of the directive, it has no relevance to 
> what the HTTP header is/was supposed to be.

If what you say here were true, then what HTML5 specifies for parsers 
w.r.t. Content-Language would be useless: When user agents implement 
the HTML5 fallback language behavior, then it *will* have a practical 
effect. Whereas today they will treat <meta 
http-equiv="Content-Language" content="en,ru" > as if the language of 
the document is a language whose language tag is the five characters 
"en,ru", they will in the future not define any language, but instead 
look at the HTTP header, and use the language from the HTTP header, in 
case it contains a single tag. Thus "no effect" is also an effect. We 
are looking forward to that day when multiple language tags inside 
Content-Language will not have any fallback language effect! Currently 
that is not the case.
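
The contrast can be sketched like this in Python. Both functions are hypothetical simplifications of the behaviors described in this message, not code from any browser:

```python
from typing import Optional


def legacy_pragma_language(content: str) -> str:
    """Legacy behavior as described above: the whole value is one tag."""
    return content.strip()


def html5_pragma_language(content: str) -> Optional[str]:
    """HTML5-style behavior: the pragma only sets a language when it holds
    a single tag; otherwise no pragma-set language is defined (None) and
    the HTTP header is consulted next.
    """
    tags = [t.strip() for t in content.split(",") if t.strip()]
    return tags[0] if len(tags) == 1 else None


value = "en,ru"
print(repr(legacy_pragma_language(value)))  # 'en,ru'
print(repr(html5_pragma_language(value)))   # None
```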

>> 3. More correct: the difference between @lang and Content-Language is
>> pointed out, while the link between @http-equiv and HTTP is emphasized.
> 
> Wrong again, for reasons explained above.

You cannot be of that opinion? Clearly, if http-equiv and http use the 
same syntax, as HTML4 evidently expects, despite your claims to 
the contrary, then of course the link between the two is underlined. 
You might be of the opinion that this link should *not* be underlined - 
however, that doesn't affect the logic of my argument.

>> 4. More useful: a warning that a fallback feature has kicked in, is
>> more useful than a warning which focuses on one of the places where the
>> fallback language could potentially kick in from. Why tell authors to
>> “use @lang instead” if the author has already made sure that the @lang
>> attribute is in place?
> 
> The warning from the validator could be phrased in any way the 
> implementer likes.  If the lang attribute is detected, the validator 
> could simply state that the Content-Language is unnecessary.  
> Otherwise, the validator could advise to use the lang attribute 
> instead.  But this is an implementation decision and no spec change is 
> needed to attain the desired behaviour in this particular case.

Feel free to suggest what you consider as improvements to the two other 
proposals - I am not the correct person to contact then ... Currently, 
however, the spec is both very clear and 100% in line with your 
emphasis on reinterpreting the semantics of Content-Language - it 
says: ]] Authors are encouraged to use the lang attribute instead. [[

The multilang proposal, however, does not speak about using lang 
instead of Content-Language, but about using lang instead of relying on 
the *fallback effect* of content-language.

>> === Negative Effects ===
>> none
> 
> Actually, there are negative effects with your change proposal:
> 
> 1. Perpetuates the myth that the HTTP Content-Language header field 
> and the in-document pragma directive are equivalent, when they are 
> not.

Absolutely untrue. The opposite is the effect: Each time the fallback 
effect kicks in, there will be a warning. There is no such warning 
today. Thus it seems very far-fetched to say that it perpetuates any 
myths.

Both of the other two proposals, however, create a new myth - the myth 
that if you only remove the Content-Language pragma, then you are safe. 

The multilang proposal instead treats Content-Language the same way 
regardless of whether it comes from HTTP or from http-equiv. Thus 
validators will show a warning *every* time the fallback language kicks 
in, instead of blindly forbidding the http-equiv version of 
Content-Language without offering the author any help to understand 
what is going on.

> 2. Fails to warn against the use of a useless and obsolete feature in 
>    all cases.

It so-called "fails" because one of the focuses of the proposal is 
correct language declaration - just as much as correct Content-Language 
declaration. If the language declaration is not affected (that is: if 
there is no fallback language effect), then there should be no 
warning.  (Validators should however also perform syntax checking - but 
that is partly another issue.)

> 3. Your proposed replacement text is entirely inappropriate for use 
> in the spec, for the reasons explained above.

If you will help me to express my intentions better, then you're welcome. ;-)

> IMHO, this now eliminates option #3 from the list I gave at the top 
> of this post, and leaves us with a decision between 2 valid 
> alternatives:

We probably disagree about how convincing your arguments were.

> 1. Make Content-Language non-conforming.

One problem with this option that I have not mentioned, is that it in 
many ways is identical with the status in HTML4: There will be no 
syntax checking of the content of the pragma.

[...]
-- 
leif halvard silli

Received on Saturday, 8 May 2010 01:16:27 UTC