
Re: ISSUE-88 - Change proposal (new update)

From: Lachlan Hunt <lachlan.hunt@lachy.id.au>
Date: Fri, 07 May 2010 11:34:48 +0200
Message-ID: <4BE3DEB8.60708@lachy.id.au>
To: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Cc: "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>, www-international@w3.org
On 2010-05-05 20:22, Leif Halvard Silli wrote:
> Let multiple language tags continue to be legal.
> (http://www.w3.org/html/wg/wiki/ChangeProposals/ContentLanguages)

This is a response to the arguments put forth in both that change 
proposal and the change proposal from the i18n WG.  Both proposals 
present similarly flawed arguments, and so I will refute them together.

http://www.w3.org/International/wiki/Htmlissue88

For this issue, we have 3 options presented:

1. Make Content-Language non-conforming.
2. Leave Content-Language as Obsolete but Conforming, permitting only a 
single language tag. (Current spec)
3. Leave Content-Language as Obsolete but Conforming, permitting a comma 
separated list of language tags.

To make an effective choice between these 3 alternatives, it's important 
to understand the use cases and problems that can be addressed with 
Content-Language, and to understand why the spec currently specifies 
option #2.

The Content-Language pragma directive was added as a result of feedback 
from Henri Sivonen sent in April 2008 [1], and addressed by Hixie in 
August [2].

In summary, the feedback observed that authors use the pragma 
analogously to the lang attribute, to provide client side language 
metadata.  This matches the support among 
browsers, which use the value in the inheritance chain for determining 
the default language of elements in the document.  Although, as Hixie 
noted in the zero-edit change proposal [3], browsers currently lack 
interoperability on this issue, especially when multiple values are 
involved.

This use case differs significantly from the primary use case for the 
HTTP Content-Language header, which is to indicate the languages of the 
intended audience for the document.

For the server, this use case makes some sense because it can be used 
for content negotiation based on language.  A browser that declares 
Accept-Language can, if the server is configured for it, receive a 
document in a language intended for speakers of the declared language. 
The Content-Language and other relevant header fields can then be used 
by the server to declare this information, which can then be seen by 
intermediary servers, such as caches, to properly cache the document. 
For this case, at least in theory, multiple language tags may be 
appropriate since, for example, some written languages are close enough 
to each other that they can be understood by people who speak either, 
and so such a document could be appropriate for all similar languages.
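
To make the server side case concrete, here is a minimal sketch in 
Python of language-based content negotiation.  It is not taken from any 
real server: the function names, the file names and the Norwegian 
example variant are assumptions for illustration only, and matching is 
plain prefix matching of each Accept-Language range against each 
Content-Language tag.

```python
def parse_accept_language(header):
    """Parse an Accept-Language value into (tag, q) pairs, best first."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        parts = item.split(";")
        tag = parts[0].strip().lower()
        q = 1.0  # q defaults to 1 when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        prefs.append((tag, q))
    # sorted() is stable, so ranges with equal q keep their header order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def negotiate(accept_language, variants):
    """Pick a variant for the request.

    variants maps a (hypothetical) file name to the list of language
    tags the server would send in Content-Language for that file.
    Returns (file name, Content-Language value), or None if nothing
    matches.
    """
    for tag, q in parse_accept_language(accept_language):
        if q <= 0:
            continue
        for name, langs in variants.items():
            for lang in langs:
                lang = lang.lower()
                if tag == "*" or lang == tag or lang.startswith(tag + "-"):
                    return name, ", ".join(langs)
    return None


# Hypothetical site: one document judged suitable for both Bokmål and
# Nynorsk readers, and one English document.
variants = {
    "index.nb.html": ["nb", "nn"],
    "index.en.html": ["en"],
}
print(negotiate("nn, en;q=0.5", variants))
```

Note that the multi-valued Content-Language only does any work here 
because the selection happens on the server, before a response is sent.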

However, this use case does not make any sense as in-document metadata 
because once the user agent has the document, it's already too late for 
such negotiation to occur.

The reality is that the in-document Content-Language directive only 
shares its name with the HTTP header field, while, in practice, its 
functionality is closer to that of the lang attribute.  The solution 
chosen for addressing this issue must take this into account.

Although this is unnecessarily duplicated functionality, the problem 
being solved is that it is already used on a relatively large number of 
legacy pages and its use in this way is harmless.  This means that 
authors who are migrating existing pages to HTML5 do not have to be too 
concerned about the presence of an innocuous element.  This is why the 
spec currently makes it obsolete, but conforming when a single value is 
used.

It is, however, questionable whether or not the usage of 
Content-Language within documents is significant enough for HTML5 to be 
concerned about it.  If it isn't, then Content-Language shouldn't be 
permitted at all in the meta element.

Summary from Leif:
> == Summary ==
> * Multiple language tags (a comma separated list) in @http-equiv
>    Content-Language continues to be legal.

Summary from I18N WG:
> The HTML 4.01 and XHTML syntax for the Content Language pragma allows
> for a comma separated list of languages as the value of the content
> attribute. It is proposed that this be reinstated in the HTML5
> specification, where current wording points to a single value.

Neither of these summaries describes any use case for why authors would 
want to specify multiple languages in the meta element.  The only reason 
given simply states that it should be allowed because HTML 4.01 and 
XHTML 1.0 permitted it.  That in itself is not a valid reason.

In fact, HTML4 did not say anything explicit about the use of 
Content-Language in the meta element.  It simply failed to impose any 
restrictions on the legal values of the content attribute, and it is 
only assumed that the value should match the valid value of the 
equivalent HTTP header.  It also did not specify any client side 
processing for the http-equiv attribute and associated values.  It 
simply stated [4]:

   http-equiv = name [CI]
     This attribute may be used in place of the name attribute. HTTP
     servers use this attribute to gather information for HTTP response
     message headers.

In practice, there are no known servers that utilise this information 
from the document itself.

The spec does, however, imply some client side processing for the 
Content-Language HTTP header, where it specifies the inheritance of 
language codes [5]:

   An element inherits language code information according to the
   following order of precedence (highest to lowest):

   * The lang attribute set for the element itself.
   * The closest parent element that has the lang attribute set
     (i.e., the lang attribute is inherited).
   * The HTTP "Content-Language" header (which may be configured in a
     server). For example:

       Content-Language: en-cockney

   * User agent default values and user preferences.

But note that it makes no mention of the meta http-equiv, which as 
mentioned above is technically reserved for server side processing. 
Also note that there is no advice given about the use of multiple 
language values, and thus there is an apparent assumption in HTML4 that 
there would only be a single value given.

We also have some observational evidence [6] that indicates that a vast 
majority of authors only use a single value, and that the minority of 
authors who do use multiple values, don't do so with the expectation of 
any significant or useful processing (except maybe a few who depend on 
it for the :lang() pseudo-class).

Summary continued from Leif:
> * Conformance checkers will emit a warning whenever  – and only if –
>    the fallback language algorithm kicks in.
> * The fallback warning will kick in regardless of whether the fallback
>    comes from HTTP or Content-Language.

Summary continued from I18N WG:
> Given that change, some wording also needs to be added to ensure that
> it is clear what to do to if no lang attribute is applied to content
> and the language of that content is to be inferred by examining the
> Content Language pragma, if there is one, when the content attribute
> contains multiple languages.
>
> The list consensus and recommendation of the i18n WG is that the
> Content Language pragma can only be used to infer the language of the
> document in this way if there is a single value in the content
> attribute - otherwise, the implementation should look for a higher
> level protocol, and failing to find one should accept that the value
> of the content is unknown (empty string).

It's difficult to understand why you are arguing for multiple languages 
to be considered conforming, while suggesting that the defined 
implementation requirement is to ignore the value if multiple languages 
are specified.  Your rationale in this case is self-defeating.

> It is also desirable that additional clarity be provided as to the
> differences between language declarations in the HTTP/pragma
> locations and those in language attributes on elements.

I am not opposed to further clarifying the difference between the HTTP 
Content-Language header and the lang attribute.  But pretending that 
http-equiv=Content-Language is similar to the HTTP header is not helpful.

Rationale from Leif:
> == Rationale ==
> The problems with the current specification are
>
> 1. That it prevents authors from legally using multiple values to
>    replicate the language fallback effect of doing the same thing
>    in a HTTP header.

The element language fallback behaviour when taken from an HTTP 
Content-Language header containing multiple languages is to default to 
unknown.  This is not useful behaviour for authors to explicitly choose 
by using multiple languages in the meta element.  They get the same 
result by omitting the Content-Language pragma from the document.

>    * That no language gets set, as HTML5 requires from multiple tags
> whether they occur in HTTP or in @http-equiv, is still an effect. The
> spec is therefore incorrect in claiming about the latter that “[for
> instance it only supports one language]”.

Your claim here does not make sense.  The HTTP Content-Language header 
does allow multiple language tags, whereas the current HTML5 spec only 
allows one.  So that claim quoted from the spec is indeed correct, as it 
currently stands.

> 2. That it prevents @http-equiv from being used as a reference to what
>    the HTTP Content-Language is/was meant to be.
>    * Consider Firefox’ Page Info panel.

Firefox's Page Info panel is not a compelling use case for this 
information.  It's just a diagnostic tool that outputs the specified values.

> Consider some CMSes.

CMSs use out of band information for determining the language of the 
documents they send, if any.  This is more likely to come from 
configuration settings, rather than the meta element specified somewhere 
in the HTML, like in a page template.

> Consider simply authors themselves.

What real benefit do authors themselves gain from using multiple 
language values?

> 3. That it underlines the confusion that may exist today, about the
>    nature of @lang versus Content-Language, by requiring:
>    * different syntax rules for features that are expected to be
>      identical (HTTP and @http-equiv )

In reality, as mentioned above, the HTTP header field and meta element 
pragma directive only share a common name, while sharing very little 
functionality.  And the little functionality that they do share is as a 
secondary fallback for use in the absence of the lang attribute.

>    * similar syntax rules for features that are different
>      (http-equiv and lang)

In practice, <meta http-equiv="Content-Language" content="en"> and <html 
lang="en"> effectively share the same functionality.

>    * a warning message which asks authors to “use @lang instead” – as if
>      they were juxtaposable alternatives.

Use of the pragma directive is obsolete in HTML5.  Using a warning to 
tell authors to use the better alternative is a good thing.

> Conformance checking and warnings are in place, but should be about the
> correct things.
>
> 1. The current warning about using @lang instead of Content-Language
> should be changed into a warning which informs that a fallback language
> measure has kicked in, and recommend that authors create a language
> declaration (via @lang) rather than relying on the fallback feature.

Looking at the cases where the fallback behaviour will or will not kick in, 
we find the following:

Case 1:
<html lang="en">
<head>
   <title>Example</title>
   <meta http-equiv="Content-Language" content="en">
</head>

The meta element here is completely useless.  The default language for 
every element will be obtained from the lang attribute, either on the html 
element or a nearer ancestor.  Warning about it being useless seems 
completely reasonable.


Case 2:
<html>
<head>
   <title>Example</title>
   <meta http-equiv="Content-Language" content="en">
</head>

Regardless of the presence of any other lang attributes anywhere else in 
the document, the lack of the lang attribute on the html element means 
that the fallback behaviour will kick in to determine the language from 
the meta element.  I agree that a warning in this case makes sense.

> This warning should be shown regardless of whether the fallback comes
> from @http-equiv or from the higher level (HTTP). Justification: Since
> it is a fallback feature, and with other semantics, there is no
> guarantee that the author has used it for the language effect.

> 2. To hold the syntax rules of HTTP (which permits multiple language
> tags) as the conforming ones (rather than those of @lang, which forbids
> multiple languages), will have the effect of underlining that @lang and
> Content-Language have different purposes.

Again, use of Content-Language in the document has no other purpose.

> For instance, since the fallback algorithm doesn’t kick in whenever
> multiple languages are used in the pragma or on the server, there
> would not be any warning in these cases.

I do not understand what you are trying to say here.

> == Details ==
> Proposed spec changes, to section [4.2.5.3 Pragma directives]:
>
> Replace the following text
>    ]]  Conformance checkers will include a warning if this pragma is
> used. Authors are encouraged to use the @lang attribute instead.[HTTP]
> [[
>
> with the following
>    ]]  The semantics of this pragma, as well as of the HTTP
> Content-Language header, are different from the semantics of the @lang
> attribute. [HTTP] Thus, there is no guarantee that the author
> consciously used either of them for setting the language. Therefore,
> conformance checkers will include a warning, whenever HTML5’s fallback
> language algorithm is activated, whether it is the higher protocol or
> this pragma that kicks in. Authors are informed about which language
> the document falls back to, and are encouraged to not rely on the
> fallback feature but to instead explicitly use the @lang attribute on
> the root element.  [[

It's not clear exactly what you're referring to as the "fallback 
language algorithm", and what it means for it to be activated. But I 
assume you are referring to the requirement that states:

   "If none of the node's ancestors, including the root element, have
    either attribute set, but there is a pragma-set default language
    set, then that is the language of the node. If there is no
    pragma-set default language set, then language information from a
    higher-level protocol (such as HTTP), if any, must be used as the
    final fallback language instead. In the absence of any such language
    information, and in cases where the higher-level protocol reports
    multiple languages, the language of the node is unknown, and the
    corresponding language tag is the empty string."

This effectively defines the following order of preference for obtaining 
the language information:

1. lang attribute on the element
2. lang attribute on ancestor element
3. pragma-set default language (<meta>)
4. HTTP Content-Language header field if only one language is specified
5. Unknown language, the corresponding language tag is the empty string.
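
As a rough illustration only, that order of preference could be sketched 
in Python as follows.  The function and parameter names are my own, not 
anything from the spec, and the sketch assumes the pragma-set default 
language has already been determined elsewhere (None meaning that it is 
not set).

```python
def parse_language_list(value):
    """Split a comma separated Content-Language value into its tags."""
    return [tag.strip() for tag in value.split(",") if tag.strip()]


def node_language(lang_chain, pragma_default=None, http_content_language=None):
    """Return the language tag of a node, or "" if it is unknown.

    lang_chain: lang attribute values starting at the node itself and
        walking up to the root element, with None where the attribute
        is absent.
    pragma_default: the pragma-set default language, or None.
    http_content_language: the raw HTTP Content-Language value, or None.
    """
    # 1 and 2: lang on the element itself, then on the nearest ancestor.
    for lang in lang_chain:
        if lang is not None:
            return lang
    # 3: pragma-set default language (<meta>).
    if pragma_default is not None:
        return pragma_default
    # 4: HTTP Content-Language, but only if exactly one language is given.
    if http_content_language is not None:
        tags = parse_language_list(http_content_language)
        if len(tags) == 1:
            return tags[0]
    # 5: unknown language, represented by the empty string.
    return ""


# No lang attributes and no pragma: a multi-valued HTTP header gives the
# same result as having no language information at all.
print(repr(node_language([None, None], None, "de, fr")))
print(repr(node_language([None, None], None, None)))
```

Under these assumptions, a header that lists several languages is 
indistinguishable from no header at all as far as the language of the 
node is concerned.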

Based on your above rationale, you seem to want the warning to apply if 
#3 or #4 is used, even though that's in the middle of the algorithm that 
you are referring to.   It's not clear why you want the warning if HTTP 
Content-Language is used with no lang attribute.  And based on the way 
you phrased the proposed requirement, the algorithm will have "kicked 
in" before it gets to #5, but it doesn't seem like you actually want a 
warning in that case.

We can conclude from this that your proposed replacement text would be 
inappropriate for use in the spec, even if the group decides to permit 
multiple language values (despite the lack of convincing rationale for 
doing so).

> Delete the following text:
>    ]]  This pragma is not exactly equivalent to the HTTP
> Content-Language header, for instance it only supports one language.  [[

As explained above, this note is entirely accurate.  In practice, the 
pragma directive in the meta element is in effect functionally the same 
as the lang attribute, with little in common with the HTTP header. 
Removing that note would not be useful.

> == Impact ==
> === Positive Effects ===
> 1. More stable: same syntax as before continues to be permitted.

In the face of the above evidence against your proposal, it's not clear 
that that is a positive effect.

> 2. More permissive: authors, CMS-es and browsers can continue to take
> advantage of @http-equiv ’s ability to reference what the HTTP header
> is/was supposed to be, including replicating its fallback effect.

Given the practical effect of the directive, it has no relevance to what 
the HTTP header is/was supposed to be.

> 3. More correct: the difference between @lang and Content-Language is
> pointed out, while the link between @http-equiv and HTTP is emphasized.

Wrong again, for reasons explained above.

> 4. More useful: a warning that a fallback feature has kicked in, is
> more useful than a warning which focuses on one of the places where the
> fallback language could potentially kick in from. Why tell authors to
> “use @lang insetad” if the author has already made sure that the @lang
> attribute is in place?

The warning from the validator could be phrased in any way the 
implementer likes.  If the lang attribute is detected, the validator 
could simply state that the Content-Language is unnecessary.  Otherwise, 
the validator could advise to use the lang attribute instead.  But this 
is an implementation decision, and no spec change is needed to attain the 
desired behaviour in this particular case.
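
For what it's worth, a validator could make that distinction with very 
little logic.  The following Python sketch is hypothetical: the class 
name, function name and message wording are invented here and do not 
describe any existing validator.  It only checks for a lang attribute on 
the html element and for a Content-Language pragma.

```python
from html.parser import HTMLParser


class LanguageDeclarationScanner(HTMLParser):
    """Record whether the root has lang and whether the pragma is used."""

    def __init__(self):
        super().__init__()
        self.root_has_lang = False
        self.has_pragma = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang") is not None:
            self.root_has_lang = True
        elif tag == "meta":
            http_equiv = attrs.get("http-equiv") or ""
            if http_equiv.lower() == "content-language":
                self.has_pragma = True


def pragma_warning(markup):
    """Return a warning message for the pragma, or None if it is absent."""
    scanner = LanguageDeclarationScanner()
    scanner.feed(markup)
    scanner.close()
    if not scanner.has_pragma:
        return None
    if scanner.root_has_lang:
        return ("The Content-Language pragma is unnecessary: "
                "lang is already set on the root element.")
    return ("Use the lang attribute on the root element "
            "instead of the Content-Language pragma.")


case_1 = ('<html lang="en"><head><title>Example</title>'
          '<meta http-equiv="Content-Language" content="en"></head>')
case_2 = ('<html><head><title>Example</title>'
          '<meta http-equiv="Content-Language" content="en"></head>')
print(pragma_warning(case_1))
print(pragma_warning(case_2))
```

Both messages fall out of the existing requirement to warn when the 
pragma is used; only the wording differs.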

> === Negative Effects ===
> none

Actually, there are negative effects with your change proposal:

1. Perpetuates the myth that the HTTP Content-Language header field and 
the in-document pragma directive are equivalent, when they are not.
2. Fails to warn against the use of a useless and obsolete feature in 
all cases.
3. Your proposed replacement text is entirely inappropriate for use in 
the spec, for the reasons explained above.

IMHO, this now eliminates option #3 from the list I gave at the top of 
this post, and leaves us with a decision between 2 valid alternatives:

1. Make Content-Language non-conforming.
2. Leave Content-Language as Obsolete but Conforming, permitting only a 
single value

The choice between either of these depends entirely on whether or not 
the legacy usage of the Content-Language pragma is compelling enough for 
the spec to bless it as obsolete but conforming.  I do not have a strong 
opinion on this either way.

[1] http://lists.w3.org/Archives/Public/public-html/2008Apr/0556.html
[2] http://lists.w3.org/Archives/Public/public-html/2008Aug/0300.html
[3] http://lists.w3.org/Archives/Public/public-html/2010Apr/0307.html
[4] http://www.w3.org/TR/html401/struct/global.html#h-7.4.4.2
[5] http://www.w3.org/TR/html401/struct/dirlang.html#h-8.1.2
[6] http://lists.w3.org/Archives/Public/public-html/2010Apr/0088.html

-- 
Lachlan Hunt - Opera Software
http://lachy.id.au/
http://www.opera.com/
Received on Friday, 7 May 2010 09:35:22 GMT