- From: Lachlan Hunt <lachlan.hunt@lachy.id.au>
- Date: Fri, 07 May 2010 11:34:48 +0200
- To: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Cc: "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>, www-international@w3.org
On 2010-05-05 20:22, Leif Halvard Silli wrote:
> Let multiple language tags continue to be legal.
> (http://www.w3.org/html/wg/wiki/ChangeProposals/ContentLanguages)

This is a response to the arguments put forth in both that change proposal and the change proposal from the i18n WG. Both proposals present similarly flawed arguments, and so I will refute them together.

http://www.w3.org/International/wiki/Htmlissue88

For this issue, we have 3 options presented:

1. Make Content-Language non-conforming.
2. Leave Content-Language as Obsolete but Conforming, permitting only a single language tag. (Current spec)
3. Leave Content-Language as Obsolete but Conforming, permitting a comma separated list of language tags.

To make an effective choice between these 3 alternatives, it's important to understand the use cases and problems that can be addressed with Content-Language, and to understand why the spec currently specifies option #2.

The Content-Language pragma directive was added as a result of feedback from Henri Sivonen sent in April 2008 [1], and addressed by Hixie in August [2]. In summary, the feedback described the observed usage pattern among authors: the pragma is used analogously to the lang attribute for providing client side language metadata. This matches the support among browsers, which use the value in the inheritance chain for determining the default language of elements in the document. However, as Hixie noted in the zero-edit change proposal [3], browsers currently lack interoperability on this issue, especially when multiple values are involved.

This use case differs significantly from the primary use case for the HTTP Content-Language header, which is to indicate the languages of the intended audience for the document. For the server, the HTTP use case makes some sense because it can be used for content negotiation based on language. A browser that declares Accept-Language can, if the server is configured for it, receive a document intended for speakers of the declared language. The server can then use Content-Language and other relevant header fields to declare this information, which intermediary servers, such as caches, can use to cache the document properly. For this case, at least in theory, multiple language tags may be appropriate since, for example, some written languages are close enough to each other that they can be understood by people who speak either, and so such a document could be appropriate for all similar languages.

However, this use case does not make any sense as in-document metadata because once the user agent has the document, it's already too late for such negotiation to occur.

The reality is that the in-document Content-Language directive only shares its name with the HTTP header field, while, in practice, its functionality is closer to that of the lang attribute. The solution chosen for addressing this issue must take this into account.

Although this is unnecessarily duplicated functionality, the problem being solved is that it is already used on a relatively large number of legacy pages and its use in this way is harmless. This means that authors who are migrating existing pages to HTML5 do not have to be too concerned about the presence of an innocuous element. This is why the spec currently makes it obsolete, but conforming when a single value is used.

It is, however, questionable whether or not the usage of Content-Language within documents is significant enough for HTML5 to be concerned about it. If it isn't, then Content-Language shouldn't be permitted at all in the meta element.

Summary from Leif:
> == Summary ==
> * Multiple language tags (a comma separated list) in @http-equiv
> Content-Language continues to be legal.

Summary from I18N WG:
> The HTML 4.01 and XHTML syntax for the Content Language pragma allows
> for a comma separated list of languages as the value of the content
> attribute. It is proposed that this be reinstated in the HTML5
> specification, where current wording points to a single value.

Neither of these summaries describes any use case for why authors would want to specify multiple languages in the meta element. The only reason given is that it should be allowed because HTML 4.01 and XHTML 1.0 permitted it. That in itself is not a valid reason.

In fact, HTML4 did not say anything explicit about the use of Content-Language in the meta element. It simply failed to impose any restrictions on the legal values of the content attribute, and it is only assumed that the value should match the valid value of the equivalent HTTP header. It also did not specify any client side processing for the http-equiv attribute and associated values. It simply stated [4]:

  http-equiv = name [CI]
    This attribute may be used in place of the name attribute. HTTP servers use this attribute to gather information for HTTP response message headers.

In practice, there are no known servers that utilise this information from the document itself.

The spec does, however, imply some client side processing for the Content-Language HTTP header, where it specifies the inheritance of language codes [5]:

  An element inherits language code information according to the following order of precedence (highest to lowest):

  * The lang attribute set for the element itself.
  * The closest parent element that has the lang attribute set (i.e., the lang attribute is inherited).
  * The HTTP "Content-Language" header (which may be configured in a server). For example:
      Content-Language: en-cockney
  * User agent default values and user preferences.

But note that it makes no mention of the meta http-equiv, which as mentioned above is technically reserved for server side processing. Also note that there is no advice given about the use of multiple language values, and thus there is an apparent assumption in HTML4 that there would only be a single value given.

We also have some observational evidence [6] indicating that the vast majority of authors only use a single value, and that the minority of authors who do use multiple values don't do so with the expectation of any significant or useful processing (except maybe a few who depend on it for the :lang() pseudo-class).

Summary continued from Leif:
> * Conformance checkers will emit a warning whenever – and only if –
> the fallback language algorithm kicks in.
> * The fallback warning will kick in regardless of whether the fallback
> comes from HTTP or Content-Language.

Summary continued from I18N WG:
> Given that change, some wording also needs to be added to ensure that
> it is clear what to do to if no lang attribute is applied to content
> and the language of that content is to be inferred by examining the
> Content Language pragma, if there is one, when the content attribute
> contains multiple languages.
>
> The list consensus and recommendation of the i18n WG is that the
> Content Language pragma can only be used to infer the language of the
> document in this way if there is a single value in the content
> attribute - otherwise, the implementation should look for a higher
> level protocol, and failing to find one should accept that the value
> of the content is unknown (empty string).

It's difficult to understand why you are arguing for multiple languages to be considered conforming, while suggesting that the defined implementation requirement is to ignore the value if multiple languages are specified. Your rationale in this case is self-defeating.

> It is also desirable that additional clarity be provided as to the
> differences between language declarations in the HTTP/pragma
> locations and those in language attributes on elements.

I am not opposed to further clarifying the difference between the HTTP Content-Language header and the lang attribute. But pretending that http-equiv=Content-Language is similar to the HTTP header is not helpful.

Rationale from Leif:
> == Rationale ==
> The problems with the current specification are
>
> 1. That it prevents authors from legally using multiple values to
> replicate the language fallback effect of doing the same thing
> in a HTTP header.

The element language fallback behaviour when taken from an HTTP Content-Language header containing multiple languages is to default to unknown. This is not useful behaviour for authors to explicitly choose by using multiple languages in the meta element. They get the same result by omitting the Content-Language pragma from the document.

> * That no language gets set, as HTML5 requires from multiple tags
> whether they occur in HTTP or in @http-equiv, is still an effect. The
> spec is therefore incorrect in claiming about the latter that “[for
> instance it only supports one language]”.

Your claim here does not make sense. The HTTP Content-Language header does allow multiple language tags, whereas the current HTML5 spec only allows one in the pragma. So that claim quoted from the spec is indeed correct, as it currently stands.

> 2. That it prevents @http-equiv from being used as a reference to what
> the HTTP Content-Language is/was meant to be.
> * Consider Firefox’ Page Info panel.

Firefox's Page Info panel is not a compelling use case for this information. It's just a diagnostic tool that outputs the specified values.

> Consider some CMSes.

CMSs use out of band information for determining the language of the documents they send, if any. This is more likely to come from configuration settings than from a meta element specified somewhere in the HTML, such as in a page template.

> Consider simply authors themselves.

What real benefit do authors themselves gain from using multiple language values?

> 3. That it underlines the confusion that may exist today, about the
> nature of @lang versus Content-Language, by requiring:
> * different syntax rules for features that are expected to be
> identical (HTTP and @http-equiv )

In reality, as mentioned above, the HTTP header field and the meta element pragma directive only share a common name, while sharing very little functionality. And the little functionality that they do share is as a secondary fallback for use in the absence of the lang attribute.

> * similar syntax rules for features that are different
> (http-equiv and lang)

In practice, <meta http-equiv="Content-Language" content="en"> and <html lang="en"> effectively share the same functionality.

> * a warning message which asks authors to “use @lang instead” – as if
> they were juxtaposable alternatives.

Use of the pragma directive is obsolete in HTML5. Using a warning to tell authors to use the better alternative is a good thing.

> Conformance checking and warnings are in place, but should be about the
> correct things.
>
> 1. The current warning about using @lang instead of Content-Language
> should be changed into a warning which informs that a fallback language
> measure has kicked in, and recommend that authors create a language
> declaration (via @lang) rather than relying on the fallback feature.

Looking at the cases where the fallback behaviour will or will not kick in, we find the following:

Case 1:

  <html lang="en">
  <head>
  <title>Example</title>
  <meta http-equiv="Content-Language" content="en">
  </head>

The meta element here is completely useless. The default language for every element will be obtained from the lang attribute, either on the html element or a nearer ancestor. Warning about it being useless seems completely reasonable.

Case 2:

  <html>
  <head>
  <title>Example</title>
  <meta http-equiv="Content-Language" content="en">
  </head>

Regardless of the presence of any other lang attributes anywhere else in the document, the lack of the lang attribute on the html element means that the fallback behaviour will kick in to determine the language from the meta element. I agree that a warning in this case makes sense.

> This warning should be shown regardless of whether the fallback comes
> from @http-equiv or from the higher level (HTTP). Justification: Since
> it is a fallback feature, and with other semantics, there is no
> guarantee that the author has used it for the language effect.
> 2. To hold the syntax rules of HTTP (which permits multiple language
> tags) as the conforming ones (rather than those of @lang, which forbids
> multiple languages), will have the effect of underlining that @lang and
> Content-Language have different purposes.

Again, use of the Content-Language pragma in the document has no other purpose.

> For instance, since the fallback algorithm doesn’t kick in whenever
> multiple languages are used in the pragma or on the server, there
> would not be any warning in these cases.

I do not understand what you are trying to say here.

> == Details ==
> Proposed spec changes, to section [4.2.5.3 Pragma directives]:
>
> Replace the following text
> ]] Conformance checkers will include a warning if this pragma is
> used. Authors are encouraged to use the @lang attribute instead.[HTTP]
> [[
>
> with the following
> ]] The semantics of this pragma, as well as of the HTTP
> Content-Language header, are different from the semantics of the @lang
> attribute. [HTTP] Thus, there is no guarantee that the author
> consciously used either of them for setting the language. Therefore,
> conformance checkers will include a warning, whenever HTML5’s fallback
> language algorithm is activated, whether it is the higher protocol or
> this pragma that kicks in. Authors are informed about which language
> the document falls back to, and are encouraged to not rely on the
> fallback feature but to instead explicitly use the @lang attribute on
> the root element. [[

It's not clear exactly what you're referring to as the "fallback language algorithm", and what it means for it to be activated. But I assume you are referring to the requirement that states:

  "If none of the node's ancestors, including the root element, have either attribute set, but there is a pragma-set default language set, then that is the language of the node. If there is no pragma-set default language set, then language information from a higher-level protocol (such as HTTP), if any, must be used as the final fallback language instead. In the absence of any such language information, and in cases where the higher-level protocol reports multiple languages, the language of the node is unknown, and the corresponding language tag is the empty string."

This effectively defines the following order of preference for obtaining the language information:

1. lang attribute on the element
2. lang attribute on ancestor element
3. pragma-set default language (<meta>)
4. HTTP Content-Language header field if only one language is specified
5. Unknown language, the corresponding language tag is the empty string.

Based on your above rationale, you seem to want the warning to apply if #3 or #4 is used, even though that's in the middle of the algorithm that you are referring to. It's not clear why you want the warning if HTTP Content-Language is used with no lang attribute. And based on the way you phrased the proposed requirement, the algorithm will have "kicked in" before it gets to #5, but it doesn't seem like you actually want a warning in that case.

We can conclude from this that your proposed replacement text would be inappropriate for use in the spec, even if the group decides to permit multiple language values (despite the lack of convincing rationale for doing so).
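
To make the five-step order of preference above concrete, here is a minimal Python sketch of it. It is only an illustration, not spec text or any browser's implementation: the function and parameter names are hypothetical, and the pragma-set default language is assumed to be an already-determined input (how it is derived from the meta element is outside this sketch).

```python
from typing import List, Optional, Sequence


def parse_language_list(value: Optional[str]) -> List[str]:
    """Split a comma-separated Content-Language value into trimmed, non-empty tags."""
    if not value:
        return []
    return [tag.strip() for tag in value.split(",") if tag.strip()]


def resolve_language(
    own_lang: Optional[str],
    ancestor_langs: Sequence[Optional[str]],
    pragma_default: Optional[str],
    http_content_language: Optional[str],
) -> str:
    """Return the language tag of a node; "" means the language is unknown.

    ancestor_langs lists the lang values of the ancestors, nearest first
    (hypothetical representation; None means the attribute is not set).
    """
    # 1. lang attribute on the element itself.
    if own_lang is not None:
        return own_lang
    # 2. lang attribute on the nearest ancestor that has one.
    for lang in ancestor_langs:
        if lang is not None:
            return lang
    # 3. Pragma-set default language (from the meta element).
    if pragma_default is not None:
        return pragma_default
    # 4. HTTP Content-Language, but only if exactly one language is given.
    tags = parse_language_list(http_content_language)
    if len(tags) == 1:
        return tags[0]
    # 5. Unknown language: the corresponding tag is the empty string.
    return ""
```

In this sketch, an HTTP header carrying several tags and a missing header both end at step 5, which is the point made earlier: listing multiple languages gives the same result as declaring nothing.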

> Delete the following text:
> ]] This pragma is not exactly equivalent to the HTTP
> Content-Language header, for instance it only supports one language. [[

As explained above, this note is entirely accurate. In practice, the pragma directive in the meta element is in effect functionally the same as the lang attribute, with little in common with the HTTP header. Removing that note would not be useful.

> == Impact ==
> === Positive Effects ===
> 1. More stable: same syntax as before continues to be permitted.

In the face of the above evidence against your proposal, it's not clear that that is a positive effect.

> 2. More permissive: authors, CMS-es and browsers can continue to take
> advantage of @http-equiv ’s ability to reference what the HTTP header
> is/was supposed to be, including replicating its fallback effect.

Given the practical effect of the directive, it has no relevance to what the HTTP header is/was supposed to be.

> 3. More correct: the difference between @lang and Content-Language is
> pointed out, while the link between @http-equiv and HTTP is emphasized.

Wrong again, for reasons explained above.

> 4. More useful: a warning that a fallback feature has kicked in, is
> more useful than a warning which focuses on one of the places where the
> fallback language could potentially kick in from. Why tell authors to
> “use @lang insetad” if the author has already made sure that the @lang
> attribute is in place?

The warning from the validator could be phrased in any way the implementer likes. If the lang attribute is detected, the validator could simply state that the Content-Language pragma is unnecessary. Otherwise, the validator could advise using the lang attribute instead. But this is an implementation decision, and no spec change is needed to attain the desired behaviour in this particular case.

> === Negative Effects ===
> none

Actually, there are negative effects with your change proposal:

1. Perpetuates the myth that the HTTP Content-Language header field and the in-document pragma directive are equivalent, when they are not.
2. Fails to warn against the use of a useless and obsolete feature in all cases.
3. Your proposed replacement text is entirely inappropriate for use in the spec, for the reasons explained above.

IMHO, this now eliminates option #3 from the list I gave at the top of this post, and leaves us with a decision between 2 valid alternatives:

1. Make Content-Language non-conforming.
2. Leave Content-Language as Obsolete but Conforming, permitting only a single value.

The choice between these depends entirely on whether or not the legacy usage of the Content-Language pragma is compelling enough for the spec to bless it as obsolete but conforming. I do not have a strong opinion on this either way.

[1] http://lists.w3.org/Archives/Public/public-html/2008Apr/0556.html
[2] http://lists.w3.org/Archives/Public/public-html/2008Aug/0300.html
[3] http://lists.w3.org/Archives/Public/public-html/2010Apr/0307.html
[4] http://www.w3.org/TR/html401/struct/global.html#h-7.4.4.2
[5] http://www.w3.org/TR/html401/struct/dirlang.html#h-8.1.2
[6] http://lists.w3.org/Archives/Public/public-html/2010Apr/0088.html

-- 
Lachlan Hunt - Opera Software
http://lachy.id.au/
http://www.opera.com/
Received on Friday, 7 May 2010 09:35:24 UTC