- From: Martin J. Duerst <mduerst@ifi.unizh.ch>
- Date: Thu, 9 Jan 1997 14:46:32 +0100 (MET)
- To: Larry Masinter <masinter@parc.xerox.com>
- Cc: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
On Wed, 8 Jan 1997, Larry Masinter wrote:

> # For HTML language tagging
> # (the LANG attribute), we explicitly overruled this (see RFC2070).
> # For HTTP, using a similar overruling would make sense. This would
> # mean that a server would check for "en-us", and if not found, for "en".
>
> Please review section 14.4 of RFC 2068 (HTTP/1.1). I still haven't
> quite understood if anyone thinks this section is wrong or should be
> changed.

Thanks for this hint! This not only gives a (partial) answer to the
problem of language tag matching, but also some indications in other
areas that have been discussed in this thread. I'll take these
issues first.

For Accept-Language, RFC2068 says explicitly what q=0 means:

>>>>
   The language quality factor assigned to a language-tag by the
   Accept-Language field is the quality value of the longest
   language-range in the field that matches the language-tag. If no
   language-range in the field matches the tag, the language quality
   factor assigned is 0. If no Accept-Language header is present in
   the request, the server SHOULD assume that all languages are
   equally acceptable. If an Accept-Language header is present, then
   all languages which are assigned a quality factor greater than 0
   are acceptable.
<<<<

This clearly says that q=0 means NOT ACCEPTABLE. Whether this has to
be interpreted as a special case for Accept-Language, or as an example
of a general principle, is beyond my knowledge of the RFC and its
creation process.

Another very revealing detail is the following:

>>>>
   Note: As intelligibility is highly dependent on the individual
   user, it is recommended that client applications make the choice
   of linguistic preference available to the user. If the choice is
   not made available, then the Accept-Language header field must
   not be given in the request.
<<<<

So the browser that was mentioned in an earlier mail not only has a
design problem, it also ignores the recommendations made here.
I don't think it is useful to demand flexibility for browser behaviour
where the browsers ignore the specs. Otherwise, we could just stop
writing specs altogether.

Now for the question of prefix matching. The RFC indeed defines
prefix matching, very clearly and consistently. But this prefix
matching works only one way:

>>>>
   The Accept-Language request-header field is similar to Accept, but
   restricts the set of natural languages that are preferred as a
   response to the request.

       Accept-Language = "Accept-Language" ":"
                         1#( language-range [ ";" "q" "=" qvalue ] )

       language-range  = ( ( 1*8ALPHA *( "-" 1*8ALPHA ) ) | "*" )

   Each language-range MAY be given an associated quality value which
   represents an estimate of the user's preference for the languages
   specified by that range. The quality value defaults to "q=1". For
   example,

       Accept-Language: da, en-gb;q=0.8, en;q=0.7

   would mean: "I prefer Danish, but will accept British English and
   other types of English." A language-range matches a language-tag
   if it exactly equals the tag, or if it exactly equals a prefix of
   the tag such that the first tag character following the prefix is
   "-". The special range "*", if present in the Accept-Language
   field, matches every tag not matched by any other range present in
   the Accept-Language field.
<<<<

To give an example, we have the following situation:

   Accept-Language     Document         Match?
   language-range      language-tag

   en                  en               YES
   en-us               en-us            YES
   en                  en-us            YES
   en-us               en               NO?!
   en-us               en-uk            NO?!

The idea is that Accept-Language defines language-ranges, whereas the
documents will be tagged exactly. I don't know exactly how the group
arrived at this asymmetry, but I guess the basic thought was that for
documents, it would be clear whether the text was US or British
English (and likewise in other cases), whereas the user would in
general not care much about the difference. Prefixes (ranges) would
therefore be used in Accept-Language, but not in document tags.

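To make the one-way rule concrete, here is a minimal Python sketch of
the matching and quality-factor rules quoted above from RFC 2068. The
function names are invented for illustration and do not come from the
RFC; the header parsing is deliberately simplistic and assumes a
well-formed field value.

```python
def range_matches_tag(language_range: str, language_tag: str) -> bool:
    """One-way match: the range equals the tag, or equals a prefix of
    the tag that is immediately followed by "-". The special range "*"
    is handled by the caller. Comparison is case-insensitive."""
    r = language_range.lower()
    t = language_tag.lower()
    return t == r or t.startswith(r + "-")


def quality_for_tag(accept_language: str, language_tag: str) -> float:
    """Quality value of the longest matching range; "*" applies only
    if no other range matches; 0 if nothing matches at all."""
    best_len = -1
    best_q = 0.0
    star_q = None
    for item in accept_language.split(","):
        parts = [p.strip() for p in item.split(";")]
        language_range = parts[0]
        if not language_range:
            continue
        q = 1.0  # the quality value defaults to q=1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        if language_range == "*":
            star_q = q
        elif (range_matches_tag(language_range, language_tag)
              and len(language_range) > best_len):
            best_len = len(language_range)
            best_q = q
    if best_len >= 0:
        return best_q
    return star_q if star_q is not None else 0.0


header = "da, en-gb;q=0.8, en;q=0.7"
for tag in ("da", "en-gb", "en-us", "fr"):
    print(tag, quality_for_tag(header, tag))
```

With this sketch, a document tagged "en-us" is acceptable to a user
asking for "en", but a document tagged "en" gets quality 0 for a user
asking only for "en-us", which is the asymmetry shown in the table.
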
Several points suggest that the situation is not (or should not be)
as asymmetric as described in the RFC.

- It is rare that both en-us and en-uk documents are prepared, so
  authors don't care about the distinction and just tag them
  with "en".
- In some cases, there may be no actual difference, and it would
  be strange to label a document as en-us if it is just as well
  en-uk.
- Tagging is in many cases done via file names. Something such as
  text.en.html and text.fr.html is preferred to text.en-us.html
  and text.fr-ch.html.
- In many cases, language selections on the browser side are
  connected to locales. These include a lot of details where small
  differences matter, and are therefore fine-grained. I don't think
  Windows or the Mac have something like a "generic English"
  configuration.

So a symmetric solution, with prefix matching on both sides, is
probably highly preferable. In this respect, the HTML solution
(RFC 2070) is not exactly clear, because it only says that language
tags are interpreted hierarchically, and gives one direction of
prefix matching as an example. It does not specify whether the other
direction of prefix matching is also allowed.

Apart from the small terminology problem that "language-range" and
"language-tag" no longer make sense as a way to distinguish the
Accept side from the document side, the following part:

   A language-range matches a language-tag
   if it exactly equals the tag, or if it exactly equals a prefix of
   the tag such that the first tag character following the prefix is
   "-".

probably has to be changed as follows:

   A language-range matches a language-tag
   if a prefix of the language-range matches a prefix of the
   language-tag, such that for both prefixes, the prefix is
   equal to the whole identifier or the first character following
   the prefix is "-".

There is the possibility that this goes too far. In the case of
matching en-us with en-uk, it makes sense.
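
As a rough illustration of the reworded rule, the following Python
sketch treats two identifiers as matching when they share at least one
prefix that ends on a "-" boundary (or is the whole identifier). The
helper names are hypothetical and chosen only for this example.

```python
def boundary_prefixes(identifier: str) -> set:
    """All prefixes of the identifier that are either the whole
    identifier or are followed by "-", compared case-insensitively."""
    subtags = identifier.lower().split("-")
    return {"-".join(subtags[:i]) for i in range(1, len(subtags) + 1)}


def common_prefix_match(language_range: str, language_tag: str) -> bool:
    """Match if a boundary-aligned prefix of one identifier equals a
    boundary-aligned prefix of the other."""
    return bool(boundary_prefixes(language_range)
                & boundary_prefixes(language_tag))


print(common_prefix_match("en-us", "en"))     # range is more specific
print(common_prefix_match("en-us", "en-uk"))  # siblings under "en"
print(common_prefix_match("en", "fr"))        # nothing in common
```

Because only a shared leading component is required, any two
identifiers with the same first subtag match under this wording.
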
But if we consider generic prefixes, such as "x-" for experimental,
it wouldn't make sense to return just any experimentally defined
language merely because the user has specified one particular tag of
that kind. Matching prefixes does not imply that the denoted
languages are mutually intelligible. So an alternative would be:

   A language-range and a language-tag match if they are equal,
   or if a prefix of one of them exactly equals the other,
   such that the first character following the prefix is "-".

Regards,    Martin.
Received on Thursday, 9 January 1997 05:51:05 UTC