Re: A broken browser from Martin J. Duerst on 1997-01-09 (ietf-http-wg@w3.org from January to March 1997)

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Thu, 9 Jan 1997 14:46:32 +0100 (MET)
To: Larry Masinter <masinter@parc.xerox.com>
Cc: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
Message-Id: <Pine.SUN.3.95.970109101229.245C-100000@enoshima>
On Wed, 8 Jan 1997, Larry Masinter wrote:

> # For HTML language tagging
> # (the LANG attribute), we explicitly overruled this (see RFC2070).
> # For HTTP, using a similar overruling would make sense. This would
> # mean that a server would check for "en-us", and if not found, for "en".
> 
> Please review section 14.4 of RFC 2068 (HTTP/1.1). I still haven't
> quite understood if anyone thinks this section is wrong or should be
> changed.

Thanks for this hint! It gives not only a (partial) answer to
the problem of language tag matching, but also some indications
on other issues that have been discussed in this thread.

I'll take these issues first. For Accept-Language, RFC2068 says
explicitly what q=0 means:

>>>>
   The language quality factor assigned to a language-tag by the
   Accept-Language field is the quality value of the longest language-
   range in the field that matches the language-tag. If no language-
   range in the field matches the tag, the language quality factor
   assigned is 0. If no Accept-Language header is present in the
   request, the server SHOULD assume that all languages are equally
   acceptable. If an Accept-Language header is present, then all
   languages which are assigned a quality factor greater than 0 are
   acceptable.
<<<<

This clearly means that q=0 means NOT ACCEPTABLE. Whether this
has to be interpreted as being a special case for Accept-Language,
or an example of a general principle, is beyond my knowledge
of the RFC and its creation process.
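
To make the quoted rule concrete, here is a minimal sketch in Python of
how a server might compute the language quality factor. It is not taken
from the RFC or from any real server. The function names (range_matches,
parse_accept_language, quality) are made up for illustration, the
matching test is the one defined in the same section 14.4, the value
returned when no header is present is an arbitrary choice, and the
lowercasing assumes that tags compare case-insensitively.

```python
def range_matches(language_range, language_tag):
    """Section 14.4 test: equal, or the range is a prefix of the tag followed by '-'."""
    r, t = language_range.lower(), language_tag.lower()
    return r == t or t.startswith(r + "-")


def parse_accept_language(header_value):
    """Split a header value into (language-range, q) pairs; q defaults to 1.0."""
    ranges = []
    for item in header_value.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue  # skip empty list elements
        q = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        ranges.append((parts[0], q))
    return ranges


def quality(header_value, language_tag):
    """Quality factor for one tag: the q of the longest matching range, else 0."""
    if header_value is None:
        # No header: all languages are equally acceptable.
        # Returning 1.0 for that case is an assumption of this sketch.
        return 1.0
    best_len, best_q, star_q = -1, 0.0, None
    for rng, q in parse_accept_language(header_value):
        if rng == "*":
            star_q = q  # only used if no other range matches
        elif range_matches(rng, language_tag) and len(rng) > best_len:
            best_len, best_q = len(rng), q
    if best_len >= 0:
        return best_q
    return star_q if star_q is not None else 0.0


header = "da, en-gb;q=0.8, en;q=0.7"
for tag in ("da", "en-gb", "en-us", "fr"):
    verdict = "acceptable" if quality(header, tag) > 0 else "NOT ACCEPTABLE"
    print(tag, quality(header, tag), verdict)
```

With this header, "fr" is matched by no range, gets quality 0, and is
therefore not acceptable, the same as if "fr;q=0" had been given
explicitly.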


Another very revealing detail is the following:

>>>>
     Note: As intelligibility is highly dependent on the individual
     user, it is recommended that client applications make the choice of
     linguistic preference available to the user. If the choice is not
     made available, then the Accept-Language header field must not be
     given in the request.
<<<<

So the browser that was mentioned in an earlier mail not only
has a design problem, it also ignores the recommendations
made here. I don't think it is useful to demand flexibility
for browser behaviour where the browsers ignore the specs.
Otherwise, we could just stop writing specs at all.


Now for the question of prefix matching. The RFC indeed defines
prefix matching, very clearly and consistently. But this prefix
matching works only one way:

>>>>
   The Accept-Language request-header field is similar to Accept, but
   restricts the set of natural languages that are preferred as a
   response to the request.

          Accept-Language = "Accept-Language" ":"
                            1#( language-range [ ";" "q" "=" qvalue ] )

          language-range  = ( ( 1*8ALPHA *( "-" 1*8ALPHA ) ) | "*" )

   Each language-range MAY be given an associated quality value which
   represents an estimate of the user's preference for the languages
   specified by that range. The quality value defaults to "q=1". For
   example,

          Accept-Language: da, en-gb;q=0.8, en;q=0.7

   would mean: "I prefer Danish, but will accept British English and
   other types of English." A language-range matches a language-tag if
   it exactly equals the tag, or if it exactly equals a prefix of the
   tag such that the first tag character following the prefix is "-".
   The special range "*", if present in the Accept-Language field,
   matches every tag not matched by any other range present in the
   Accept-Language field.
<<<<

To give an example, we have the following situation:

Accept-Language      Document        Match?
language-range       language-tag

en                   en              YES
en-us                en-us           YES
en                   en-us           YES
en-us                en              NO?!
en-us                en-uk           NO?!
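
The table can be reproduced with a few lines of Python. This is only a
sketch of the one-way rule quoted above; the function name
range_matches_tag is made up, the "*" range is left out, and the
lowercasing assumes that tags compare case-insensitively.

```python
def range_matches_tag(language_range, language_tag):
    """One-way rule: equal, or the range is a prefix of the tag followed by '-'."""
    r, t = language_range.lower(), language_tag.lower()
    return r == t or t.startswith(r + "-")


# (language-range from Accept-Language, language-tag of the document)
cases = [
    ("en", "en"),
    ("en-us", "en-us"),
    ("en", "en-us"),
    ("en-us", "en"),
    ("en-us", "en-uk"),
]

for language_range, language_tag in cases:
    verdict = "YES" if range_matches_tag(language_range, language_tag) else "NO"
    print(f"{language_range:<21}{language_tag:<16}{verdict}")
```

Swapping the two arguments changes the result for the "en" / "en-us"
pair, which is exactly the asymmetry discussed here.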


The idea is that Accept-Language defines language-ranges,
whereas the documents will be tagged exactly. I don't know
exactly how the group arrived at this asymmetry, but I
guess the basic thought was that for documents, it would
be clear whether it was US or British English (and
likewise in other cases), whereas the user would in
general not care much about the difference. Prefixes
(ranges) would therefore be used in Accept-Language, but
not in document tags.

Several points suggest that the situation is not (or should
not be) as asymmetric as described in the RFC.

- Documents are rarely prepared in both en-us and en-uk,
	and thus the authors don't care about distinguishing
	and just tag them with "en".
- In some cases, there may be no actual difference, and it
	would be strange to label a document as en-us if it
	is just as well en-uk.
- Tagging is in many cases done via file names. Something
	such as text.en.html and text.fr.html is preferred
	to text.en-us.html and text.fr-ch.html.
- In many cases, language selections on the browser side
	are connected to locales. These include a lot of
	details where small differences matter, and are
	therefore fine-grained. I don't think Windows
	or the Mac have something like a "generic English"
	configuration.

So probably, a symmetric solution, with prefix matching on
both sides, is highly preferable. In this respect, the
HTML solution (RFC 2070) is not exactly clear, because it
only says that language tags are interpreted hierarchically,
and gives one way of prefix matching as an example. It is
not specified whether the other way of prefix matching is
also allowed, or not.



Apart from the small terminology problem that "language-range"
and "language-tag" no longer make sense as a way to distinguish
the Accept side from the document side, the following part:

   A language-range matches a language-tag if
   it exactly equals the tag, or if it exactly equals a prefix of the
   tag such that the first tag character following the prefix is "-".

probably has to be changed as follows:

   A language-range matches a language-tag if a prefix of the
   language-range matches a prefix of the language-tag, such that
   for both prefixes, the prefix is equal to the whole identifier
   or the first character following the prefix is "-".

There is the possibility that this goes too far. In the case of
matching en-us with en-uk, it makes sense. But if we consider
generic prefixes, such as "x-" for experimental, it wouldn't
make sense to just return any kind of experimentally defined
language just because the user has specified one particular
tag of that kind. Matching of prefixes does not imply that
the denoted languages are mutually intelligible. So an alternative
would be:

   A language-range and a language-tag match if they are equal, or
   if a prefix of one of them exactly equals the other, such that
   the first character following the prefix is "-".
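
The difference between the two proposed wordings can be seen in a small
Python sketch. Both functions are one possible reading of the text
above, not an agreed definition: the names shared_prefix_match and
either_prefix_match are made up, and "x-klingon" and "x-piglatin" are
hypothetical experimental tags used only as an example.

```python
def shared_prefix_match(a, b):
    """First wording: some prefix of each, ending at a '-' boundary, is equal.

    Any such common prefix includes the first subtag, so comparing
    the first subtags is sufficient.
    """
    return a.lower().split("-")[0] == b.lower().split("-")[0]


def either_prefix_match(a, b):
    """Alternative wording: equal, or one is a prefix of the other followed by '-'."""
    a, b = a.lower(), b.lower()
    return a == b or a.startswith(b + "-") or b.startswith(a + "-")


pairs = [
    ("en", "en-us"),
    ("en-us", "en"),
    ("en-us", "en-uk"),
    ("x-klingon", "x-piglatin"),  # hypothetical experimental tags
]

for a, b in pairs:
    print(f"{a:<12}{b:<12}"
          f"first: {shared_prefix_match(a, b)!s:<7}"
          f"alternative: {either_prefix_match(a, b)}")
```

Under this reading, both wordings are symmetric and both match "en-us"
with "en". Only the first one matches "en-us" with "en-uk", and it also
matches the two unrelated "x-" tags; the alternative matches neither
pair.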



Regards,	Martin.
Received on Thursday, 9 January 1997 05:51:05 UTC