re: prefer-language tag from Martin J. Duerst on 1998-02-23 (ietf-charsets@w3.org from January to March 1998)

From: Martin J. Duerst <duerst@w3.org>
Date: Mon, 23 Feb 1998 17:46:11 +0900
To: Mark Crispin <MRC@CAC.Washington.EDU>, Marc Blanchet <Marc.Blanchet@viagenie.qc.ca>
Cc: ietf-languages@apps.ietf.org, ietf-charsets@INNOSOFT.COM
Message-id: <199802230900.SAA13080@sh.w3.mag.keio.ac.jp>
At 14:16 98/02/18 -0800, Mark Crispin wrote:

> On Wed, 18 Feb 1998 17:04:10 -0500, Marc Blanchet wrote:


> > Well, I don't think this is an issue in my draft: yes this problem is
> > difficult (technically speaking), but it has been discussed in RFC 1766,
> > which my draft is referring to. I think this dialect issue is more related
> > to RFC 1766 than my draft.
> 
> Unfortunately, it has to be considered anew with your concept since new issues
> are raised.  RFC 1766 simply labels data; it does not apply user preferences.
> So there's no need to worry about how the dialects interact.

Well, RFC 1766 defines language tags and one application for them, namely
the Content-Language header for email. Just because that header simply
labels data, it does not mean that RFC 1766 is limited to labeling data.
But RFC 1766 says that labels are to be considered as indivisible tokens,
which means that if you apply it directly, there is no "dialect processing"
at all.

That's why starting with RFC 2070 (HTML i18n), and going on to HTML 4.0,
XML, CSS2, and HTTP 1.1, this has been changed to allow some degree of
"dialect processing".


> I think that you should consider the question of being able to convey multiple
> tag/value sets within the same token.  It is, for example, extremely common to
> want to establish language, locale, and possibly also culture at the same
> time.  A tagging architecture can be a benefit over commands if it can do
> multiple tasks at a time.

Such things have been investigated in various ways for HTTP. In addition to
language, issues such as desired formats, screen size, and availability of
color have been or are being worked on. Some simple things work rather well;
others are highly experimental or questionable.

For the issues discussed here, HTTP is of course much simpler than mail:
there is much less of a layer distinction, for example.


> In the case of this particular tag, you absolutely must detail how it
> interacts with dialects.  As an implementor, I insist upon it.  Without a
> precise specification to fall back on, implementors are left to guess, and
> that leads to user confusion and anger.

Yes, very true. Here HTTP can provide quite a bit of help. Interestingly
enough, it specifies exactly the same solution as Mark is proposing:

Match requested short tag (e.g. fr) against available long tag (fr-ca),
but not the other way round.

> Here's what I think the behavior should be:
> 1) If the user requests a "generic" form of the language, it will match either
>    a server's "generic" form or a dialect of the server's choosing.
> 2) If the user requests a specific dialect of the language, it will match
>    either that dialect on the server or a generic form offered by the server,
>    but *NOT* any other dialect.


There are basically four possible solutions:

(1) Only exact match

(2) Match short request against long available

(3) Match long request against short available

(4) Match everything with the same prefix

Solution (4) does not work because of cases such as x-, i-, and zh-.
Solution (1) needs a lot of foresight on the part of the user,
or a lot of restriction on the part of the data provider.
Solution (3) has similar problems to (1). Solution (2) seems to offer
the most in terms of short transfers and available matching
functionality, while preserving the will of the user.
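
As a rough illustration of the four options, here is a minimal Python
sketch. The function names are made up for this example, the comparison
is assumed to be case-insensitive, and "x-foo"/"x-bar" are invented
private-use tags; none of this comes from a spec.

```python
def _norm(tag):
    # Assumption: tags are compared case-insensitively.
    return tag.strip().lower()


def exact_match(requested, available):
    # Solution (1): only identical tags match.
    return _norm(requested) == _norm(available)


def short_request_matches_long(requested, available):
    # Solution (2): a requested "fr" matches an available "fr-ca",
    # but a requested "fr-ca" does not match an available "fr".
    r, a = _norm(requested), _norm(available)
    return a == r or a.startswith(r + "-")


def long_request_matches_short(requested, available):
    # Solution (3): the reverse direction of (2).
    r, a = _norm(requested), _norm(available)
    return r == a or r.startswith(a + "-")


def same_prefix_match(requested, available):
    # Solution (4): anything sharing the first subtag matches.
    return _norm(requested).split("-")[0] == _norm(available).split("-")[0]


if __name__ == "__main__":
    pairs = [("fr", "fr-ca"), ("fr-ca", "fr"),
             ("fr-fr", "fr-ca"), ("x-foo", "x-bar")]
    for req, avail in pairs:
        print(req, avail,
              exact_match(req, avail),
              short_request_matches_long(req, avail),
              long_request_matches_short(req, avail),
              same_prefix_match(req, avail))
```

The last pair shows the problem with (4): two unrelated private-use
tags would be treated as matching just because both start with "x".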


Below is the relevant text from the newest HTTP 1.1 draft:

>>>>>>>>
   14.4 Accept-Language

       The Accept-Language request-header field is similar to Accept, but
       restricts the set of natural languages that are preferred as a response
       to the request.

              Accept-Language = "Accept-Language" ":"
                                1#( language-range [ ";" "q" "=" qvalue ] )

              language-range  = ( ( 1*8ALPHA *( "-" 1*8ALPHA ) ) | "*" )

       Each language-range MAY be given an associated quality value which
       represents an estimate of the user's preference for the languages
       specified by that range. The quality value defaults to "q=1". For
       example,

              Accept-Language: da, en-gb;q=0.8, en;q=0.7

       would mean: "I prefer Danish, but will accept British English and other
       types of English." A language-range matches a language-tag if it exactly
       equals the tag, or if it exactly equals a prefix of the tag such that
       the first tag character following the prefix is "-". The special range
       "*", if present in the Accept-Language field, matches every tag not
       matched by any other range present in the Accept-Language field.

         Note: This use of a prefix matching rule does not imply that
         language tags are assigned to languages in such a way that it is
         always true that if a user understands a language with a certain
         tag, then this user will also understand all languages with tags
         for which this tag is a prefix. The prefix rule simply allows the
         use of prefix tags if this is the case.

       The language quality factor assigned to a language-tag by the Accept-
       Language field is the quality value of the longest language-range in the
       field that matches the language-tag. If no language-range in the field
       matches the tag, the language quality factor assigned is 0. If no
       Accept-Language header is present in the request, the server SHOULD
       assume that all languages are equally acceptable. If an Accept-Language
       header is present, then all languages which are assigned a quality
       factor greater than 0 are acceptable.

       It may be contrary to the privacy expectations of the user to send an
       Accept-Language header with the complete linguistic preferences of the
       user in every request. For a discussion of this issue, see section 15.6.

         Note: As intelligibility is highly dependent on the individual
         user, it is recommended that client applications make the choice of
         linguistic preference available to the user. If the choice is not
         made available, then the Accept-Language header field must not be
         given in the request.

         Note: When making the choice of linguistic preference available to
         the user, implementors should take into account the fact that users
         are not familiar with the details of language matching as described
         above, and should provide appropriate guidance. As an example,
         users may assume that on selecting "en-gb", they will be served any
         kind of English document if British English is not available. A
         user agent may suggest in such a case to add "en" to get the best
         matching behaviour.
<<<<<<<<
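
To make the matching and "longest range wins" rules in the quoted text
concrete, here is a small Python sketch. It is only an illustration
under simplifying assumptions: the parser is lenient rather than a
strict implementation of the grammar, comparison is case-insensitive,
and the function names are invented for this example.

```python
def parse_accept_language(value):
    """Split a header value into a list of (language-range, q) pairs."""
    ranges = []
    for item in value.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        q = 1.0  # the quality value defaults to q=1
        for param in parts[1:]:
            name, _, val = param.partition("=")
            if name.strip().lower() == "q":
                q = float(val)
        ranges.append((parts[0].lower(), q))
    return ranges


def range_matches(lang_range, tag):
    # A range matches if it equals the tag, or equals a prefix of the
    # tag that is immediately followed by "-".
    tag = tag.lower()
    return tag == lang_range or tag.startswith(lang_range + "-")


def quality(ranges, tag):
    """Quality factor assigned to a tag by the parsed ranges."""
    best = None
    for lang_range, q in ranges:
        if lang_range != "*" and range_matches(lang_range, tag):
            # The longest matching range decides the quality.
            if best is None or len(lang_range) > len(best[0]):
                best = (lang_range, q)
    if best is not None:
        return best[1]
    for lang_range, q in ranges:
        if lang_range == "*":
            # "*" covers every tag not matched by another range.
            return q
    return 0.0


if __name__ == "__main__":
    prefs = parse_accept_language("da, en-gb;q=0.8, en;q=0.7")
    for tag in ("da", "en-gb", "en-us", "en", "fr"):
        print(tag, quality(prefs, tag))
```

With the example header from the draft, "en-gb" gets 0.8 because the
longer range "en-gb" wins over "en", while "en-us" falls back to the
0.7 of "en", and "fr" gets 0.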

Please also have a look at the last note in the draft text quoted above.
This is a UI issue, but one that is not obvious to implementors or users.
It helps to avoid the "surprise" cases that Mark listed:


> Expressed as a table, we have the following (including a couple of surprises):
> 
> 		    What appears in the tag:
>                 FR-CA,FR,EN FR-CA,EN    FR-FR       FR
> Server has:     ----------- --------    -----       -----
> FR-CA,FR-FR,EN  FR-CA       FR-CA       FR-FR       FR-CA
> FR-CA,FR,EN     FR-CA       FR-CA       FR          FR
> FR-FR,FR,EN     FR          FR          FR-FR       FR
> FR-CA,EN        FR-CA       FR-CA       EN (!)      FR-CA
> FR-FR,EN        FR-FR       EN (!)      FR-FR       FR-FR
> FR,EN           FR          FR          FR          FR
> EN              EN          EN          i-default   i-default
> 
> The surprises, marked by "(!)", came about because the user requested a
> dialect that the server did not have, and the server did not offer a generic
> form.
> 
> But, although this is technically the most reasonable and flexible answer, it
> is not immediately obvious to anyone.  In fact, the behavior appears wrong at
> first glance.
> 
> That's why it has to be specified.  Or there will be user confusion and anger.

Exactly. HTTP already specifies it this way, and it would be nice if others
would follow. The "q" factors of HTTP are not needed here; they exist because
HTTP has to serve long documents whose translation quality is assumed to
vary widely. For error messages, it is probably safe to assume a relatively
constant translation quality.


> It also leads to the conclusion that servers SHOULD offer a generic form for
> all the languages they offer.  That would eliminate the two rows in which there
> are surprises.  Clients can also avoid the surprise by always requesting the
> generic form as well.

Yes, these are the two solutions for avoiding surprises. Both work. Only
one is needed. It helps if the spec says which one. Having clients
try to avoid surprises is slightly better, because it also handles cases
where for some users, the general solution works, whereas for others,
it doesn't. A typical case here is Chinese. zh-cn usually denotes
simplified script, and zh-tw denotes traditional script. For some
readers, it doesn't make much of a difference (although they prefer
one way over the other). Others are only used to one or the other
variant.
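
The client-side approach could look roughly like the following Python
sketch. The function name and the `no_fallback` parameter are
hypothetical; the idea is only that a user agent adds the generic form
after the last dialect of a language unless the user has said that the
generic form does not work for them (the Chinese case above).

```python
def add_generic_forms(preferences, no_fallback=()):
    """Return the preference list with generic tags added.

    preferences: ordered list of tags, most preferred first.
    no_fallback: primary subtags (e.g. "zh") for which this user does
                 not want the generic form added.
    """
    result = list(preferences)
    lowered = [t.lower() for t in result]
    skip = {"x", "i"} | {s.lower() for s in no_fallback}
    for tag in preferences:
        primary = tag.lower().split("-")[0]
        if "-" not in tag or primary in skip:
            continue  # already generic, or no meaningful generic form
        if primary in lowered:
            continue  # the user already listed the generic form
        # Place the generic form right after the last dialect of the
        # same language, so it never outranks a listed dialect.
        last = max(i for i, t in enumerate(lowered)
                   if t.split("-")[0] == primary)
        result.insert(last + 1, primary)
        lowered.insert(last + 1, primary)
    return result


if __name__ == "__main__":
    print(add_generic_forms(["fr-ca", "en"]))
    print(add_generic_forms(["zh-tw", "en"], no_fallback=["zh"]))
```

For a user who listed "fr-ca" and "en", this yields "fr-ca", "fr",
"en"; for a reader who is only used to traditional script, "zh-tw" is
left alone.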


Regards,   Martin.


Received on Wednesday, 25 February 1998 00:39:05 UTC