Re: Question on implementation of the language property from Éric Bischoff on 2002-07-24 (www-xsl-fo@w3.org from July 2002)

From: Éric Bischoff <e.bischoff@noos.fr>
Date: Wed, 24 Jul 2002 02:02:33 +0200
To: Paul Grosso <pgrosso@arbortext.com>, www-xsl-fo@w3.org
Message-Id: <200207240202.33117.e.bischoff@noos.fr>
On Wednesday 24 July 2002 01:28, Éric Bischoff wrote:
> On Tuesday 23 July 2002 23:05, Paul Grosso wrote:
> > The XSL FO subgroup has discussed an issue regarding the allowable
> > values of the language property [1] (see [2] for the comment).
> >
> > At least some WG members believe the XSL spec should require
> > the use of 3 character codes for the language property as these
> > are clear and unambiguous (at least if Terminology values
> > are used when there is a conflict between those and the
> > Bibliographic values as is required by RFC 3066 [2]).
> >
> > Others believe 2 character values (allowed by RFC 3066 and
> > allowed as values for the xml:lang shorthand [4]) should also
> > be allowed as values for the language property.
> >
> > Note that, in any case, both 2 and 3 character values are
> > allowed for values of the xml:lang shorthand--that is not
> > in question (since it is defined by the XML 1.0 spec [5]).

Okay, I've found the number of the RFC that says which code to use. It happens 
to be the very same RFC 3066 that the XSL-FO specification references!

_______________________________________________________
   2. When a language has both an ISO 639-1 2-character code and an ISO
      639-2 3-character code, you MUST use the tag derived from the ISO
      639-1 2-character code.
_______________________________________________________

I was wrong when I said in my previous message that RFC 3066 gives no 
preference for one encoding or the other. Sorry for that.

So the reasoning is unambiguous:
- The XSL-FO specification relies on RFC 3066
- RFC 3066 gives the rule for choosing between 2-letter codes and 3-letter 
codes (if you have the choice, use the 2-letter code)
- So documents conforming to XSL-FO should respect that rule
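
To make the rule concrete, here is a minimal Python sketch of how a processor 
might apply it. The table is a tiny illustrative subset and the function name 
preferred_language_code is made up for this example; a real implementation 
would need the full ISO 639-2 registry.

```python
# Illustrative subset only: a real table would cover the whole ISO 639-2 list.
# Both the Bibliographic ("ger", "fre") and the Terminology ("deu", "fra")
# forms map to the same ISO 639-1 2-letter code.
ISO_639_2_TO_1 = {
    "deu": "de", "ger": "de",
    "fra": "fr", "fre": "fr",
    "eng": "en",
}


def preferred_language_code(code: str) -> str:
    """Return the 2-letter code when one is known, else the code unchanged.

    Hypothetical helper: it only applies the "use the 2-letter code if it
    exists" rule and does not validate the input.
    """
    lowered = code.lower()
    return ISO_639_2_TO_1.get(lowered, lowered)


print(preferred_language_code("ger"))  # has a 2-letter equivalent
print(preferred_language_code("haw"))  # Hawaiian has no 2-letter code
```

With this table, "ger" and "deu" both come back as "de", while "haw" is 
returned unchanged because there is nothing shorter to prefer.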

As I pointed out in my previous message, this rule is idiotic because it 
creates a "de facto" mixture of two code sets instead of keeping them separate. 
But as I also pointed out, very large projects like the KDE project 
have chosen to respect that rule.

Personally, however, I would allow some tolerance and accept codes like "deu" 
and "ger", even if "de" exists. I would even allow very common constructs not 
allowed by RFC 3066:

	"fr_FR" instead of "fr-FR"
	"de-DE@euro" ('@' sign is normally illegal)
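
As a rough sketch of what such tolerance could look like, the Python function 
below maps both constructs onto the RFC 3066 form. The name normalize_lenient 
and the decision to simply drop the "@" modifier are assumptions made for this 
example, not something RFC 3066 or the XSL-FO specification prescribes.

```python
def normalize_lenient(tag: str) -> str:
    """Map common non-conforming tags onto the "language-COUNTRY" form.

    Hypothetical helper, assuming that:
    - a POSIX-style modifier such as "@euro" can be discarded,
    - "_" should be read as "-",
    - only the language and country subtags matter (anything after the
      second subtag is dropped).
    """
    # Drop everything from the first "@" onwards ("de-DE@euro" -> "de-DE").
    tag = tag.split("@", 1)[0]
    # Treat "_" as the separator ("fr_FR" -> "fr-FR").
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    if len(parts) > 1 and parts[1]:
        return f"{language}-{parts[1].upper()}"
    return language


for raw in ("fr_FR", "de-DE@euro", "EN"):
    print(raw, "->", normalize_lenient(raw))
```

Discarding the modifier loses information, so an implementation that cares 
about it would have to keep it somewhere else.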

> > If a given implementation accepts 2 character values (e.g., "EN"),
> > how are they interpreted (e.g., does "EN" mean US english,
> > British english, or something else)?
>
> I believe that this particular point is covered by RFC 3066, which is
> referenced from the XSL-FO specification:
> 	"en" = English
> 	"en-GB" = British English
> 	"en-US" = American English
>
> Same for 3-letter codes:
> 	"eng"
> 	"eng-GB"
> 	"eng-US"
>
> It's independent of the length of the code ;-). The first, mandatory part is
> the language as defined in ISO-639-1 or -2; the second, optional part is
> the country code as defined in ISO-3166.

Also, if you ask me "Does 'en' resolve to 'en-GB' or 'en-US'?" I would answer: 
it looks like an implementation choice, or could be parametrized. After all, 
when we speak about "English", do we refer to "British English" or to 
"American English"? There seems to be no easy answer to that question. One 
could even imagine hyphenation dictionaries permitting local variants like:

	honor
	honour
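
One way to picture the "parametrized" option is a per-language default country 
that the implementation or the user can override. The Python sketch below is 
only an illustration: DEFAULT_COUNTRY, resolve_country and the choice of "US" 
as the default for "en" are all invented for this example.

```python
from typing import Mapping, Optional

# Hypothetical built-in default; another implementation could pick "GB".
DEFAULT_COUNTRY = {"en": "US"}


def resolve_country(tag: str,
                    defaults: Optional[Mapping[str, str]] = None) -> str:
    """Add a default country to a bare language tag, if one is configured.

    Tags that already carry a country ("en-GB") are returned unchanged, and
    languages without a configured default stay bare ("fr").
    """
    if defaults is None:
        defaults = DEFAULT_COUNTRY
    language, _, country = tag.partition("-")
    if country:
        return tag
    default = defaults.get(language)
    return f"{language}-{default}" if default else language


print(resolve_country("en"))                # uses the built-in default
print(resolve_country("en", {"en": "GB"}))  # overridden by a parameter
print(resolve_country("en-GB"))             # explicit country wins
```

A hyphenation dictionary could then be looked up under the resolved tag, so 
that "honor" and "honour" each live in their own variant.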

-- 
Éric Bischoff
Received on Tuesday, 23 July 2002 20:01:59 UTC