Re: Constraining xml:lang - Catch 22 from Jack Lindsey on 2003-12-19 (xmlschema-dev@w3.org from December 2003)

From: Jack Lindsey <tuquenukem@hotmail.com>
Date: Fri, 19 Dec 2003 13:29:54 -0500
To: ht@cogsci.ed.ac.uk
Cc: xmlschema-dev@w3.org
Message-ID: <Law15-F98XrJrozCsVI000033e1@hotmail.com>
>From: ht@cogsci.ed.ac.uk (Henry S. Thompson)
>Just what are you trying to rule out?  The regulatory situation
>regarding language codes, as spelled out in RFC 3066 [1], is
>sufficiently complicated that the lexical space constraint given in
>the schema REC (as amended) for the xs:language type [2] is really the
>strictest it's practical to enforce.  With IANA having registered
>e.g. cel-gaulish and de-AT-1901 as legal tags, there's really not much
>we can do here.
>
> > I love this, from "http://www.w3.org/2001/xml.xsd"
>
><snip/>
>
>The comment will be removed when the above-cited erratum is formally
>incorporated in the 2nd edition of the Schema REC.
>
>ht
>
>[1] http://www.ietf.org/rfc/rfc3066.txt
>[2] http://www.w3.org/2001/05/xmlschema-errata#e2-25

Understood.

We have just made a limited publication of an XML vocabulary for 
standardized data exchange within a government sector in an officially 
bilingual jurisdiction.

Let me take this opportunity to thank the participants of this list for all 
the help they have both consciously and unwittingly given me over the last 
year.  In particular, I would like to thank Henry and Jeni for their 
invaluable advice which is much appreciated and much implemented (I do not 
yet have permission to publish a link).

We have decreed that text generated by our current partners (as opposed to 
obtained from external sources) should use the values:

en-CA		(Canadian English)
fr-CA		(Canadian French)

Other potential values might in future include:

en-GB		(British English)
en-US		(American English)
es-MX		(Mexican Spanish)
iu		(Inuktitut - would require a Unicode encoding such as UTF-8 or UTF-16)

This is for all the usual, anticipated reasons: page readers, translation 
software and character set rendering.  But in addition, we make extensive 
use of coded information, specified either as terse, language-neutral values 
or language-specific texts, for which we are going to provide "code table 
lookup" facilities.  In ISO 11179-3 terminology, these are cross-references 
between permissible value instances of related value domains, e.g. from ISO 
3166 Country Code (3-digit numeric) to Country Short Name in English or 
Country Short Name in French.  (Actually, the majority of our codes are 
home-grown.)
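
To make the "code table lookup" idea concrete, here is a minimal sketch in 
Python rather than in the XSLT we actually use.  The table name, the function 
name and the three sample rows are assumptions made up for illustration; the 
names shown are plain everyday names, not official ISO 3166 short names.

```python
# Hypothetical code table: ISO 3166 numeric country code -> names per language.
# The rows and names are illustrative only, not taken from our vocabulary.
COUNTRY_NAMES = {
    "124": {"en": "Canada", "fr": "Canada"},
    "250": {"en": "France", "fr": "France"},
    "484": {"en": "Mexico", "fr": "Mexique"},
}


def country_name(numeric_code, language):
    """Cross-reference a 3-digit numeric code to a name in "en" or "fr".

    Returns None when the code is not in the table.
    """
    entry = COUNTRY_NAMES.get(numeric_code)
    if entry is None:
        return None
    return entry[language]


print(country_name("484", "fr"))
```

A home-grown code list would slot into the same shape: one row per code, one 
text per official language.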

For this purpose, only en-CA and fr-CA (the official languages) are 
relevant, but we did not want to use multiple language identification 
techniques, especially since in practice these are probably the only 
languages which will show up in any context for the first few years.

So, since in the near future the format of xml:lang values will be 
validated, I imagine (thinking aloud - comments welcome) I will let page 
readers, translation software, etc. make what they will of any values 
that show up.  But in my table-lookup XSLT templates I will treat "en", or 
anything beginning with "en-", as English (case-insensitively), treat "fr" 
and anything beginning with "fr-" as French in the same way, and default to 
English for anything else.

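The interpretation rule for the table lookups can be sketched as follows.  
This is Python, not the XSLT templates themselves, and the function name is 
hypothetical; it only illustrates the intended matching, assuming a missing 
or empty xml:lang should also fall back to English.

```python
def lookup_language(xml_lang):
    """Reduce an xml:lang value to "en" or "fr" for code table lookup.

    "fr" or anything beginning "fr-" (any case) is French.  "en", anything
    beginning "en-", and every other value (including None) is English.
    """
    tag = (xml_lang or "").strip().lower()
    if tag == "fr" or tag.startswith("fr-"):
        return "fr"
    # "en", "en-*" and all remaining values share the English default.
    return "en"


for value in ("en-CA", "FR-ca", "es-MX", "iu"):
    print(value, "->", lookup_language(value))
```

Note that a value such as "fra" is not treated as French here, because it 
is neither "fr" nor prefixed with "fr-".
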
Cheers Jack

Received on Friday, 19 December 2003 13:37:53 UTC