- From: <bugzilla@wiggum.w3.org>
- Date: Wed, 19 Sep 2007 05:50:24 +0000
- To: www-xml-schema-comments@w3.org
- CC:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=3079

fsasaki@w3.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fsasaki@w3.org

------- Comment #2 from fsasaki@w3.org  2007-09-19 05:50 -------

Hello Michael,

We discussed this issue at http://www.w3.org/2007/09/18-core-minutes#item07 . We would like to propose that you use the ABNF defined in RFC 4646. This ABNF is stable. The updates of BCP 47 (which will lead to a new RFC obsoleting RFC 4646) are only about the adoption of certain values for the extlang subtag, see http://tools.ietf.org/html/rfc4646#section-2.2.2 and the charter of the LTRU WG at http://www.ietf.org/html.charters/ltru-charter.html .

Mainly in terms of references, I would propose the following changes in sec. 3.4.3:

/START proposal sec. 3.4.3/

[Definition:] language represents formal natural language identifiers, as defined by [BCP 47]. The value space and lexical space of language are the set of all strings that conform to the ABNF (here RFC 4646 grammar). This is the set of strings accepted by the grammar given in [RFC 4646], the RFC which currently represents [BCP 47]. The base type of language is token.

Note: The regular expression above provides the only normative constraint on the lexical and value spaces of this type. The additional constraints imposed on language identifiers by [BCP 47], and in particular their requirement that language codes be registered with IANA or ISO if not given in ISO 639, are not part of this datatype as defined here.

Note: [BCP 47] specifies that language tags and sub tags "are to be treated as case insensitive: there exist conventions for the capitalization of some of the subtags, but these MUST NOT be taken to carry meaning." For instance, [ISO 3166] recommends that country codes are capitalized (MN Mongolia), while [ISO 639] recommends that language codes are written in lower case (mn Mongolian).
Since the language datatype is derived from string, it inherits from string a one-to-one mapping from lexical representations to values. The literals 'MN' and 'mn' therefore correspond to distinct values and have distinct canonical forms. Users of this specification should be aware of this fact, the consequence of which is that the case-insensitive treatment of language values prescribed by [BCP 47] does not follow from the definition of this datatype given here; applications which require case-insensitivity should make appropriate adjustments.

/END proposal sec. 3.4.3/

Since the RFC 3066 ABNF was rather lax and users were not punished for producing useless language tags (like "English-England"), we see the danger that the more restrictive grammar of RFC 4646 leads to more useful, but unexpected results. To make people aware of this situation, I would propose the following note as a health warning, with a non-normative reference to RFC 3066:

"The ABNF defined in the predecessor of RFC 4646, RFC 3066, was rather lax. Users were not punished for producing ABNF-compliant, but otherwise useless language tags. In contrast, the more restrictive grammar in RFC 4646 is more appropriate for creating language tags. However, users need to be warned that due to the lax ABNF of RFC 3066, they might get unexpected results when processing legacy data."

HTH,

Felix
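
As an illustration of the two notes above, here is a minimal Python sketch. The function names are hypothetical and not taken from any specification. The pattern follows only the lax RFC 3066 shape (a primary subtag of 1 to 8 letters, followed by subtags of 1 to 8 letters or digits); it does not implement the RFC 4646 ABNF, and it performs no registry lookup.

```python
import re

# Assumption: this mirrors the lax RFC 3066 shape only, i.e.
# Primary-subtag = 1*8ALPHA, Subtag = 1*8(ALPHA / DIGIT).
RFC3066_SHAPE = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def is_rfc3066_shaped(tag: str) -> bool:
    """Return True if the whole string has the lax RFC 3066 shape."""
    return RFC3066_SHAPE.fullmatch(tag) is not None


def same_language_tag(a: str, b: str) -> bool:
    """Compare two tags case-insensitively, as BCP 47 prescribes.

    The datatype itself treats 'MN' and 'mn' as distinct values, so an
    application has to add a comparison like this one on its own.
    """
    return a.lower() == b.lower()
```

With these definitions, is_rfc3066_shaped("English-England") is true even though the tag is useless in practice, and same_language_tag("MN", "mn") is true although the two literals are distinct values of the datatype.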
Received on Wednesday, 19 September 2007 05:50:29 UTC