
[Bug 3079] RFC3066 ref

From: <bugzilla@wiggum.w3.org>
Date: Wed, 19 Sep 2007 05:50:24 +0000
To: www-xml-schema-comments@w3.org
Message-Id: <E1IXsS8-0004pg-SZ@wiggum.w3.org>


fsasaki@w3.org changed:

           What    |Removed                     |Added
                 CC|                            |fsasaki@w3.org

------- Comment #2 from fsasaki@w3.org  2007-09-19 05:50 -------
Hello Michael,
We discussed this issue at http://www.w3.org/2007/09/18-core-minutes#item07 .
We would like to propose that you use the ABNF defined in RFC 4646. This ABNF
is stable. The updates of BCP 47 (which will lead to a new RFC obsoleting RFC
4646) are only about adoption of certain values for the extlang subtag, see
http://tools.ietf.org/html/rfc4646#section-2.2.2 and the charter of the LTRU WG
at http://www.ietf.org/html.charters/ltru-charter.html .
Mainly in terms of references, I would propose the following changes in sec.
3.4.3:
/START proposal sec. 3.4.3/
[Definition:]   language represents formal natural language identifiers, as
defined by [BCP 47]. The value space and lexical space of language are the set
of all strings that conform to the ABNF

        (here RFC 4646 grammar)

This is the set of strings accepted by the grammar given in [RFC 4646], the RFC
which currently represents [BCP 47]. The base type of language is token.
Note: The ABNF grammar above provides the only normative constraint on
the lexical and value spaces of this type. The additional constraints imposed
on language identifiers by [BCP 47], and in particular its requirement that
language codes be registered with IANA or ISO if not given in ISO 639, are not
part of this datatype as defined here.
Note: [BCP 47] specifies that language tags and subtags "are to be treated as
case insensitive: there exist conventions for the capitalization of some of the
subtags, but these MUST NOT be taken to carry meaning." For instance, [ISO
3166] recommends that country codes are capitalized (MN Mongolia), while [ISO
639] recommends that language codes are written in lower case (mn Mongolian).
Since the language datatype is derived from string, it inherits from string a
one-to-one mapping from lexical representations to values. The literals 'MN'
and 'mn' therefore correspond to distinct values and have distinct canonical
forms. Users of this specification should be aware of this fact, the
consequence of which is that the case-insensitive treatment of language values
prescribed by [BCP 47] does not follow from the definition of this datatype
given here; applications which require case-insensitivity should make appropriate
adjustments.
/END proposal sec. 3.4.3/
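To illustrate the second note: because 'MN' and 'mn' are distinct values of the
datatype, an application that wants the case-insensitive treatment prescribed by
[BCP 47] has to normalize before comparing. A minimal sketch in Python follows;
the helper name same_language_tag is hypothetical and not taken from any
specification.

```python
def same_language_tag(a: str, b: str) -> bool:
    """Compare two language tags while ignoring case.

    Language tags are restricted to ASCII letters, digits and hyphens,
    so lowercasing both sides is enough for this comparison.
    """
    return a.lower() == b.lower()


# Plain string comparison keeps the two literals distinct,
# which is what the datatype definition implies.
print("MN" == "mn")                   # False
# The application-level comparison treats them as the same tag.
print(same_language_tag("MN", "mn"))  # True
```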
Since the RFC 3066 ABNF was rather lax and users were not punished for
producing useless language tags (like "English-England"), we see the danger
that the more restrictive grammar of RFC 4646 leads to more useful, but
unexpected results. To make people aware of this situation, I would propose the
following note as a health warning, with a non-normative reference to RFC 3066:
"The ABNF defined in the predecessor of RFC 4646, RFC 3066, was rather lax.
Users were not punished for producing ABNF-compliant, but otherwise useless
language tags. In contrast, the more restrictive grammar in RFC 4646 is more
appropriate for creating language tags. However, users need to be warned that
due to the lax ABNF of RFC 3066, they might get unexpected results when
processing legacy data."
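
As a rough illustration of the legacy data problem, the sketch below checks
strings against both grammars using Python regular expressions. The patterns
are my own transcription of the ABNF in RFC 3066 and in RFC 4646, section 2.1,
and should be checked against the RFC text before being relied upon; the
function names are hypothetical. The sketch tests well-formedness only and
does not consult the IANA registry.

```python
import re

# RFC 3066: Language-Tag = 1*8ALPHA *( "-" 1*8(ALPHA / DIGIT) )
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")

# RFC 4646 productions, transcribed as regular expressions (assumption:
# this transcription is faithful to the ABNF in section 2.1).
PRIVATEUSE = r"[xX](?:-[A-Za-z0-9]{1,8})+"
LANGTAG = (
    r"(?:[A-Za-z]{2,3}(?:-[A-Za-z]{3}){0,3}|[A-Za-z]{4}|[A-Za-z]{5,8})"  # language, extlang
    r"(?:-[A-Za-z]{4})?"                                # script
    r"(?:-(?:[A-Za-z]{2}|[0-9]{3}))?"                   # region
    r"(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*"   # variants
    r"(?:-[0-9A-WY-Za-wy-z](?:-[A-Za-z0-9]{2,8})+)*"    # extensions
    r"(?:-" + PRIVATEUSE + r")?"                        # trailing private use
)
GRANDFATHERED = r"[A-Za-z]{1,3}(?:-[A-Za-z0-9]{2,8}){1,2}"
RFC4646_TAG = re.compile(
    "(?:" + LANGTAG + "|" + PRIVATEUSE + "|" + GRANDFATHERED + ")"
)


def is_rfc3066(tag: str) -> bool:
    """True if the whole string matches the RFC 3066 grammar."""
    return RFC3066_TAG.fullmatch(tag) is not None


def is_rfc4646(tag: str) -> bool:
    """True if the whole string matches the RFC 4646 grammar as transcribed."""
    return RFC4646_TAG.fullmatch(tag) is not None


print(is_rfc3066("English-England"))  # True: the lax grammar accepts it
for tag in ["de-Latn-DE-1996", "en-a", "x"]:
    print(tag, is_rfc3066(tag), is_rfc4646(tag))
# de-Latn-DE-1996 True True
# en-a True False
# x True False
```

Legacy tags such as "en-a" or a bare "x" pass the RFC 3066 check but fail the
RFC 4646 one, which is the kind of unexpected result the note warns about.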

Received on Wednesday, 19 September 2007 05:50:29 UTC
