Re: Case of language tags from Andy Seaborne on 2013-02-28 (public-rdf-wg@w3.org from February 2013)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Thu, 28 Feb 2013 19:34:33 +0000
To: Peter Patel-Schneider <pfpschneider@gmail.com>
CC: RDF-WG <public-rdf-wg@w3.org>
Message-ID: <512FB149.5020407@epimorphics.com>

Peter,

If the rule for lower casing is qualified by US-ASCII it would be OK
as it meets:

"""
Implementers SHOULD specify a locale-neutral
    casing operation to ensure that case folding of subtags does not
    produce this value, which is illegal in language tags.
"""
[*] "this value" is the upper case situation.
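
A minimal sketch of what a lowercasing rule qualified by US-ASCII could look like, in Python. The helper name `ascii_lower` is hypothetical and is not taken from BCP47 or the RDF drafts. It maps only A-Z to a-z and leaves every other character alone, so the result does not depend on any locale.

```python
import string

# Translation table covering only the 26 US-ASCII capital letters A-Z.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(tag: str) -> str:
    """Lowercase A-Z only; every other character is left untouched.

    Hypothetical helper for illustration, not an API from any RDF library.
    """
    return tag.translate(_ASCII_LOWER)


if __name__ == "__main__":
    print(ascii_lower("EN-GB"))   # ASCII capitals are lowered
    print(ascii_lower("TR-\u0130"))  # U+0130 is outside A-Z and is not touched
```

Because the mapping is a fixed table, "I" (U+0049) always becomes "i" (U+0069), whatever the default locale of the process.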

and the current text is a bit better than 2004 concepts where the 
case-changing was separate from the RFC 3066 mention.  The problem only 
arises if the transformation to lower case is separate from the RFC 3066 
reference.

There is a canonicalization algorithm in section 2.1.1 of BCP47:

"""
An implementation can reproduce this format without accessing the
    registry as follows.  ....
"""
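
For illustration only, here is a rough Python sketch of that formatting rule as I read section 2.1.1 of RFC 5646: all subtags lowercase, except that two-letter subtags become uppercase and four-letter subtags become titlecase when they are neither the first subtag nor after a singleton. The function name `format_tag` is hypothetical, and treating "after singletons" as "anywhere after the first singleton" is my interpretation. It does no well-formedness checking.

```python
import string

# Fixed US-ASCII tables, so no locale is involved in the case changes.
_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)
_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def format_tag(tag: str) -> str:
    """Apply the registry-free case formatting to a language tag.

    Hypothetical helper; assumes the tag is already well-formed.
    """
    out = []
    after_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        formatted = subtag.translate(_LOWER)
        if index > 0 and not after_singleton:
            if len(subtag) == 2:
                # Two-letter subtag not at the start: uppercase.
                formatted = subtag.translate(_UPPER)
            elif len(subtag) == 4:
                # Four-letter subtag not at the start: titlecase.
                formatted = formatted[:1].translate(_UPPER) + formatted[1:]
        if len(subtag) == 1:
            # Everything after a singleton stays lowercase.
            after_singleton = True
        out.append(formatted)
    return "-".join(out)


if __name__ == "__main__":
    for example in ("en-ca-x-ca", "SGN-be-fr", "AZ-LATN-X-LATN"):
        print(example, "->", format_tag(example))
```

Note that this produces the BCP47 recommended format, which is not the all-lowercase form the RDF draft asks for.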

(Didn't know about the Lithuanian and Azeri issues)

On 28/02/13 18:07, Peter Patel-Schneider wrote:
> I'm not an expert in BCP47, and going through the grammar is painful
> (what *is* ALPHA?).
>
> However, it sure seems to me that language tags are US-ASCII characters,
> and BCP47 itself talks about upper and lower case (boy is that ever an
> old notion!).  It thus seems to me that what is meant is perfectly clear
> in terms of BCP47, which even has a similar warning about how to change
> case in language tags.  If the WG wanted to be more pedantic then the
> document could say something like, "does not contain any uppercase
> US-ASCII letters - any uppercase US-ASCII letters in surface syntaxes
> MUST be normalized into their US-ASCII lowercase equivalents".
>
> I think that just saying to treat the language tag (case?) insensitively
> ends up with the same question as transforming to lower case.

You would not be lower casing and exporting changed data if you retain 
the original and do a locale-insensitive, case-insensitive comparison of 
strings.
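
A small Python sketch of that approach, under the assumption that a US-ASCII-only fold is used as the comparison key. The name `same_language_tag` is hypothetical. The stored tags are never rewritten; only a temporary folded copy is compared.

```python
import string

# Fold only A-Z to a-z; this table is independent of any locale.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def same_language_tag(a: str, b: str) -> bool:
    """Compare two language tags case-insensitively over US-ASCII.

    Hypothetical helper; the inputs themselves are left unchanged.
    """
    return a.translate(_ASCII_LOWER) == b.translate(_ASCII_LOWER)


if __name__ == "__main__":
    stored = "en-GB"  # kept exactly as it appeared in the data
    print(same_language_tag(stored, "EN-gb"))
    print(stored)     # still the original spelling
```

Since nothing outside A-Z is folded, a dotless "ı" (U+0131) never compares equal to "I" (U+0049), and the data can be exported exactly as it came in.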

The world will not fall apart because of this ... but it has happened in 
the real world:

https://issues.apache.org/jira/browse/JENA-407

 Andy

>
> peter
>
> On Thu, Feb 28, 2013 at 9:26 AM, Andy Seaborne
> <andy.seaborne@epimorphics.com>
> wrote:
>
>
>     Section 3.3: (of the editors draft):
>
>     """
>     a non-empty language tag as defined by [BCP47]. The language tag
>     must be well-formed according to section 2.2.9 of [BCP47], and must
>     be normalized to lowercase.
>     """
>
>     but "lowercase" is locale sensitive.
>
>     What is lower case "I"?  It's not always "i".
>
>     It isn't in Turkish where there are different dotted and dotless
>     I-like letters.
>
>     Upper case "I" (U+0049); lower case "ı" (U+0131)
>     !=
>     Upper case "İ" (U+0130); lower case "i" (U+0069)
>
>     http://www.i18nguy.com/unicode/turkish.png
>
>     The ideal solution is to say that the language tag is to be treated
>     case-insensitively.
>
>              Andy
>
>     (this email is in UTF-8)
>
>

Received on Thursday, 28 February 2013 19:35:05 UTC