Re: I18N issue: case-sensitivity of locale subdirectories from Marcos Caceres on 2009-06-08 (public-webapps@w3.org from April to June 2009)

From: Marcos Caceres <marcosc@opera.com>
Date: Mon, 8 Jun 2009 20:34:06 +0200
To: "Jere.Kapyaho" <Jere.Kapyaho@nokia.com>
Cc: robin <robin@berjon.com>, public-webapps <public-webapps@w3.org>
Message-ID: <b21a10670906081134k74826d22i78f0ad8c25bfa12f@mail.gmail.com>
On Thu, May 7, 2009 at 12:33 PM,  <Jere.Kapyaho@nokia.com> wrote:
> On 5.5.2009 13.16, "ext Marcos Caceres" <marcosc@opera.com> wrote:
>> On Wed, Apr 29, 2009 at 4:16 PM, Robin Berjon <robin@berjon.com> wrote:
>>> Assume we have two localisation subdirectories:
>>>
>>>  locales/en/
>>>  locales/EN/
>>>
>>> What happens? BCP47 (which we reference) is defined to be case-insensitive
>>> so it doesn't help us much in this respect.
>>>
>>> There are multiple options:
>>>
>>>  a) we define a canonical casing and all others are ignored;
>>>  b) we select an order of priority and we only consider one (the first to
>>> match);
>>>  c) we select an order of priority and we merge them all (in that order,
>>> with a given precedence rule);
>>>  d) the device on which the user agent is catches fire.
>>>
>>> I think that (a) should be ruled out because as BCP47 tells us, ISO639-1
>>> recommends lowercase (language codes), ISO3166-1 recommends uppercase
>>> (country codes), and ISO15924 recommends titlecase (script codes). These are
>>> different, but likely to be confusing, and I don't think that developers
>>> should have to worry about that.
>>
>> Agreed.
>
> Because BCP47 is indeed case-insensitive [1], both "en" and "EN" (and also
> "eN" and "En") are considered equivalent.

In the spec, the problem is actually simpler (worse!) than that:

 1. authors can create folders that make use of language tags in any
case form ("locales/EN-us" etc.).
 2. the UA locale is in lowercase form, and the UA only attempts to
match folder names in lowercase form.

So, for instance, if the UA locale is "en-us", it won't match "en-US"
or any other case variant.
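
To make the mismatch concrete, here is a minimal Python sketch. The
function names and the folder list are hypothetical (nothing here is
spec text); it only contrasts a case-sensitive lookup with one that
lowercases both sides before comparing:

```python
def match_exact(ua_locale, folder_names):
    """Case-sensitive lookup: returns the matching folder name or None."""
    return ua_locale if ua_locale in folder_names else None


def match_case_insensitive(ua_locale, folder_names):
    """Lowercase both sides before comparing (one possible fix, an assumption)."""
    wanted = ua_locale.lower()
    for name in folder_names:
        if name.lower() == wanted:
            return name
    return None


folders = ["en-US", "fr"]  # hypothetical names found under locales/
print(match_exact("en-us", folders))             # None
print(match_case_insensitive("en-us", folders))  # en-US
```

The second function returns the folder name as the author wrote it, so
the UA can still open the path with its original casing.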

> While it is probably an oversight
> or error to provide several variants of the same language tag with different
> character case anyway, they need to be considered somehow because they *are*
> equivalent, unless it is made explicit that this is an error in the
> packaging.

This is too harsh, IMO. We need to deal with this in the spec and not
punish authors.

> The path inside the widget's ZIP file is already defined as
> case-insensitive, so it is actually already an error to have two or more
> folders with names that differ only by character case.

Right. It is possible to do this, but it requires some skill (in
Windows) or a case-sensitive OS.

>  Even if some
> implementation unzips the content of the widget to a local filesystem, we
> have no control over whether filenames in that filesystem are
> case-insensitive or case-sensitive.

Right. As I already stated, in the spec we convert the user agent's
locale list to lower case. However, paths are treated as case-sensitive,
so 'en-us' will not match 'en-US' at runtime. This is an issue, and I'm
not sure what to do about it.

>>> I don't have a strong opinion on this, but I do have a preference for a
>>> rule based on (b): if multiple locale subdirectories have the same
>>> case-insensitive name, then the one that comes first in ASCII-code order
>>> (e.g. in order: EN, En, eN, en) is used and the others are ignored.
>>
>> This seems reasonable. I will add this.
>
> I suggest that the widget packaging rules say that any localized folders
> must be unique in terms of a case-insensitive match, otherwise the packaging
> is invalid [2].

Like I said, I think it's a bit harsh to say that the package is
invalid. I think a conformance checker should warn authors to make
localized folders lower case OR we deal with it in the spec.
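
As a rough illustration of both options, the following Python sketch
groups folder names that are equal under a case-insensitive match,
keeps the one that sorts first in ASCII-code order (Robin's rule (b)),
and collects a warning for the rest, as a conformance checker might.
The function names and message wording are hypothetical, and the
sketch assumes the names are already known to be ASCII:

```python
from collections import defaultdict


def group_by_folded_name(folder_names):
    """Group names that are identical once lowercased."""
    groups = defaultdict(list)
    for name in folder_names:
        groups[name.lower()].append(name)
    return groups


def resolve(folder_names):
    """Return (chosen, warnings): one folder per locale plus checker warnings."""
    chosen = {}
    warnings = []
    for folded, variants in sorted(group_by_folded_name(folder_names).items()):
        # Python compares strings by code point, which is ASCII-code
        # order for ASCII-only names.
        ordered = sorted(variants)
        chosen[folded] = ordered[0]
        if len(ordered) > 1:
            ignored = ", ".join(ordered[1:])
            warnings.append(
                f"locales/{folded}: keeping {ordered[0]}, ignoring {ignored}"
            )
    return chosen, warnings


chosen, warnings = resolve(["en", "EN", "eN", "En", "fr"])
print(chosen)    # {'en': 'EN', 'fr': 'fr'}
print(warnings)  # ['locales/en: keeping EN, ignoring En, eN, en']
```

Nothing is merged here: the losing variants are simply reported and
skipped, so the package is still usable.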

> This also allows us to not talk about ASCII code ordering.

Right.

> Furthermore, there is then no need to merge the contents of such folders.

Exactly.

> For the degenerate (but unfortunately unavoidable) case where someone has
> managed to slip in two or more such folders, define a canonical casing
> (obvious suggestion: lowercase) and use it, then simply ignore any others.

In the zip file, there are no folders per se. There are only zip
relative paths that act as identifiers. It's only upon decompression
that physical folders may be created (or merged). But, like I said,
the problem is more fundamental than what is being described because
the user agent's locales are in lower case form.

>>> The argument in favour of only using one is that we already have to merge
>>> multiple directories, and adding one merge operation for what is in all
>>> probability a user error seems like too much complexity for little value
>>> (I'm happy to be contradicted by implementers however). Picking ASCII-code
>>> order is based on the fact that the directory names must be ASCII here (the
>>> others must be discarded), and picking the first is arbitrary.
>>>
>>> Thoughts?
>>
>> I support b. Added some of your text above to the spec.
>
> I guess none of a)-d) really fit my observations as such. It's more like
> additional packaging rules + shades of a).
>
> Note that for comparisons with the widget locale value you still need to
> case-fold [3] everything anyway. There is no guarantee that the widget
> locale matches any localized subfolder name as such, because the widget
> locale itself could use capitalization that really carries no meaning, but
> fails to match any localized folder unless you do a case-insensitive
> comparison. In this case the comparison can be also language-insensitive,
> because BCP47 language tags consist of US-ASCII characters.

Yep. Need to fix this.
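
For what it's worth, a language-insensitive comparison along the lines
Jere describes could look like the Python sketch below. It folds only
A-Z to a-z and leaves every other character alone, so the result does
not depend on any locale-specific casing rules. The helper names are
hypothetical:

```python
def ascii_lower(tag):
    """Map A-Z to a-z and leave every other character untouched."""
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch
        for ch in tag
    )


def same_tag(a, b):
    """Compare two language tags, ignoring ASCII case only."""
    return ascii_lower(a) == ascii_lower(b)


print(ascii_lower("zh-Hant-TW"))    # zh-hant-tw
print(same_tag("en-US", "EN-us"))   # True
print(same_tag("en-US", "en-GB"))   # False
```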

-- 
Marcos Caceres
http://datadriven.com.au
Received on Monday, 8 June 2009 18:34:46 UTC