Re: [File API: FileSystem] Path restrictions and case-sensitivity

On Thu, May 12, 2011 at 3:02 AM, Eric U <ericu@google.com> wrote:
> There are a few things going on here:

yes

> 1) Does the filesystem preserve case?  If it's case-sensitive, then
> yes.  If it's case-insensitive, then maybe.
> 2) Is it case-sensitive?  If not, you have to decide how to do case
> folding, and that's locale-specific.  As I understand it, Unicode
> case-folding isn't locale specific, except when you choose to use the
> Turkish rules, which is exactly the problem we're talking about.
> 3) If you're case folding, are you going to go with a single locale
> everywhere, or are you going to use the locale of the user?
> 4) [I think this is what you're talking about w.r.t. not allowing both
> dotted and dotless i]: Should we attempt to detect filenames that are
> /too similar/ for some definition of /too similar/, ostensibly to
> avoid confusing the user.

> As I read what you wrote, you wanted:
> 1) yes

correct

> 2) no

correct

> 3) a new locale in which İ, ı, I and i all fold to the same letter, everywhere

I'm pretty sure Unicode's locale-insensitive behavior is precisely
what I want. I've included the section from Unicode 6 at the end.

> 4) yes, possibly only for the case of İ, ı, I and i

> 4 is, in the general case, impossible.

yes.

> It's not well-defined, and is just as likely to cause problems as solve them.

There are some defined ways to approach it (accepting that the perfect
is the enemy of the good):
- one is to take the definitions of "too similar" selected for IDN
registration...
- another is to just accept the recommendation from Unicode 6: "text
can be normalized to Normalization Form NFKC or NFKD after case
folding"
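
A minimal sketch of the second option, assuming Python's str.casefold()
as the locale-independent case folding and unicodedata.normalize() for
NFKC. The name match_key is hypothetical, and this is only an
illustration of "normalize after case folding", not a proposal for the
spec text:

```python
import unicodedata


def match_key(name: str) -> str:
    """Hypothetical comparison key: default case folding, then NFKC."""
    # str.casefold() applies Unicode's locale-independent full case folding.
    folded = name.casefold()
    # NFKC then erases compatibility distinctions (fullwidth forms, ligatures).
    return unicodedata.normalize("NFKC", folded)


# "\uff26" is FULLWIDTH LATIN CAPITAL LETTER F; "\ufb01" is the "fi" ligature.
print(match_key("\uff26ILE.txt") == match_key("file.txt"))
print(match_key("\ufb01le.TXT") == match_key("file.txt"))
```

Note that the default folding keeps dotless ı (U+0131) distinct from i,
so catching that pair would still need one of the "too similar" tables.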

> If you *just* want to
> check for ı vs. i, it's possible, but it's still not clear to me that
> what you're doing will be the correct behavior in Turkish locales [are
> there any Turkish words, names abbreviations, etc. that only differ in
> that character?]

Well, the classic example of this is "sıkısınca" / "sikisince" [1],
but technically those differ in more than just the 'i' (they differ in
the a/e at the end).

My point is that if two names differ by something that small, it's
better to force them to be visibly different. If the name is
auto-generated, this could be a '(2)' tacked onto the end of the file
name; if a human is picking the name, it could be "please pick
another name, it looks too close to <preview of other object> <object
name>".
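
For the auto-generated case, something like the following would do. It
is only a sketch: unique_name and match_key are hypothetical names, and
"too close" is assumed here to mean "equal after case folding and NFKC",
which is one possible definition among several:

```python
import unicodedata
from typing import Iterable


def match_key(name: str) -> str:
    # Assumed notion of "too close": equal after case folding and NFKC.
    return unicodedata.normalize("NFKC", name.casefold())


def unique_name(wanted: str, existing: Iterable[str]) -> str:
    """Return wanted, or wanted plus ' (n)', so that its key is unused."""
    taken = {match_key(n) for n in existing}
    if match_key(wanted) not in taken:
        return wanted
    n = 2
    # Keep counting until the suffixed name no longer collides.
    while match_key(f"{wanted} ({n})") in taken:
        n += 1
    return f"{wanted} ({n})"


print(unique_name("Report", ["report", "notes"]))
```

For a human-chosen name the same key comparison could instead trigger
the "please pick another name" prompt.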

The other instances I've run into all seem to be cases where there's a
canonical spelling and then a "folded for Latin users" writing. I
certainly can't speak for all cases.

> and it doesn't matter elsewhere.

Actually, I think we ended up trying to compile blacklists while
developing Punycode [2] for IDN [3]. I guess RFC 4290 [4], 4713 [5],
5564 [6], and 5992 [7] have tables which, while not complete, are
certainly referenceable, and given that UAs already have to deal with
Punycode, it's likely that they'd have access to those tables.

I think the relevant section from Unicode 6 [8] is probably 5.18 Case
Mappings (page 171?):
> Where case distinctions are not important, other distinctions between Unicode characters
> (in particular, compatibility distinctions) are generally ignored as well. In such circumstances,
> text can be normalized to Normalization Form NFKC or NFKD after case folding,
> thereby producing a normalized form that erases both compatibility distinctions and case
> distinctions.

I think this is probably what I want

> However, such normalization should generally be done only on a restricted
> repertoire, such as identifiers (alphanumerics).

Yes, I'm hand-waving at this requirement: filenames are in a way
identifiers; you aren't supposed to encode an essay in a filename.

> See Unicode Standard Annex #15, “Unicode
> Normalization Forms,” and Unicode Standard Annex #31, “Unicode Identifier and
> Pattern Syntax,” for more information. For a summary, see “Equivalent Sequences” in
> Section 2.2, Unicode Design Principles.

> Caseless matching is only an approximation of the language-specific rules governing the
> strength of comparisons. Language-specific case matching can be derived from the collation
> data for the language, where only the first- and second-level differences are used. For
> more information, see Unicode Technical Standard #10, “Unicode Collation Algorithm.”

Of note:
> In most environments, such as in file systems, text is not and cannot be tagged with language
> information. In such cases, the language-specific mappings must not be used.

> Otherwise,
> data structures such as B-trees might be built based on one set of case foldings and
> used based on a different set of case foldings. This discrepancy would cause those data
> structures to become corrupt.
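
To illustrate the quoted concern, here is a small sketch that uses a
sorted list with binary search as a stand-in for a B-tree. Both folding
functions are assumptions for the demo: turkish_fold is only a rough
imitation of Turkish-style folding, and default_fold is Python's
str.casefold(). It shows a lookup missing an entry, which is the mild
version of the corruption described:

```python
import bisect
from typing import List


def turkish_fold(name: str) -> str:
    # Rough stand-in for Turkish-style folding: I -> ı, İ -> i, then lowercase.
    return name.replace("I", "\u0131").replace("\u0130", "i").lower()


def default_fold(name: str) -> str:
    # Python's locale-independent case folding.
    return name.casefold()


def contains(sorted_keys: List[str], key: str) -> bool:
    # Binary search; only reliable if key was folded the same way as the index.
    i = bisect.bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key


names = ["ICON", "item", "jam"]
index = sorted(turkish_fold(n) for n in names)  # built with one folding

print(contains(index, turkish_fold("ICON")))  # looked up with the same folding
print(contains(index, default_fold("ICON")))  # looked up with a different folding
```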

Of note:
> For such environments, a constant, language-independent, default case folding is required.

On the subject of file names and encodings, I can't recall whether you
specified normalization equivalence. From memory, OS X and Windows
disagree on decomposition-encoding or something (please forgive me for
not having the right terms here; I can look them up in a later email).
I think that addressing the concerns above as suggested would address
this too.
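
As a sketch of the kind of mismatch I have in mind, assuming the
disagreement is about precomposed versus decomposed forms of the same
character (the file names below are made up):

```python
import unicodedata

composed = "caf\u00e9.txt"     # "é" as a single code point (U+00E9)
decomposed = "cafe\u0301.txt"  # "e" followed by COMBINING ACUTE ACCENT (U+0301)

# Compared as raw strings, the two names are different.
print(composed == decomposed)

# After normalizing both to the same form (NFC here), they compare equal.
same = unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
print(same)
```

NFKC, as recommended in the quoted section, also applies canonical
composition, so a key built that way would treat these two as the same
name.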

[1] http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two-people-puts-three-more-in-jail
[2] http://en.wikipedia.org/wiki/Punycode
[3] http://www.mozilla.org/projects/security/tld-idn-policy-list.html
[4] http://tools.ietf.org/html/rfc4290
[5] http://tools.ietf.org/html/rfc4713
[6] http://tools.ietf.org/html/rfc5564
[7] http://tools.ietf.org/html/rfc5992
[8] http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf

Received on Thursday, 12 May 2011 08:40:51 UTC