Re: [File API: FileSystem] Path restrictions and case-sensitivity from Eric U on 2011-05-22 (public-webapps@w3.org from April to June 2011)

From: Eric U <ericu@google.com>
Date: Sun, 22 May 2011 10:41:07 -0700
To: timeless <timeless@gmail.com>
Cc: Jonas Sicking <jonas@sicking.cc>, Glenn Maynard <glenn@zewt.org>, Web Applications Working Group WG <public-webapps@w3.org>, Charles Pritchard <chuck@jumis.com>, Kinuko Yasuda <kinuko@google.com>
Message-ID: <BANLkTimLOiKepzrfz3Hi5V7oZaRX9JJgKw@mail.gmail.com>
On Thu, May 12, 2011 at 1:34 AM, timeless <timeless@gmail.com> wrote:
> On Thu, May 12, 2011 at 3:02 AM, Eric U <ericu@google.com> wrote:
>> There are a few things going on here:
>
> yes
>
>> 1) Does the filesystem preserve case?  If it's case-sensitive, then
>> yes.  If it's case-insensitive, then maybe.
>> 2) Is it case-sensitive?  If not, you have to decide how to do case
>> folding, and that's locale-specific.  As I understand it, Unicode
>> case-folding isn't locale specific, except when you choose to use the
>> Turkish rules, which is exactly the problem we're talking about.
>> 3) If you're case folding, are you going to go with a single locale
>> everywhere, or are you going to use the locale of the user?
>> 4) [I think this is what you're talking about w.r.t. not allowing both
>> dotted and dotless i]: Should we attempt to detect filenames that are
>> /too similar/ for some definition of /too similar/, ostensibly to
>> avoid confusing the user.
>
>> As I read what you wrote, you wanted:
>> 1) yes
>
> correct
>
>> 2) no
>
> correct
>
>> 3) a new locale in which I, ı, İ and i all fold to the same letter, everywhere
>
> I'm pretty sure Unicode's locale insensitive behavior is precisely
> what i want. I've included the section from Unicode 6 at the end.
>
>> 4) yes, possibly only for the case of I, ı, İ and i
>
>> 4 is, in the general case, impossible.
>
> yes.
>
>> It's not well-defined, and is just as likely to cause problems as solve them.
>
> There are some defined ways to solve them (accepting that perfect is
> the enemy of the good),
> - one is to take the definitions of "too similar" selected for idn
> registration...
> - another is to just accept the recommendation from unicode 6 "text
> can be normalized to Normalization Form NFKC or NFKD after case
> folding"
>
>> If you *just* want to
>> check for ı vs. i, it's possible, but it's still not clear to me that
>> what you're doing will be the correct behavior in Turkish locales [are
>> there any Turkish words, names, abbreviations, etc. that only differ in
>> that character?]
>
> Well, the classic example of this is "sıkısınca" / "sikisince" [1],
> but technically those differ in more than just the 'i' (they differ in
> the a/e at the end).
>
> My point is that if two things differ by such a small thing, it's
> better to force them to have visibly different names, this could be a
> '(2)' tacked onto the end of a file if the name is auto generated, or
> if the name is something a human is picking, it could be "please pick
> another name, it looks too close to <preview of other object> <object
> name>".

This again is really oriented towards the file-picker use case which
we've agreed [I think?] isn't the most common use case.  Most of the
time we expect the filenames to be generated by an application that's
using the filesystem for a backing store.  Changing the filenames out
from under it 1) won't improve anything; 2) may break things.

Given that we're talking about a problem that's subjective and thus
can't really be "solved", and the solution you propose is so
complicated, I really don't see that this is a win over just saying
"we support all valid UTF-8 sequences; build whatever you want on top
of that".  There are ways to add some of the behavior you're asking
for in JavaScript libraries on top, as long as you're willing to have
a central coordinator for your filesystem access.  Let's let people
experiment with that as they wish.
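
A minimal sketch of what such a library-level coordinator might look
like, written in Python rather than JavaScript for brevity.  The names
`similarity_key` and `NameCoordinator`, and the choice of key (default
Unicode case folding followed by NFKC), are assumptions made for
illustration; nothing here is part of the spec.

```python
import unicodedata


def similarity_key(name):
    """Hypothetical 'too similar' key: default case folding, then NFKC."""
    return unicodedata.normalize("NFKC", name.casefold())


class NameCoordinator:
    """Hypothetical coordinator that every file-creation call goes through."""

    def __init__(self):
        # Maps similarity key -> the exact name that was stored under it.
        self._names = {}

    def create(self, name):
        key = similarity_key(name)
        existing = self._names.get(key)
        if existing is not None and existing != name:
            # Policy choice of this sketch: refuse rather than rename.
            raise ValueError(f"{name!r} looks too close to {existing!r}")
        self._names[key] = name


coordinator = NameCoordinator()
coordinator.create("Report.txt")
try:
    coordinator.create("REPORT.TXT")
    collided = False
except ValueError:
    collided = True
```

With this key, "Report.txt" and "REPORT.TXT" collide, but dotless ı and
i do not, because the language-independent default folding leaves
U+0131 unchanged.  A library that wants them treated as the same name
would need its own table on top.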

It appears to me that a majority of those who've spoken up support
this conclusion, and I will try to update the spec this week.  As
before, I'm still only speccing out the sandboxed filesystem, so
expansions into access outside the sandbox, and serialization of these
filenames into local filesystem names, can be dealt with later.

> The other instances I've run into all seem to be cases where there's a
> canonical spelling and then a "folded for Latin users" writing. I
> certainly can't speak for all cases.
>
>> and it doesn't matter elsewhere.
>
> Actually, i think we ended up trying to compile blacklists while
> developing punycode [2] for IDN [3]. I guess rfc 4290 [4], 4713 [5],
> 5564 [6], and 5992 [7], have tables which while not complete are
> certainly referenceable, and given that UAs already have to deal with
> punycode, it's likely that they'd have access to those tables.
>
> I think the relevant section from unicode 6 [8] is probably 5.18 Case
> Mappings (page 171?)
>> Where case distinctions are not important, other distinctions between Unicode characters
>> (in particular, compatibility distinctions) are generally ignored as well. In such circumstances,
>> text can be normalized to Normalization Form NFKC or NFKD after case folding,
>> thereby producing a normalized form that erases both compatibility distinctions and case
>> distinctions.
>
> I think this is probably what I want
>
>> However, such normalization should generally be done only on a restricted
>> repertoire, such as identifiers (alphanumerics).
>
> Yes, I'm hand waving at this requirement - filenames are in a way
> identifiers, you aren't supposed to encode an essay in a filename.
>
>> See Unicode Standard Annex #15, “Unicode
>> Normalization Forms,” and Unicode Standard Annex #31, “Unicode Identifier and
>> Pattern Syntax,” for more information. For a summary, see “Equivalent Sequences” in
>> Section 2.2, Unicode Design Principles.
>
>> Caseless matching is only an approximation of the language-specific rules governing the
>> strength of comparisons. Language-specific case matching can be derived from the collation
>> data for the language, where only the first- and second-level differences are used. For
>> more information, see Unicode Technical Standard #10, “Unicode Collation Algorithm.”
>
> Of note:
>> In most environments, such as in file systems, text is not and cannot be tagged with language
>> information. In such cases, the language-specific mappings must not be used.
>
>> Otherwise,
>> data structures such as B-trees might be built based on one set of case foldings and
>> used based on a different set of case foldings. This discrepancy would cause those data
>> structures to become corrupt.
>
> Of note:
>> For such environments, a constant, language-independent, default case folding is required.
>
> On the subject of file names and encodings, I can't recall if you
> specified normalization equivalence. From memory OS X and Windows
> disagree on decomposition-encoding or something (please forgive me for
> not having the right terms here, I can look them up in a later email).
> I think that addressing the concerns above as suggested would address
> this too.
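
To illustrate the normalization question, a small Python sketch (the
filename is a made-up example): the same visible name has different
UTF-8 byte sequences in composed (NFC) and decomposed (NFD) form, so a
store that compares raw sequences sees two distinct names unless
something normalizes them first.

```python
import unicodedata

name = "caf\u00e9.txt"  # hypothetical filename using precomposed U+00E9

composed = unicodedata.normalize("NFC", name)
decomposed = unicodedata.normalize("NFD", name)  # 'e' followed by U+0301

# The two forms render the same but encode to different byte sequences.
same_bytes = composed.encode("utf-8") == decomposed.encode("utf-8")

# Normalizing both sides to one form makes them compare equal again.
same_after_nfc = composed == unicodedata.normalize("NFC", decomposed)
```

Here `same_bytes` is false and `same_after_nfc` is true, so whether
these count as one file or two depends on whether a normalization step
is applied before comparison.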
>
> [1] http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two-people-puts-three-more-in-jail
> [2] http://en.wikipedia.org/wiki/Punycode
> [3] http://www.mozilla.org/projects/security/tld-idn-policy-list.html
> [4] http://tools.ietf.org/html/rfc4290
> [5] http://tools.ietf.org/html/rfc4713
> [6] http://tools.ietf.org/html/rfc5564
> [7] http://tools.ietf.org/html/rfc5992
> [8] http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf
>
Received on Sunday, 22 May 2011 17:41:49 UTC