- From: timeless <timeless@gmail.com>
- Date: Thu, 12 May 2011 11:34:42 +0300
- To: Eric U <ericu@google.com>
- Cc: Jonas Sicking <jonas@sicking.cc>, Glenn Maynard <glenn@zewt.org>, Web Applications Working Group WG <public-webapps@w3.org>, Charles Pritchard <chuck@jumis.com>, Kinuko Yasuda <kinuko@google.com>
On Thu, May 12, 2011 at 3:02 AM, Eric U <ericu@google.com> wrote:
> There are a few things going on here:

yes

> 1) Does the filesystem preserve case? If it's case-sensitive, then
> yes. If it's case-insensitive, then maybe.
> 2) Is it case-sensitive? If not, you have to decide how to do case
> folding, and that's locale-specific. As I understand it, Unicode
> case-folding isn't locale specific, except when you choose to use the
> Turkish rules, which is exactly the problem we're talking about.
> 3) If you're case folding, are you going to go with a single locale
> everywhere, or are you going to use the locale of the user?
> 4) [I think this is what you're talking about w.r.t. not allowing both
> dotted and dotless i]: Should we attempt to detect filenames that are
> /too similar/ for some definition of /too similar/, ostensibly to
> avoid confusing the user.

> As I read what you wrote, you wanted:
> 1) yes

correct

> 2) no

correct

> 3) a new locale in which I, ı, İ and i all fold to the same letter, everywhere

I'm pretty sure Unicode's locale-insensitive behavior is precisely
what I want. I've included the section from Unicode 6 at the end.

> 4) yes, possibly only for the case of I, ı, İ and i

> 4 is, in the general case, impossible.

yes.

> It's not well-defined, and is just as likely to cause problems as solve them.

There are some defined ways to solve them (accepting that perfect is
the enemy of the good):
- one is to take the definitions of "too similar" selected for IDN
  registration...
- another is to just accept the recommendation from Unicode 6: "text can
  be normalized to Normalization Form NFKC or NFKD after case folding"

> If you *just* want to
> check for ı vs. i, it's possible, but it's still not clear to me that
> what you're doing will be the correct behavior in Turkish locales [are
> there any Turkish words, names abbreviations, etc. that only differ in
> that character?]

Well, the classic example of this is "sıkısınca" / "sikisince" [1], but
technically those differ in more than just the 'i' (they differ in the
a/e at the end).

My point is that if two things differ by such a small thing, it's
better to force them to have visibly different names. This could be a
'(2)' tacked onto the end of a file if the name is auto generated, or,
if the name is something a human is picking, it could be "please pick
another name, it looks too close to <preview of other object> <object
name>".

The other instances I've run into all seem to be cases where there's a
canonical spelling and then a "folded for Latin users" writing. I
certainly can't speak for all cases.

> and it doesn't matter elsewhere.

Actually, I think we ended up trying to compile blacklists while
developing punycode [2] for IDN [3]. I guess RFC 4290 [4], 4713 [5],
5564 [6], and 5992 [7] have tables which, while not complete, are
certainly referenceable, and given that UAs already have to deal with
punycode, it's likely that they'd have access to those tables.

I think the relevant section from Unicode 6 [8] is probably 5.18 Case
Mappings (page 171?)

> Where case distinctions are not important, other distinctions between Unicode characters
> (in particular, compatibility distinctions) are generally ignored as well. In such circumstances,
> text can be normalized to Normalization Form NFKC or NFKD after case folding,
> thereby producing a normalized form that erases both compatibility distinctions and case
> distinctions.

I think this is probably what I want.

> However, such normalization should generally be done only on a restricted
> repertoire, such as identifiers (alphanumerics).

Yes, I'm hand waving at this requirement - filenames are in a way
identifiers; you aren't supposed to encode an essay in a filename.

> See Unicode Standard Annex #15, “Unicode
> Normalization Forms,” and Unicode Standard Annex #31, “Unicode Identifier and
> Pattern Syntax,” for more information. For a summary, see “Equivalent Sequences” in
> Section 2.2, Unicode Design Principles.

> Caseless matching is only an approximation of the language-specific rules governing the
> strength of comparisons. Language-specific case matching can be derived from the collation
> data for the language, where only the first- and second-level differences are used. For
> more information, see Unicode Technical Standard #10, “Unicode Collation Algorithm.”

Of note:
> In most environments, such as in file systems, text is not and cannot be tagged with language
> information. In such cases, the language-specific mappings must not be used.

> Otherwise,
> data structures such as B-trees might be built based on one set of case foldings and
> used based on a different set of case foldings. This discrepancy would cause those data
> structures to become corrupt.

Of note:
> For such environments, a constant, language-independent, default case folding is required.

On the subject of file names and encodings, I can't recall if you
specified normalization equivalence. From memory, OS X and Windows
disagree on decomposition-encoding or something (please forgive me for
not having the right terms here, I can look them up in a later email).
I think that addressing the concerns above as suggested would address
this too.

[1] http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two-people-puts-three-more-in-jail
[2] http://en.wikipedia.org/wiki/Punycode
[3] http://www.mozilla.org/projects/security/tld-idn-policy-list.html
[4] http://tools.ietf.org/html/rfc4290
[5] http://tools.ietf.org/html/rfc4713
[6] http://tools.ietf.org/html/rfc5564
[7] http://tools.ietf.org/html/rfc5992
[8] http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf
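
To make the "case fold, then normalize to NFKC" recommendation quoted above concrete, here is a rough sketch in Python. It is only an illustration under stated assumptions: the helper names fold_key and find_collision are made up for this example and are not part of any spec or API, and Python's str.casefold() is used as a stand-in for Unicode's default (language-independent) case folding. It does not try to cover every normalization edge case.

```python
import unicodedata


def fold_key(name: str) -> str:
    """Hypothetical helper: a language-independent comparison key."""
    # str.casefold() applies Unicode default full case folding,
    # without the Turkic special cases.
    folded = name.casefold()
    # NFKC erases compatibility distinctions (e.g. the "fi" ligature)
    # and composes canonically equivalent sequences.
    return unicodedata.normalize("NFKC", folded)


def find_collision(existing, candidate):
    """Hypothetical helper: return the first existing name whose key
    matches the candidate's key, or None if there is no match."""
    key = fold_key(candidate)
    for name in existing:
        if fold_key(name) == key:
            return name
    return None


# One name stored with combining accents (decomposed form).
names = ["README.txt", "re\u0301sume\u0301.doc"]

print(find_collision(names, "readme.TXT"))            # case-only difference
print(find_collision(names, "R\u00c9SUM\u00c9.DOC"))  # case + composition
print(fold_key("I") == fold_key("i"))                 # plain I vs i
print(fold_key("i") == fold_key("\u0131"))            # i vs dotless i
```

A caller could use a match from find_collision to append a '(2)' or to ask the user for a different name, as suggested above. Note that with default folding the dotless ı (U+0131) keeps its own key, so treating i and ı as "too similar" would need an extra table on top of this, such as the IDN ones mentioned earlier.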
Received on Thursday, 12 May 2011 08:40:51 UTC