Re: labels and case sensitivity from Ross Wetmore on 2000-10-05 (ietf-dav-versioning@w3.org from October to December 2000)

From: Ross Wetmore <rwetmore@verticalsky.com>
Date: Thu, 05 Oct 2000 13:04:21 -0400
To: infonuovo@email.com
Cc: "Geoffrey M. Clemm" <geoffrey.clemm@rational.com>, ietf-dav-versioning@w3.org
Message-ID: <39DCB495.943F69D2@verticalsky.com>
I think the comments on locale-independent case folding and the accompanying CaseFolding.txt file are what you are looking for? 

I think this is still in the realm of implementation, and as such is beyond the scope of mandatory requirements by this draft. Given the desire for terseness, even suggesting this might be too much :-).

Cheers,
RossW
=====

From http://www.unicode.org/unicode/reports/tr21/

2.3 Caseless Matching

Caseless matching is commonly implemented using case-folding. The latter is the process of mapping strings to a normalized form where case differences are erased. Case-folding allows for fast caseless matches in lookups. Caseless matching itself is only an approximation to the language-specific rules governing the strength of comparisons. Where locale-sensitive case matching is used, this information can be derived from the collation data for the language, where only the first and second level differences are used. For more information, see UTR #10: Unicode Collation Algorithm.

However, in most environments, such as in file systems, text is not tagged with locale information. In such cases, the locale-specific mappings should not be used. Otherwise data structures, such as B-trees, might be built based on one set of case foldings, and used based on a different set, which will cause the B-trees to become corrupt. For those environments, a constant, locale-independent case folding should be used.

The CaseFolding.txt file can be used for doing such a locale-independent case folding. This file was generated from the Unicode Character Database, using both the one-to-one mappings and the one-to-many mappings. It folds all characters having different case forms together into a common form. To compare two strings for caseless matching, you can fold each string using this data, and then use a binary comparison.

Generally, where case distinctions are not important, other distinctions between Unicode characters (in particular, compatibility distinctions) are ignored as well. In such circumstances, text can be normalized to Normalization Form KC or KD after case-folding, to produce a normalized form that erases both compatibility distinctions and case distinctions. (See UTR #15: Unicode Normalization Forms for more information.)


"Dennis E. Hamilton" wrote:
> 
> There's a misunderstanding here.
> 
> There are no "rules for case-insensitive comparison" that are implemented in
> common among all implementations of standard libraries.  Having
> non-standardized functions of the same name and signature is not solving
> this problem.  Apparently, each of those implementers felt that the rules
> weren't too difficult to grasp and were indeed self-evident.  The
> consequence, in the current state of affairs, is that two WebDAV servers, or
> a server and a client, coming up with the same case-insensitive compare
> result is at best a coincidence or the happy consequence of having used the
> same platform and development software.  (I found *different*
> implementations in a comparison of Visual C++ 6.0 libraries, though.  This
> is a wonderful  example of the value of leaving it to the implementation
> absent a specification of the rules that are followed.)
> 
> I agree that having a single character set, in the fashion of Unicode, cuts
> down the variables to be dealt with.  However, the rules for
> case-insensitive comparison can still vary by locale, and because of
> ambiguities in Unicode this does not seem to be resolved.
> 
> Please site an unambiguous specification of a Unicode caseless compare (is
> there even one in Java, the house of run-the-same-everywhere?), and agree on
> that as a requirement for DeltaV "case-insensitive comparison."   Then I
> agree that you are home free.  Failing that, all that is being created is
> people getting caught by interoperability problems in the wild, at a point
> where it will be very difficult to resolve the problems.  There is no reason
> to have these problem at all.
> 
> Here's the challenge.  Find a place where the rules are actually laid out,
> rather than described in principal.  Shouldn't be hard yes?  Or even find an
> implementation that you are willing to reverse-engineer the rules from.  I
> say it is necessary to be specific.  Why not.  It is easy, yes?
> 
> -- Dennis
> 
> AIIM DMware Technical Coordinator
> http://www.infonuovo.com/dmware
> ------------------
> Dennis E. Hamilton
> InfoNuovo
> mailto:infonuovo@email.com
> tel. +1-206-779-9430 (gsm)
> fax. +1-425-793-0283
> http://www.infonuovo.com
> 
> -----Original Message-----
> From: Ross Wetmore [mailto:rwetmore@verticalsky.com]
> Sent: Thursday, October 05, 2000 06:56
> To: infonuovo@email.com
> Cc: Geoffrey M. Clemm; ietf-dav-versioning@w3.org
> Subject: Re: labels and case sensitivity
> 
> Sorry for the delay in response, practical realities can take precedence.
> 
> Most of these issues were dealt with in the early nineties by widespread
> industry groups such as the Unicode Consortium (www.unicode.org) or
> XOpen standards such as XPG4 in 1992 (www.opengroup.org).
> 
> I find it difficult to believe the rules for case-insensitive comparison
> are that difficult to grasp, though I admit the implementations for the
> old 8-bit character representations had to deal with the fundamental
> ambiguities of context dependent mappings and can be pretty hairy
> reading. The solution for such implementation problems of course is to
> agree on the context or to remove the ambiguities in the character set
> representations. The latter means either restricting use to the POSIX
> subset of all locales, or using one of the "universal" sets like Unicode.
> 
> Instead of 8-bit string functions such as strcoll, you might want to look
> at the WCS equivalents in most standard libraries. Issues such as no upper
> or lower case form are smoothly handled by towupper(), towlower(). If
> you are concerned about equivalent forms (which is stretching things for
> case insensitivity), try wcscoll(). Most of the major programming
> environments, Java, Windows, and commercial Unix variants should have all
> the implementation tools needed. Many have a function such as wcsicmp(),
> or String.equalsIgnoreCase().
> 
> I believe the current draft mandates UTF8-Unicode as the format for
> exchanging character data. Further specification of the implementation
> details in the standard is probably not in keeping with the tradition of
> such things and should be left to the implementors, right?
> 
> Cheers,
> RossW
> 
> "Dennis E. Hamilton" wrote:
> >
> > Oh oh,
> >
> > I dunno about case insensitivity.
> >
> > With regard to what locale?
> >
> > Who has a fixed case insensitivity rule that works the same across all
> > locales in Unicode?
> >
> > This is only slightly less problematic than dealing with locale-preferred
> > collating sequences.
> >
> > Case insensitive comparisons do not always work when one party is using
> the
> > C locale (in which we think we know what case insensitivity means and
> which
> > we tend to think is the universal locale for people's data but isn't) and
> > another locale/language that has some letters that have no capital form,
> > others that have no lower-case form, and none of them are in the
> [extended?]
> > "Roman" alphabet of the typical 26 A to Z.
> >
> > I am responding out of context, but I wanted to offer the warning that
> > unless you are shoe-horned into ISO 646, this is not as obvious as most
> > people think.
> >
> > I think if you dig through Plauger's text on the Standard C (or C++)
> > string.h and ctypes.h library, you will see what there is to be concerned
> > about.  By the way, this problem is of such interesting complexity that
> > there is no caseless string compare in the Standard C library.  The
> > non-standard ones are generally not well-specified and reading the actual
> > code in certain widely-used library implementations is a real eye-opener.
> >
> > Unless you are prepared to rigorously specify the rules for
> case-insensitive
> > comparison, you are creating a problem, not solving one.
> >
> > -- Dennis
> >
> > AIIM DMware Technical Coordinator
> > http://www.infonuovo.com/dmware
> > ------------------
> > Dennis E. Hamilton
> > InfoNuovo
> > mailto:infonuovo@email.com
> > tel. +1-206-779-9430 (gsm)
> > fax. +1-425-793-0283
> > http://www.infonuovo.com
> >
> > -----Original Message-----
> > From: ietf-dav-versioning-request@w3.org
> > [mailto:ietf-dav-versioning-request@w3.org]On Behalf Of Ross Wetmore
> > Sent: Monday, October 02, 2000 14:11
> > To: Geoffrey M. Clemm
> > Cc: ietf-dav-versioning@w3.org
> > Subject: Re: labels and case sensitivity
> >
> > It might be more reasonable to change the wording to "should"
> > for case preservation, and make it a "must" that all label
> > comparisons are done case insensitive. There is no need to know
> > anything about server properties to operate under these conditions.
> >
> > The argument for case preservation is generally one of usability or
> > legibility in that CapitalizationCanHelpDistinguishContext. But
> > there are enough cases where case does not exist or is lost, as when
> > a user has to type in a very long string, that requiring it is too
> > strong a condition.
> >
> > Case insensitive comparisons will always work without any further
> > special handling and are the flip side of minimizing requirements
> > but being flexible in accomodating different perspectives. The only
> > significant argument against this is reduction in namespace size.
> >
> > Cheers,
> > RossW
> >
> > "Geoffrey M. Clemm" wrote:
> > >
> > > Currently, we require that labels be case preserving and
> > > that label comparison be case sensitive.
> > > Chris Kaler would like to have it be server-defined whether
> > > labels are case sensitive.
> > >
> > > Comments?
> > >
> > > Cheers,
> > > Geoff
> > >
> > > (I'll save my comments for a followup, so as not to confuse
> > > Chris' request with my comments).
Received on Thursday, 5 October 2000 13:04:05 UTC