On 16 January 2013 11:30, Simon Sapin <simon.sapin@kozea.fr> wrote:

> "Common" plus "full" case fold mapping. I’m not expressing an opinion for
> or against this here, but I was confused as to what it means exactly. In
> various Unicode documents, one can read about "default", "simple",
> "special", "NFKC" case folding. How do these relate to "common" and "full"?

These are defined in the CaseFolding.txt file in the Unicode Character Database -- it uses C (common), F (full) and S (simple) statuses.

The C + S statuses are derived exclusively from UnicodeData.txt and map a Unicode codepoint to a single Unicode codepoint.

The C + F statuses use the data from SpecialCasing.txt as well, which adds Unicode codepoints that case fold to more than one Unicode codepoint.

> I’ll still need a more careful examination to know how to implement it, or
> to decide if Python’s casefold() method is the same:
>
> http://docs.python.org/3.3/library/stdtypes.html#str.casefold

From the Python docs:

"For example, the German lowercase letter 'ß' is equivalent to "ss". Since it is already lowercase, lower() <http://docs.python.org/3.3/library/stdtypes.html#str.lower> would do nothing to 'ß'; casefold() <http://docs.python.org/3.3/library/stdtypes.html#str.casefold> converts it to "ss"."

This would indicate that:

1. lower() is using the simple mapping (C + S) -- that is, returning the simple lowercase mapping from UnicodeData.txt only;
2. casefold() is using the full mapping (C + F).

This can be seen from the CaseFolding.txt file in the UCD, which has the following mappings for the Sharp S characters:

F - U+00DF => U+0073 U+0073
F - U+1E9E => U+0073 U+0073
S - U+1E9E => U+00DF

HTH,
- Reece

Received on Wednesday, 16 January 2013 12:22:31 UTC
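
The Sharp S mappings quoted above can be checked against Python's behaviour with a minimal sketch like the one below, assuming Python 3.3 or later (where str.casefold() exists). The helper name codepoints() is made up for this illustration and is not part of any library. The output only shows that the results agree with the CaseFolding.txt entries for these two characters; it does not show which data files Python reads internally.

```python
def codepoints(s):
    """Format a string as a space-separated list of U+XXXX codepoints.

    Hypothetical helper, used only to make the results easy to compare
    with the entries in CaseFolding.txt.
    """
    return " ".join("U+{:04X}".format(ord(c)) for c in s)


SHARP_S = "\u00DF"          # LATIN SMALL LETTER SHARP S
CAPITAL_SHARP_S = "\u1E9E"  # LATIN CAPITAL LETTER SHARP S

for ch in (SHARP_S, CAPITAL_SHARP_S):
    # lower() yields a single codepoint for both characters, while
    # casefold() expands both to two codepoints, matching the F entries.
    print("{}: lower() => {}; casefold() => {}".format(
        codepoints(ch), codepoints(ch.lower()), codepoints(ch.casefold())))
```

On these inputs, lower() gives U+00DF in both cases, the same result as the S entry for U+1E9E, and casefold() gives U+0073 U+0073, the same result as the two F entries.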