On 16 January 2013 11:30, Simon Sapin <simon.sapin@kozea.fr> wrote:
> "Common" plus "full" case fold mapping. I’m not expressing an opinion for
> or against this here, but I was confused as to what it means exactly. In
> various Unicode documents, one can read about "default", "simple",
> "special", "NFKC" case folding. How do these relate to "common" and "full"?
>
These are defined in the CaseFolding.txt file in the Unicode Character
Database -- it uses C (common), F (full) and S (simple) statuses. The C + S
statuses are derived exclusively from UnicodeData.txt and map a Unicode
codepoint to a single Unicode codepoint. The C + F statuses use the data
from SpecialCasing.txt as well, which adds Unicode codepoints that case fold
to more than one Unicode codepoint.
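
As a rough illustration of how the statuses combine, here is a minimal Python
sketch that builds a simple (C + S) and a full (C + F) folding table from a
few lines in the CaseFolding.txt format. The names SAMPLE, build_table and
fold are made up for this example, and the excerpt only covers the characters
discussed in this thread; a real implementation would read the whole file.

```python
# A small excerpt in the CaseFolding.txt layout:
#   <code>; <status>; <mapping>; # <name>
SAMPLE = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def build_table(text, statuses):
    """Return {character: folded string} for entries with a wanted status."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code, status, mapping = [field.strip() for field in line.split(";")[:3]]
        if status in statuses:
            folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
            table[chr(int(code, 16))] = folded
    return table


def fold(s, table):
    """Fold each character through the table; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in s)


SIMPLE = build_table(SAMPLE, {"C", "S"})  # one codepoint in, one codepoint out
FULL = build_table(SAMPLE, {"C", "F"})    # a codepoint may expand to several

print(ascii(fold("A\u00df\u1e9e", SIMPLE)))  # 'a\xdf\xdf'
print(ascii(fold("A\u00df\u1e9e", FULL)))    # 'assss'
```

With the simple table every mapping stays one codepoint long, so the string
length never changes; with the full table U+00DF and U+1E9E each expand to
two codepoints.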

> I’ll still need a more careful examination to know how to implement it, or
> to decide if Python’s casefold() method is the same:
>
> http://docs.python.org/3.3/library/stdtypes.html#str.casefold
>
From the Python docs: "For example, the German lowercase letter 'ß' is
equivalent to "ss". Since it is already lowercase, lower()
<http://docs.python.org/3.3/library/stdtypes.html#str.lower> would do
nothing to 'ß'; casefold()
<http://docs.python.org/3.3/library/stdtypes.html#str.casefold> converts it
to "ss"."
This would indicate that, for this character:
1. lower() behaves like the simple mapping (C + S) -- that is, it returns the
Simple_Lowercase_Mapping value from UnicodeData.txt only;
2. casefold() is using the full mapping (C + F).
This can be seen from the CaseFolding.txt file in the UCD, which has the
following mappings for the sharp s characters (U+00DF is the lowercase ß
and U+1E9E is the uppercase ẞ):
F - U+00DF => U+0073 U+0073
F - U+1E9E => U+0073 U+0073
S - U+1E9E => U+00DF
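
If it helps, the difference is easy to check interactively. This is a small
sketch that assumes Python 3.3 or later (where str.casefold() exists); the
helper name caseless_equal is only for illustration.

```python
sharp_s = "\u00df"          # ß, LATIN SMALL LETTER SHARP S
capital_sharp_s = "\u1e9e"  # ẞ, LATIN CAPITAL LETTER SHARP S

# lower() leaves ß alone because it is already lowercase.
print(ascii(sharp_s.lower()))             # '\xdf'
# casefold() applies the full (F) mapping from CaseFolding.txt.
print(ascii(sharp_s.casefold()))          # 'ss'
print(ascii(capital_sharp_s.casefold()))  # 'ss'


def caseless_equal(a, b):
    """Compare two strings after full case folding."""
    return a.casefold() == b.casefold()


print(caseless_equal("Stra\u00dfe", "STRASSE"))       # True
print("Stra\u00dfe".lower() == "STRASSE".lower())     # False
```

So a caseless match built on lower() would treat "Straße" and "STRASSE" as
different, while one built on casefold() would not.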
HTH,
- Reece