2012/3/1 Erik Corry <erik.corry@gmail.com>
> I'm not in favour of big red switches, and I don't think the
> compartment based solution is going to be workable.
>
> I'd like to plead for a solution rather like the one Java has, where
> strings are sequences of UTF-16 codes and there are specialized ways
> to iterate over them. Looking at this entry from the Unicode FAQ:
> http://unicode.org/faq/char_combmark.html#7 there are different ways
> to describe the length (and iteration) of a string. The BRS proposal
> favours #2, but I think for most applications utf-16-based-#1 is just
> fine, and for the applications that want to "do it right" #3 is almost
> always the correct solution. Solution #3 needs library support in any
> case and has no problems with UTF-16.
>
> The central point here is that there are combining characters
> (accents) that you can't just normalize away. Getting them right has
> a lot of the same issues as surrogate pairs (you shouldn't normally
> chop them up, they count as one 'character', you can't tell how many
> of them there are in a string without looking, etc.). If you can
> handle combining characters then the surrogate pair support falls out
> pretty much for free.
>
The problem here is that you are mixing apples and oranges. Although it
*may* appear that surrogate pairs and grapheme clusters have features in
common, they operate at different semantic levels entirely. A solution that
attempts to conflate these two levels is going to cause problems at both
levels. A distinction should be maintained between the following levels:
- encoding units (e.g., UTF-16 code units)
- Unicode scalar values (code points)
- grapheme clusters
IMO, the current discussion should limit itself to the interface between
the first and second of these levels, and not introduce the third level
into the mix.
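
To make the three levels concrete, here is a rough sketch in Python (chosen only for illustration; the function names are my own, not from any proposal). The third function is a deliberate simplification that merely attaches combining marks to the preceding character; real grapheme segmentation follows UAX #29 and needs library support.

```python
import unicodedata


def utf16_code_units(s):
    """Level 1: the UTF-16 code units that encode s."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def code_points(s):
    """Level 2: code points (a Python str iterates by code point)."""
    return [ord(ch) for ch in s]


def approx_grapheme_clusters(s):
    """Level 3, approximated: glue combining marks onto the preceding character.

    This is NOT a full UAX #29 implementation, only an illustration.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# 'e' + COMBINING ACUTE ACCENT (U+0301) + MUSICAL SYMBOL G CLEF (U+1D11E)
sample = "e\u0301\U0001D11E"

print(len(utf16_code_units(sample)))          # 4: the clef needs a surrogate pair
print(len(code_points(sample)))               # 3: the surrogate pair is one code point
print(len(approx_grapheme_clusters(sample)))  # 2: the accent joins the 'e'
```

The same string gives three different lengths, one per level. Going from the first count to the second is a mechanical decoding step that needs no character data; going from the second to the third depends on Unicode character properties, which is why I would keep it out of the current discussion.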
G.