Re: [css-counter-styles] case sensitivity of counter-style-name

On Fri, Apr 25, 2014 at 12:22:53AM +0100, Simon Sapin wrote:
> On 24/04/2014 23:06, Peter Moulder wrote:
> >>No, we *never* make author-defined names case-insensitive, because
> >>"case-insensitive" gets complicated once Unicode comes into play (and
> >>drags along "normalized" and other notions of equivalency).  To avoid
> >>all of that, we just mandate case-sensitivity, which means literal
> >>codepoint comparisons.
> >I don't understand this last paragraph.  In what way does honouring
> >the quoted sentence of syndata.html get complicated once Unicode comes
> >into play, and how does case-sensitivity avoid normalization issues of
> >whether decomposed and precomposed mean the same thing?
> 
> Case-insensitivity within the ASCII range is easy to define: map 26
> letters, done.
> 
> It gets complicated quickly with Unicode: you can pick "simple" or
> "full" case folding [and issues with ß, İ]

I suspect that this is what Tab had in mind too, but these problems don't
apply: if you re-read the first post and its quoted sentence from syndata.html,
it's clear that we're talking about ASCII case folding only.
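
To make the distinction concrete, here's a quick Python 3 sketch of my
own (nothing from the spec): ASCII-range case folding is a fixed
26-letter mapping, whereas Unicode case folding can change a string's
length and depends on which folding you pick.

    def ascii_lower(s):
        # Map only A-Z to a-z; leave every other code point untouched.
        return ''.join(chr(ord(c) + 32) if 'A' <= c <= 'Z' else c
                       for c in s)

    ascii_lower("Decimal-Leading-Zero")  # 'decimal-leading-zero'
    ascii_lower("ß")                     # 'ß'  -- non-ASCII left alone
    "ß".casefold()                       # 'ss' -- full folding changes length
    "İ".lower()                          # 'i' + U+0307 -- two code points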

> Precomposed vs. decomposed combining code points is not directly
> related to case folding but they’re two kinds of normalization. If
> you’re doing one, why not the other?

That seems a strange question to ask: you yourself have given reasons we might
want to avoid doing case-folding normalization outside of the ASCII range, and
these reasons apply whether or not we choose to do precomposed/decomposed
normalization.
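
For instance (another quick Python 3 sketch of my own, nothing
normative): the same author-visible name can reach the parser as either
precomposed or decomposed code points, and a raw code point comparison
treats them as two different names.

    precomposed = "caf\u00E9-list"   # 'café-list' using U+00E9
    decomposed  = "cafe\u0301-list"  # 'café-list' using 'e' + U+0301

    precomposed == decomposed        # False -- different code point sequences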

> We chose to ignore all these issues

Sadly, ignoring them doesn't make them go away :).  But more to the point,
introducing an inconsistency with syndata.html doesn't make them go away
either: ASCII case sensitivity has no effect (good or bad) on any issues
associated with applying or not applying any form of Unicode normalization.

> and simply compare code points
> for equality when matching author-defined things.

I think all of the normalizations we might consider have a canonical form,
so matching still ends up being just a test for code point equality; the
choices differ mainly in how baffling the results are for users.  For
example, I think we'd all agree that one normalization we should perform is
conversion to a common character set such as Unicode before comparing code
points.  That by itself isn't always enough to achieve equality after
copy-and-paste from a stylesheet in one charset to another (e.g. the two
might differ in precomposed vs. decomposed form, as is the case between two
common charsets for Vietnamese), and it doesn't satisfy the Unicode
specification's requirement that canonically equivalent strings should
"have the same [... and] behavior", so we might well also want NFD (or NFC)
normalization.
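
As a sketch of that copy-and-paste case (Python 3 again; I'm relying on
the assumption that windows-1258, one common Vietnamese charset, encodes
tone marks as combining characters, so it decodes to decomposed text):

    import unicodedata

    # The same letter 'ạ' (a with dot below) as decoded from two charsets:
    from_cp1258 = b"a\xf2".decode("cp1258")        # 'a' + U+0323, decomposed
    from_utf8   = b"\xe1\xba\xa1".decode("utf-8")  # U+1EA1, precomposed

    from_cp1258 == from_utf8                       # False
    (unicodedata.normalize("NFC", from_cp1258)
     == unicodedata.normalize("NFC", from_utf8))   # True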

Most normalization problems I've heard of come from compatibility
equivalence (the one where the ligature "ﬁ" matches "fi" and superscript
"²" matches "2"; the Kelvin sign matching "K" is actually canonical
equivalence).  Unicode allows compatibility-equivalent strings to behave
differently, and I wouldn't be surprised if the group decides not to do
compatibility normalization (NFKC/NFKD).
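
(A last non-normative Python 3 sketch: canonical normalization leaves
compatibility-equivalent strings distinct; only the "K" forms fold them
together.)

    import unicodedata

    unicodedata.normalize("NFC",  "\uFB01")          # 'ﬁ' -- unchanged
    unicodedata.normalize("NFKC", "\uFB01")          # 'fi'
    unicodedata.normalize("NFKC", "\u00B2") == "2"   # True -- superscript two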


Don't take this message as pushing in favour of a particular degree of
normalization (I'm not well placed to know the costs in either direction);
I'm just pointing out that there seems to be a misunderstanding of the
proposal, and making sure that some relevant issues are considered in
the decision.

pjrm.
