- From: Peter Moulder <pjrm@mail.internode.on.net>
- Date: Sat, 3 May 2014 19:38:32 +1000
- To: www-style@w3.org, Simon Sapin <simon.sapin@exyr.org>
On Fri, Apr 25, 2014 at 12:22:53AM +0100, Simon Sapin wrote:

> On 24/04/2014 23:06, Peter Moulder wrote:
> >> No, we *never* make author-defined names case-insensitive, because
> >> "case-insensitive" gets complicated once Unicode comes into play (and
> >> drags along "normalized" and other notions of equivalency). To avoid
> >> all of that, we just mandate case-sensitivity, which means literal
> >> codepoint comparisons.
> > I don't understand this last paragraph. In what way does honouring
> > the quoted sentence of syndata.html get complicated once Unicode comes
> > into play, and how does case-sensitivity avoid normalization issues of
> > whether decomposed and precomposed mean the same thing?
>
> Case-insensitivity within the ASCII range is easy to define: map 26
> letters, done.
>
> It get complicated quickly with Unicode: you can pick "simple" or
> "full" case folding [and issues with ß, İ]

I suspect that this is what Tab had in mind too, but these problems
don't apply: if you re-read the first post and its quoted sentence from
syndata.html, it's clear that we're talking about ASCII case folding
only.

> Precomposed vs. decomposed combining code points is not directly
> related to case folding but they’re two kinds of normalization. If
> you’re doing one, why not the other?

That seems a strange question to ask: you yourself have given reasons
we might want to avoid doing case-folding normalization outside of the
ASCII range, and these reasons apply whether or not we choose to do
precomposed/decomposed normalization.

> We chose to ignore all these issues

Sadly, ignoring them doesn't make them go away :). But more to the
point, introducing an inconsistency with syndata.html doesn't make them
go away either: ASCII case sensitivity has no effect (good or bad) on
any issues associated with applying or not applying any form of Unicode
normalization.

> and simply compare code points
> for equality when matching author-defined things.

I think all of the normalizations we might consider have a canonical
form, so that matching still ends up as just a test for code point
equality; the choices differ mainly in how baffling things are for
users.

For example, I think we'd all agree that one normalization we should
perform is conversion to a common character set such as Unicode before
comparing code points. That by itself isn't always enough to achieve
equality after copy-and-paste from a stylesheet in one charset to
another (e.g. they might differ in precomposed vs decomposed, as is the
case between two common charsets for Vietnamese), and it doesn't achieve
the Unicode specification that canonically equivalent strings should
"have the same [... and] behavior", so we might well also want NFD (or
NFC) normalization.

Most normalization problems I've heard of come from compatibility
equivalence (the one where Kelvin sign matches K). Unicode allows
compatibility-equivalent strings to behave differently, and I wouldn't
be surprised if the group decides not to do compatibility normalization
(NFKC/NFKD).

Don't take this message as pushing in favour of a particular degree of
normalization (I'm not well placed to know the costs in either
direction); I'm just pointing out that there seems to be a
misunderstanding of the proposal, and making sure that some relevant
issues are considered in the decision.

pjrm.
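For readers who want to see the equivalences discussed above concretely, here is a small sketch (not part of the original message) using Python's standard unicodedata module; the specific characters are illustrative choices, not anything prescribed in the thread.

```python
import unicodedata

# Canonical equivalence (precomposed vs. decomposed), the kind of mismatch
# described for Vietnamese: the same visible character, two code point sequences.
precomposed = "\u1ec7"         # "ệ" as a single code point
decomposed = "e\u0323\u0302"   # "e" + combining dot below + combining circumflex

print(precomposed == decomposed)                      # False: raw code point comparison
print(unicodedata.normalize("NFC", precomposed)
      == unicodedata.normalize("NFC", decomposed))    # True after NFC (NFD also works)

# The Kelvin sign mentioned above normalizes to a plain "K".
print(unicodedata.normalize("NFKC", "\u212a") == "K")  # True

# One of the non-ASCII case-folding complications mentioned (ß): full case
# folding changes the length of the string.
print("stra\u00dfe".casefold() == "strasse")           # True
```

This doesn't settle which normalization, if any, should be mandated; it only shows that raw code point equality and equality after normalization can disagree.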
Received on Saturday, 3 May 2014 09:38:59 UTC