- From: Jonathan Kew <jfkthame@googlemail.com>
- Date: Mon, 11 Mar 2013 15:22:36 +0000
- To: "Phillips, Addison" <addison@lab126.com>
- CC: "CSS WWW Style (www-style@w3.org)" <www-style@w3.org>, "www-international@w3.org" <www-international@w3.org>
On 8/3/13 21:00, Phillips, Addison wrote:
> Some weeks ago, Jonathan Kew wrote:
>
>>>
>>> Given the resistance there seems to be to implementing a -full-
>>> solution to string-equivalence issues, I don't see why we'd
>>> require people to implement anything more complex/expensive than
>>> a purely 1:1 mapping in this particular case.
>>
>> As Tab mentioned, the Internationalization Group concluded that C+F
>> was the "right" way, as posted last month [1]. The description
>> given by Addison was:
>>
>> Case Insensitive comparison: Where CSS cannot be case-insensitive
>> for legacy reasons or for implementation choice reasons, the I18N
>> WG recommends that comparison be done using Unicode "common" plus
>> "full" case fold mapping, as we previously recommended. Suggestions
>> that this is hard to implement or low-performance are, in our
>> opinion, unfounded, as the mapping consists of a relatively small
>> table. There is a demonstration implementation in JavaScript and we
>> have confirmed with our Unicode colleagues that this is the right
>> approach [2].
>>
>> I would be fine with either C+S or C+F mappings, but I think we
>> should take care to define only a single "Unicode caseless
>> matching" if at all possible for use across all Web platform. I'm
>> not especially keen on defining it in the Fonts spec but for now
>> it's only needed there. I think it would be unfortunate to use C+F
>> in some places and C+S in others.
>
> I agree that only one should be defined.
>
> Generally speaking, the "simple" (C+S) case is less good than the
> "full" (C+F) case for matching in part because the comparison needs
> to work both ways---as a casefold transform on the search term as
> well as on the searched corpus. The difference between C+S and C+F is
> mainly that the latter casefolds certain characters to a
> multicharacter sequence. This sequence may actually be the one used
> in the searched values. Using C+F therefore results in higher match
> fidelity.
>
>>
>> I should note here that HTML5 specifies a different flavor of
>> caseless matching for radio button name attributes but I think
>> that's actually a mistake and have filed a bug on that, it's trying
>> to use a particular Unicode caseless matching algorithm to mimic
>> the matching behavior in IE, which clearly uses some flavor of
>> platform-specific caseless matching with normalization.
>>
>
> It would be best if everyone used the same specific matching scheme
> for caseless matching. That's easier for content authors to
> understand.
>
> At the moment, because normalization is effectively not part of the
> "rules of the road", the I18N WG is recommending that specs and
> implementations *not* include normalization in internal identifier
> matching (such as the radio button case). However, for caseless
> matching we feel that C+F is the way to go and should form the base
> for a caseless match algorithm. We are in the process of revising
> CharMod to say this (and to provide detailed and specific
> guidelines).
>
> User text searching features (such as the "find" command in most
> browsers) are a separate topic (and one where we feel that
> normalization is probably advisable), but this is a separate case
> (which we'll cover in CharMod).
>
> So... for Font name matching, although there might be very minor
> efficiency gains found using C+S, the I18N WG recommends that, for
> consistency, C+F be used.

As long as normalization is -excluded- from the matching algorithm, the
benefit of C+F over C+S is so marginal as to be almost irrelevant. It
will allow "heiß" to match "HEISS", but it still won't enable "grüß" to
match "grüß", or "साफ़" to match "साफ़";[1] yet it has a cost in code
complexity that every implementation will have to pay.
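
To make the trade-off above concrete, here is a minimal Python sketch, not a proposed implementation. It assumes Python's built-in str.casefold() as the full (C+F) folding; fold_one_to_one is a hypothetical stand-in for C+S that simply skips any folding that would expand to more than one code point, so it only approximates the real C+S table. The strings are written with escapes so that normalization in transit cannot alter them.

```python
def fold_full(s: str) -> str:
    # Full case folding (C+F): may expand 1:n, e.g. U+00DF -> "ss".
    return s.casefold()


def fold_one_to_one(s: str) -> str:
    # Hypothetical approximation of C+S: apply only foldings that stay
    # a single code point, and leave every other character untouched.
    out = []
    for ch in s:
        folded = ch.casefold()
        out.append(folded if len(folded) == 1 else ch)
    return "".join(out)


def matches_full(a: str, b: str) -> bool:
    # Un-normalized C+F comparison.
    return fold_full(a) == fold_full(b)


def matches_one_to_one(a: str, b: str) -> bool:
    # Un-normalized 1:1 comparison.
    return fold_one_to_one(a) == fold_one_to_one(b)


precomposed = "gr\u00fc\u00df"    # <U+0067,U+0072,U+00FC,U+00DF>
decomposed = "gru\u0308\u00df"    # <U+0067,U+0072,U+0075,U+0308,U+00DF>

print(matches_full("hei\u00df", "HEISS"))        # C+F handles the 1:n case
print(matches_one_to_one("hei\u00df", "HEISS"))  # 1:1 folding does not
print(matches_full(precomposed, decomposed))     # neither helps without normalization
```

The only case where the two folds differ here is the 1:n expansion; the precomposed and decomposed spellings stay distinct under both.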

It seems to me that if implementations are going to use something more
complex than un-normalized C+S (which has the key property that it
depends only on 1:1 mappings and comparison, never 1:n) for matching,
the first priority should be normalization.

IMO, we should -either- optimize for implementation simplicity and
efficiency (specify un-normalized C+S matching) -or- aim to match user
perceptions of equivalence (specify canonically-normalized C+F); but
un-normalized C+F falls squarely between the two stools, fulfilling
neither of the competing requirements.

Regards,

JK

[1] In case normalization kicks in somewhere between my keyboard and
your inbox, these were typed as:

<U+0067,U+0072,U+00FC,U+00DF> vs <U+0067,U+0072,U+0075,U+0308,U+00DF>

and:

<U+0938,U+093E,U+092B,U+093C> vs <U+0938,U+093E,U+095E>

respectively.
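
For comparison, the "canonically-normalized C+F" option can be sketched as follows, using the code point sequences from the footnote. This is only an illustration: it assumes Python's unicodedata.normalize and str.casefold(), and follows the Unicode Standard's definition of a canonical caseless match, NFD(toCasefold(NFD(X))). The function names are made up for this example.

```python
import unicodedata


def canonical_caseless_key(s: str) -> str:
    # NFD(toCasefold(NFD(s))): decompose, fold case (C+F), then
    # decompose again in case folding introduced composed characters.
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", decomposed.casefold())


def canonical_caseless_match(a: str, b: str) -> bool:
    return canonical_caseless_key(a) == canonical_caseless_key(b)


# The two pairs from the footnote, written as escapes.
latin = ("gr\u00fc\u00df", "gru\u0308\u00df")
devanagari = ("\u0938\u093e\u092b\u093c", "\u0938\u093e\u095e")

print(canonical_caseless_match(*latin))
print(canonical_caseless_match(*devanagari))
print(canonical_caseless_match("hei\u00df", "HEISS"))
```

With normalization included, both footnote pairs compare equal, as does the 1:n case; the cost is two normalization passes plus the fold for every string compared.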
Received on Monday, 11 March 2013 15:23:06 UTC