Re: Unicode caseless matching details (was Re: [CSSWG] Minutes Tucson F2F 2013-02-05 Tue PM I: Fonts)

On 8/3/13 21:00, Phillips, Addison wrote:
> Some weeks ago, Jonathan Kew wrote:
>
>>>
>>> Given the resistance there seems to be to implementing a -full-
>>> solution to string-equivalence issues, I don't see why we'd
>>> require people to implement anything more complex/expensive than
>>> a purely 1:1 mapping in this particular case.
>>
>> As Tab mentioned, the Internationalization Group concluded that C+F
>> was the "right" way, as posted last month [1].  The description
>> given by Addison was:
>>
>> Case Insensitive comparison: Where CSS cannot be case-insensitive
>> for legacy reasons or for implementation choice reasons, the I18N
>> WG recommends that comparison be done using Unicode "common" plus
>> "full" case fold mapping, as we previously recommended. Suggestions
>> that this is hard to implement or low-performance are, in our
>> opinion, unfounded, as the mapping consists of a relatively small
>> table. There is a demonstration implementation in JavaScript and we
>> have confirmed with our Unicode colleagues that this is the right
>> approach [2].
>>
>> I would be fine with either C+S or C+F mappings, but I think we
>> should take care to define only a single "Unicode caseless
>> matching" if at all possible for use across the Web platform. I'm
>> not especially keen on defining it in the Fonts spec but for now
>> it's only needed there. I think it would be unfortunate to use C+F
>> in some places and C+S in others.
>
> I agree that only one should be defined.
>
> Generally speaking, the "simple" (C+S) case is less good than the
> "full" (C+F) case for matching in part because the comparison needs
> to work both ways---as a casefold transform on the search term as
> well as on the searched corpus. The difference between C+S and C+F is
> mainly that the latter casefolds certain characters to a
> multicharacter sequence. This sequence may actually be the one used
> in the searched values. Using C+F therefore results in higher match
> fidelity.
>
>>
>> I should note here that HTML5 specifies a different flavor of
>> caseless matching for radio button name attributes but I think
>> that's actually a mistake and have filed a bug on that: it's trying
>> to use a particular Unicode caseless matching algorithm to mimic
>> the matching behavior in IE, which clearly uses some flavor of
>> platform-specific caseless matching with normalization.
>>
>
> It would be best if everyone used the same specific matching scheme
> for caseless matching. That's easier for content authors to
> understand.
>
> At the moment, because normalization is effectively not part of the
> "rules of the road", the I18N WG is recommending that specs and
> implementations *not* include normalization in internal identifier
> matching (such as the radio button case). However, for caseless
> matching we feel that C+F is the way to go and should form the base
> for a caseless match algorithm. We are in the process of revising
> CharMod to say this (and to provide detailed and specific
> guidelines).
>
> User text searching features (such as the "find" command in most
> browsers) are a separate topic, one where we feel that
> normalization is probably advisable, and which we'll cover in
> CharMod.
>
> So... for Font name matching, although there might be very minor
> efficiency gains found using C+S, the I18N WG recommends that, for
> consistency, C+F be used.

As long as normalization is -excluded- from the matching algorithm, the 
benefit of C+F over C+S is so marginal as to be almost irrelevant. It 
will allow "heiß" to match "HEISS", but it still won't enable "grüß" to 
match "grüß", or "साफ़" to match "साफ़";[1] yet it has a cost in code 
complexity that every implementation will have to pay.
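
The three pairs above can be checked with a minimal Python sketch. It assumes that Python's str.casefold() is an acceptable stand-in for C+F. The simple_fold() helper is hypothetical: Python does not expose simple case folding, so it approximates C+S by keeping only 1:1 mappings. The strings are written as escapes so that nothing normalizes them in transit.

```python
def full_fold(s: str) -> str:
    # str.casefold() applies Unicode full case folding (C+F), which may be 1:n.
    return s.casefold()


def simple_fold(s: str) -> str:
    # Hypothetical approximation of C+S: use the fold only when it is 1:1,
    # otherwise fall back to a 1:1 lowercase mapping, or leave the character.
    out = []
    for ch in s:
        folded = ch.casefold()
        if len(folded) == 1:
            out.append(folded)
        else:
            lowered = ch.lower()
            out.append(lowered if len(lowered) == 1 else ch)
    return "".join(out)


PAIRS = [
    ("hei\u00df", "HEISS"),                               # sharp s vs "SS"
    ("gr\u00fc\u00df", "gru\u0308\u00df"),                # precomposed vs decomposed u-umlaut
    ("\u0938\u093e\u092b\u093c", "\u0938\u093e\u095e"),   # PHA + NUKTA vs U+095E
]

for a, b in PAIRS:
    print(ascii(a), ascii(b),
          "C+S:", simple_fold(a) == simple_fold(b),
          "C+F:", full_fold(a) == full_fold(b))
```

With no normalization step, only the first pair compares equal, and only under full folding. The other two pairs differ solely in their canonical form, which neither fold touches.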

It seems to me that if implementations are going to use something more 
complex than un-normalized C+S (which has the key property that it 
depends only on 1:1 mappings and comparison, never 1:n) for matching, 
the first priority should be normalization.

IMO, we should -either- optimize for implementation simplicity and 
efficiency (specify un-normalized C+S matching) -or- aim to match user 
perceptions of equivalence (specify canonically-normalized C+F); but 
un-normalized C+F falls squarely between the two stools, fulfilling 
neither of the competing requirements.
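
For comparison, here is a rough sketch of what the canonically-normalized C+F option might look like in Python. It follows the shape of the Unicode Standard's canonical caseless match, NFD(toCasefold(NFD(X))), and assumes str.casefold() and unicodedata.normalize() are adequate stand-ins for the underlying operations. The function names are illustrative only, not taken from any spec.

```python
import unicodedata


def canonical_caseless_key(s: str) -> str:
    # Normalize first so that canonically equivalent inputs fold identically,
    # then normalize again because folding can produce non-NFD output.
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", decomposed.casefold())


def canonical_caseless_match(a: str, b: str) -> bool:
    # Two strings match when their normalized, case-folded keys are identical.
    return canonical_caseless_key(a) == canonical_caseless_key(b)


# Precomposed u-umlaut plus sharp s, against decomposed uppercase with "SS".
print(canonical_caseless_match("gr\u00fc\u00df", "GRU\u0308SS"))

# Devanagari PHA + NUKTA against the single code point U+095E.
print(canonical_caseless_match("\u0938\u093e\u092b\u093c", "\u0938\u093e\u095e"))
```

Both calls report a match here, at the cost of two normalization passes per string on top of the 1:n fold.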

Regards,

JK

[1] In case normalization kicks in somewhere between my keyboard and 
your inbox, these were typed as:
  <U+0067,U+0072,U+00FC,U+00DF> vs <U+0067,U+0072,U+0075,U+0308,U+00DF>
and:
  <U+0938,U+093E,U+092B,U+093C> vs <U+0938,U+093E,U+095E>
respectively.

Received on Monday, 11 March 2013 15:23:06 UTC