RE: Unicode caseless matching details (was Re: [CSSWG] Minutes Tucson F2F 2013-02-05 Tue PM I: Fonts)

Some weeks ago, Jonathan Kew wrote:

> >
> > Given the resistance there seems to be to implementing a -full-
> > solution to string-equivalence issues, I don't see why we'd require
> > people to implement anything more complex/expensive than a purely
> > 1:1 mapping in this particular case.
> 
> As Tab mentioned, the Internationalization Group concluded that C+F was the
> "right" way, as posted last month [1].  The description given by Addison was:
> 
>   Case Insensitive comparison: Where CSS cannot be
>   case-insensitive for legacy reasons or for implementation
>   choice reasons, the I18N WG recommends that comparison be done
>   using Unicode "common" plus "full" case fold mapping, as we
>   previously recommended. Suggestions that this is hard to
>   implement or low-performance are, in our opinion, unfounded, as
>   the mapping consists of a relatively small table. There is a
>   demonstration implementation in JavaScript and we have
>   confirmed with our Unicode colleagues that this is the right
>   approach [2].
> 
> I would be fine with either C+S or C+F mappings, but I think we should take care
> to define only a single "Unicode caseless matching" if at all possible for use
> across all Web platform. I'm not especially keen on defining it in the Fonts spec
> but for now it's only needed there.
> I think it would be unfortunate to use C+F in some places and C+S in others.

I agree that only one should be defined.

Generally speaking, "simple" (C+S) folding is less suitable for matching than "full" (C+F) folding, in part because the comparison has to work in both directions: the same casefold transform is applied to the search term and to the searched corpus. The main difference between C+S and C+F is that the latter folds certain characters to a multi-character sequence (for example, U+00DF LATIN SMALL LETTER SHARP S folds to "ss"). That sequence may be exactly what appears in the values being searched, so C+F yields higher match fidelity.
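To make the difference concrete, here is a rough Python sketch. It relies on str.casefold(), which performs Unicode full case folding. Python has no built-in simple folding, so simple_fold() below is only my approximation (keep one-to-one mappings, otherwise fall back to a single-character lowercase form); the function names are made up for illustration and are not taken from any spec.

```python
def full_fold(text: str) -> str:
    """C+F: Unicode full case folding, as done by str.casefold()."""
    return text.casefold()


def simple_fold(text: str) -> str:
    """Approximation of C+S (an assumption, not the official table):
    use the fold only when it is a single character, otherwise fall
    back to a one-character lowercase form or the character itself."""
    out = []
    for ch in text:
        folded = ch.casefold()
        if len(folded) == 1:
            out.append(folded)
        else:
            lowered = ch.lower()
            out.append(lowered if len(lowered) == 1 else ch)
    return "".join(out)


def caseless_match(a: str, b: str, fold=full_fold) -> bool:
    """Apply the same fold to both sides, then compare."""
    return fold(a) == fold(b)


# The sharp s folds to "ss" under C+F, so both spellings meet in the middle.
print(caseless_match("Straße", "STRASSE"))                    # True
# Under the 1:1 approximation the sharp s stays as-is, so the match fails.
print(caseless_match("Straße", "STRASSE", fold=simple_fold))  # False
```

The point is that the fold is applied to both strings, so a value already written with the expanded sequence ("ss") still matches one written with the single character.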

> 
> I should note here that HTML5 specifies a different flavor of caseless matching
> for radio button name attributes but I think that's actually a mistake and have
> filed a bug on that, it's trying to use a particular Unicode caseless matching
> algorithm to mimic the matching behavior in IE, which clearly uses some flavor
> of platform-specific caseless matching with normalization.
> 

It would be best if everyone used the same specific matching scheme for caseless matching. That's easier for content authors to understand. 

At the moment, because normalization is effectively not part of the "rules of the road", the I18N WG is recommending that specs and implementations *not* include normalization in internal identifier matching (such as the radio button case). For caseless matching, however, we feel that C+F is the way to go and should form the basis of a caseless match algorithm. We are in the process of revising CharMod to say this (and to provide detailed and specific guidelines).

User-facing text search features (such as the "find" command in most browsers) are a separate case, and one where we feel that normalization is probably advisable. We'll cover that in CharMod as well.
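As a sketch of the distinction drawn in the last two paragraphs (again in Python, with hypothetical function names): identifier matching folds case only, while a user-facing search might also normalize. The NFD-fold-NFD ordering below is one possible choice, assumed here for illustration; this message does not prescribe a particular normalization form.

```python
import unicodedata


def identifier_match(a: str, b: str) -> bool:
    """Caseless identifier match: full case folding, no normalization."""
    return a.casefold() == b.casefold()


def search_match(a: str, b: str) -> bool:
    """Caseless match for user search: normalize, fold, normalize again.
    The use of NFD here is an assumption made for this sketch."""
    def key(s: str) -> str:
        folded = unicodedata.normalize("NFD", s).casefold()
        return unicodedata.normalize("NFD", folded)
    return key(a) == key(b)


precomposed = "CAF\u00c9"    # "CAFÉ" with U+00C9
decomposed = "cafe\u0301"    # "café" as e + U+0301 COMBINING ACUTE ACCENT

# Same code point sequence apart from case: identifiers match.
print(identifier_match(precomposed, "caf\u00e9"))  # True
# Different code point sequences: no match without normalization.
print(identifier_match(precomposed, decomposed))   # False
# With normalization the two spellings compare equal.
print(search_match(precomposed, decomposed))       # True
```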

So, for font name matching: although C+S might offer very minor efficiency gains, the I18N WG recommends that C+F be used, for consistency.

Hope that helps. 

Regards,

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.

Received on Friday, 8 March 2013 21:01:07 UTC