RE: Unicode caseless matching details [I18N-ACTION-198]

Hello CSS,

A couple of weeks ago I was tasked by the Internationalization WG [1] with responding to this thread. We discussed caseless matching and normalization (and fantasai participated in the discussion, which is minuted at [2]).

The thinking was that, because font systems are diverse and fonts themselves use different encoded sequences, capitalizations, and other variations, this is a case in which both Unicode normalization and Unicode case folding are practical and justified. We therefore recommend that you require Unicode NFC normalization and Unicode C+F ("common" plus "full") case folding when comparing font names for selection. We think this is a special case because it is isolated and should have no side effects on other parts of the Web, such as Selectors. It merely gives a style sheet the greatest likelihood of matching the intended font names as they are represented in the underlying system.
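
As a rough illustration of the comparison recommended above, here is a minimal Python sketch. It is not taken from any spec text or implementation: the function names `font_name_key` and `font_names_match` and the sample font names are made up for the example. It assumes that Python's `str.casefold()` (which applies Unicode full case folding) is an acceptable stand-in for C+F, and it normalizes to NFC a second time after folding because case folding does not always leave a string in normalized form.

```python
import unicodedata


def font_name_key(name: str) -> str:
    """Build a comparison key for a font family name (hypothetical helper).

    Steps: NFC-normalize, apply full case folding, then NFC-normalize
    again, since folding can produce text that is no longer in NFC.
    """
    normalized = unicodedata.normalize("NFC", name)
    folded = normalized.casefold()  # full case folding, e.g. "ß" -> "ss"
    return unicodedata.normalize("NFC", folded)


def font_names_match(requested: str, installed: str) -> bool:
    """Return True if two font names are equal under the key above."""
    return font_name_key(requested) == font_name_key(installed)


if __name__ == "__main__":
    # Full folding maps "ß" to "ss", so these two spellings compare equal.
    print(font_names_match("Straße Sans", "STRASSE SANS"))
    # "e" + combining acute (U+0301) and precomposed "É" (U+00C9)
    # become the same key after normalization and folding.
    print(font_names_match("Cafe\u0301 Display", "CAF\u00C9 DISPLAY"))
    # Genuinely different names still differ.
    print(font_names_match("Arial", "Arial Black"))
```

With a simple (C+S) fold the first pair would not match, because "ß" has no single-character fold to "ss"; that is the kind of difference in match fidelity discussed in the quoted message below.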

Regards (for I18N),

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.

[1] http://www.w3.org/International/track/actions/198 I18N-ACTION-198
[2] http://lists.w3.org/Archives/Public/www-international/2013JanMar/0384.html 

> -----Original Message-----
> From: Phillips, Addison
> Sent: Friday, March 08, 2013 1:00 PM
> To: CSS WWW Style (www-style@w3.org)
> Cc: www-international@w3.org
> Subject: RE: Unicode caseless matching details (was Re: [CSSWG] Minutes
> Tucson F2F 2013-02-05 Tue PM I: Fonts)
> 
> Some weeks ago, Jonathan Kew wrote:
> 
> > >
> > > Given the resistance there seems to be to implementing a -full-
> > > solution to string-equivalence issues, I don't see why we'd require
> > > people to implement anything more complex/expensive than a purely
> > > 1:1 mapping in this particular case.
> >
> > As Tab mentioned, the Internationalization Group concluded that C+F
> > was the "right" way, as posted last month [1].  The description given by
> > Addison was:
> >
> >   Case Insensitive comparison: Where CSS cannot be
> >   case-insensitive for legacy reasons or for implementation
> >   choice reasons, the I18N WG recommends that comparison be done
> >   using Unicode "common" plus "full" case fold mapping, as we
> >   previously recommended. Suggestions that this is hard to
> >   implement or low-performance are, in our opinion, unfounded, as
> >   the mapping consists of a relatively small table. There is a
> >   demonstration implementation in JavaScript and we have
> >   confirmed with our Unicode colleagues that this is the right
> >   approach [2].
> >
> > I would be fine with either C+S or C+F mappings, but I think we should
> > take care to define only a single "Unicode caseless matching" if at
> > all possible for use across the whole Web platform. I'm not especially keen
> > on defining it in the Fonts spec but for now it's only needed there.
> > I think it would be unfortunate to use C+F in some places and C+S in others.
> 
> I agree that only one should be defined.
> 
> Generally speaking, the "simple" (C+S) case is less good than the "full" (C+F)
> case for matching in part because the comparison needs to work both ways---as
> a casefold transform on the search term as well as on the searched corpus. The
> difference between C+S and C+F is mainly that the latter casefolds certain
> characters to a multicharacter sequence. This sequence may actually be the
> one used in the searched values. Using C+F therefore results in higher match
> fidelity.
> 
> >
> > I should note here that HTML5 specifies a different flavor of caseless
> > matching for radio button name attributes but I think that's actually
> > a mistake and have filed a bug on that, it's trying to use a
> > particular Unicode caseless matching algorithm to mimic the matching
> > behavior in IE, which clearly uses some flavor of platform-specific caseless
> > matching with normalization.
> >
> 
> It would be best if everyone used the same specific matching scheme for
> caseless matching. That's easier for content authors to understand.
> 
> At the moment, because normalization is effectively not part of the "rules of
> the road", the I18N WG is recommending that specs and implementations
> *not* include normalization in internal identifier matching (such as the radio
> button case). However, for caseless matching we feel that C+F is the way to go
> and should form the base for a caseless match algorithm. We are in the
> process of revising CharMod to say this (and to provide detailed and specific
> guidelines).
> 
> User text searching features (such as the "find" command in most browsers)
> are a separate topic, one where we feel that normalization is probably
> advisable and which we'll cover in CharMod.
> 
> So... for Font name matching, although there might be very minor efficiency
> gains found using C+S, the I18N WG recommends that, for consistency, C+F be
> used.
> 
> Hope that helps.
> 
> Regards,
> 
> Addison
> 
> Addison Phillips
> Globalization Architect (Lab126)
> Chair (W3C I18N WG)
> 
> Internationalization is not a feature.
> It is an architecture.
> 
> 

Received on Tuesday, 16 April 2013 16:07:11 UTC