Re: Unicode caseless matching details (was Re: [CSSWG] Minutes Tucson F2F 2013-02-05 Tue PM I: Fonts)

Jonathan Kew wrote:

>>> Given that the Unicode-caseless-matching issue here is being
>>> addressed for such a narrow problem space, I'm a bit surprised at
>>> the choice of "full" (C+F) Unicode case folding, rather than the
>>> equally well-defined but significantly simpler (and cheaper to
>>> implement) "simple" (C+S) folding.
>>>
>>> As neither version can be completely "correct" in the sense of
>>> matching a native-speaker understanding of equivalence in all
>>> situations (impossible to achieve without addressing the issues of
>>> normalization and of locale-dependent mappings, at least), I would
>>> have thought that the marginal benefits of the "full" folding
>>> would be insufficient to justify the added complexity. (Do we
>>> really expect to see fully-accented Greek used in font-family
>>> names?)
>>
>> That's what the i18n WG recommended, so shrug.
>>
>> I don't think it's any cheaper to implement, at least not in any
>> significant sense.  C+F can be done with a simple substitution
>> table. Maybe C+S can be done with a smaller table, but aside from a
>> tiny bit of binary size, the two would be identical in complexity.
> 
> I don't think that's right. The important point is that with C+S,
> every mapping is 1:1, so the required table can be very simple in
> structure, and it is guaranteed that the folded string will be
> exactly the same length (if counted in Unicode characters, or in
> UTF-16 code units).
> 
> With C+F, there are single characters that will be expanded to two
> or three characters by the mapping. This requires a more complex
> table to provide the mapping data, as well as additional code to
> handle the potential expansion of the font-family name string during
> case folding.
> 
> It's not a huge burden to implement - we can certainly do so if
> necessary - but IMO the cost is clearly non-zero, while the benefit
> is negligible.
> 
> Given the resistance there seems to be to implementing a -full-
> solution to string-equivalence issues, I don't see why we'd require
> people to implement anything more complex/expensive than a purely
> 1:1 mapping in this particular case.
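
To make the length difference concrete, here is a minimal Python
sketch. It assumes Python's built-in str.casefold() as a stand-in for
full (C+F) folding. simple_fold() is a hypothetical helper that only
approximates C+S: whenever the full folding would expand a character,
it falls back to a one-character lowercase mapping, or to the
character itself. It is not a substitute for the real CaseFolding.txt
data.

```python
def full_fold(s: str) -> str:
    """Full case folding (C+F); the result may be longer than the input."""
    return s.casefold()


def simple_fold(s: str) -> str:
    """Approximation of simple folding (C+S): one character in, one out.

    Assumption: where the full folding expands, use the one-character
    lowercase mapping if there is one, otherwise keep the character.
    """
    out = []
    for ch in s:
        folded = ch.casefold()
        if len(folded) == 1:
            out.append(folded)
        else:
            lowered = ch.lower()
            out.append(lowered if len(lowered) == 1 else ch)
    return "".join(out)


# Columns: input, input length, simple-folded length, full-folded length.
for name in ["Straße", "STRASSE", "\u0390"]:
    print(repr(name), len(name), len(simple_fold(name)), len(full_fold(name)))
```

Under this sketch "Straße" and "STRASSE" compare equal only with the
full folding, while the simple-folded string always has the same
number of characters as the input.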

As Tab mentioned, the Internationalization Group concluded that C+F
was the "right" way, as posted last month [1].  The description given
by Addison was:

  Case Insensitive comparison: Where CSS cannot be
  case-insensitive for legacy reasons or for implementation
  choice reasons, the I18N WG recommends that comparison be done
  using Unicode "common" plus "full" case fold mapping, as we
  previously recommended. Suggestions that this is hard to
  implement or low-performance are, in our opinion, unfounded, as
  the mapping consists of a relatively small table. There is a
  demonstration implementation in JavaScript and we have
  confirmed with our Unicode colleagues that this is the right
  approach [2].
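
For a rough sense of how large such a table is, the following Python
sketch counts the code points that change under full folding and how
many of those expand to more than one character. It assumes
str.casefold() reflects the C+F mappings; the exact counts depend on
the Unicode version bundled with the Python build, so none are quoted
here.

```python
import sys

changed = 0    # code points whose full folding differs from the input
expanding = 0  # of those, code points that fold to more than one character

for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    folded = ch.casefold()
    if folded != ch:
        changed += 1
        if len(folded) > 1:
            expanding += 1

print(f"{changed} code points change under full folding; {expanding} expand")
```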

I would be fine with either C+S or C+F mappings, but I think we should
take care to define only a single "Unicode caseless matching", if at
all possible, for use across the whole Web platform. I'm not
especially keen on defining it in the Fonts spec, but for now it's
only needed there. I think it would be unfortunate to use C+F in some
places and C+S in others.

I should note here that HTML5 specifies a different flavor of caseless
matching for radio button name attributes, but I think that's actually
a mistake and have filed a bug on it. It's trying to use a particular
Unicode caseless matching algorithm to mimic the matching behavior in
IE, which clearly uses some flavor of platform-specific caseless
matching with normalization.

Cheers,

John

[1] http://lists.w3.org/Archives/Public/www-style/2013Jan/0184.html
[2] https://lists.w3.org/Archives/Member/member-i18n-core/2013Jan/0003.html (member-only, grumble...)

Received on Monday, 18 February 2013 08:19:16 UTC