Re: Results of CSS case-sensitivity discussion at TPAC from John Daggett on 2012-12-14 (www-international@w3.org from October to December 2012)

From: John Daggett <jdaggett@mozilla.com>
Date: Thu, 13 Dec 2012 23:29:27 -0800 (PST)
To: WWW International <www-international@w3.org>
Message-ID: <1179148702.6216841.1355470167941.JavaMail.root@mozilla.com>

Addison Phillips wrote:

> Anne van Kesteren wrote:
> 
> > I think this is overkill. The whole platform apart from JavaScript
> > uses ASCII case-insensitivity, largely for historical reasons
> > (code-point-for-code-point would be preferable). There's no reason
> > to invoke complex algorithms to compare CSS identifiers as long as
> > they are not necessitated elsewhere.
>
> "The whole platform" is an overstatement. HTML5 carefully avoids the
> issues: ASCII case insensitivity is used extensively where the
> namespace is already limited to (ASCII-only) identifiers. Where
> non-ASCII tokens can appear, you invariably find that HTML5
> specifies "case-sensitive" comparison (which I may as well point out
> is also a "normalization sensitive" comparison, code point by code
> point).
> 
> The "complex algorithms" you mention might actually be easier to
> implement than you think, though. Most case-insensitive comparison
> functions in standard libraries are actually internationalized and
> already do the approximately right thing. A quick survey of browsers
> on my desktop computer using the following page shows that IE9,
> Opera, Safari, and Chrome are already non-ASCII case-insensitive
> (only FF seems to be ASCII-only case-insensitive):
> 
>    http://www.inter-locale.com/test/css-case-sensitive-test.html
> 
> So a case could even be made that non-ASCII caseless comparison is
> actually "what browsers do".

As others have pointed out, there are a myriad of ways browsers treat
the notion of a case-insensitive comparison (*sigh*).  See my previous
post, which included a set of testcases [1], and some of the weird
code oddities it turned up [2].  But to be clear, no user agent today
uses full case mappings (C + F) in any situation.
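
To make the distinction concrete, here is a minimal Python sketch of
the two ends of the spectrum: ASCII-only case-insensitive matching
versus matching under full case folding (C + F).  The function names
are made up for illustration, and Python's str.casefold() stands in
for the full folding; this is not a description of any browser's code.

```python
def ascii_lower(s: str) -> str:
    """Lowercase only A-Z; every other code point is left untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def ascii_caseless_equal(a: str, b: str) -> bool:
    """ASCII case-insensitive match (hypothetical helper name)."""
    return ascii_lower(a) == ascii_lower(b)


def full_caseless_equal(a: str, b: str) -> bool:
    """Match under Unicode full case folding via str.casefold()."""
    return a.casefold() == b.casefold()


# ASCII letters match either way.
print(ascii_caseless_equal("Counter", "COUNTER"))
# U+00C9 (E with acute) vs U+00E9: differs under ASCII-only matching.
print(ascii_caseless_equal("\u00c9", "\u00e9"))
print(full_caseless_equal("\u00c9", "\u00e9"))
# Full folding maps U+00DF (sharp s) to "ss", so lengths can change.
print(full_caseless_equal("STRASSE", "stra\u00dfe"))
```

The last comparison is the kind of case that only the F (full)
mappings cover, because the folded string is longer than the input.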

I think Anne is fully justified in calling full Unicode case matching
"complex" and I'm mystified why you would suggest otherwise.  Unicode
already has extensive documentation on case algorithms [3] and an
entire annex describing precisely the problem faced in CSS, namely
case-insensitive matching of identifiers:

  Unicode Identifier And Pattern Syntax
  http://www.unicode.org/reports/tr31/

From Unicode 6.2, Chapter 3, pg. 117:

  A modified form of Default Case Folding is designed for
  best behavior when doing caseless matching of strings
  interpreted as identifiers. This folding is based on
  Case_Folding(C), but also removes any characters which
  have the Unicode property value
  Default_Ignorable_Code_Point=True. It also maps characters
  to their NFKC equivalent sequences. Once the mapping for a
  string is complete, the resulting string is then
  normalized to NFC. That last normalization step simplifies
  the statement of the use of this folding for caseless
  matching.

It seems like what is being suggested for CSS is equivalent to using
the toNFKC_Casefold(X) function described on pg. 118.  This involves
two normalization steps and the C + F case mapping step.  UAX 31 also
suggests several modifications to the normal NFKC procedure:

  http://www.unicode.org/reports/tr31/#NFKC_Modifications
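
As a rough illustration of how many moving parts this has, here is a
Python sketch that only approximates the folding quoted above.  It is
not the real toNFKC_Casefold(X): the ignorable set below is a small,
hand-picked sample of Default_Ignorable_Code_Point (Python's
unicodedata module does not expose that property), the UAX 31 NFKC
modifications are not applied, and the function name is made up.

```python
import unicodedata

# ASSUMPTION: a tiny, incomplete sample of code points that have
# Default_Ignorable_Code_Point=True (soft hyphen, zero-width space,
# ZWNJ, ZWJ, word joiner, BOM). The real property covers many more.
ASSUMED_IGNORABLES = {0x00AD, 0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}


def approx_nfkc_casefold(s: str) -> str:
    """Approximate identifier folding: drop ignorables, fold, NFKC, NFC."""
    # Step 1: remove the (sampled) default-ignorable code points.
    s = "".join(c for c in s if ord(c) not in ASSUMED_IGNORABLES)
    # Step 2: full case folding (C + F) followed by NFKC. The pair is
    # applied twice because NFKC can expose characters that still fold.
    for _ in range(2):
        s = unicodedata.normalize("NFKC", s.casefold())
    # Step 3: final normalization to NFC, as in the quoted definition.
    return unicodedata.normalize("NFC", s)


def identifiers_match(a: str, b: str) -> bool:
    """Caseless identifier match under the approximation above."""
    return approx_nfkc_casefold(a) == approx_nfkc_casefold(b)


# U+FB01 (fi ligature), a fullwidth capital A, and a soft hyphen all
# disappear as distinctions under this folding.
print(identifiers_match("\ufb01le", "FILE"))
print(identifiers_match("\uff21bc", "abc"))
print(identifiers_match("co\u00adunter", "Counter"))
```

Even this simplified version needs a property table, a case folding
table, and two normalization forms, which is the point about
complexity.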

There may be code already that accurately implements these exact steps
(ICU?) but calling them "complex" is entirely appropriate.  To be
entirely fair, there are similar issues already with case-sensitive
matching, since currently no user agent attempts to do normalization
as part of the comparison, which means that strings containing
combining diacritics won't match equivalent strings containing
precomposed characters.  But I think these problems need to be
addressed consistently across the platform and not just in particular
cases such as CSS.
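
The normalization problem with case-sensitive matching can be shown in
a few lines of Python.  This is only a sketch with made-up function
names; it contrasts a plain code-point comparison with one that
normalizes both sides to NFC first.

```python
import unicodedata


def codepoint_equal(a: str, b: str) -> bool:
    """Case-sensitive, code-point-by-code-point comparison."""
    return a == b


def nfc_equal(a: str, b: str) -> bool:
    """Case-sensitive comparison after normalizing both sides to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "caf\u00e9"   # ends in U+00E9 (e with acute)
decomposed = "cafe\u0301"   # ends in "e" + U+0301 (combining acute)

# The two strings render identically but differ code point by code point.
print(codepoint_equal(precomposed, decomposed))
# After NFC normalization they compare equal.
print(nfc_equal(precomposed, decomposed))
# Normalization alone does not make the match case-insensitive.
print(nfc_equal("Caf\u00e9", decomposed))
```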

> If case-insensitivity is not a feature but is merely an historical
> artifact, then ASCII-only insensitivity makes some kind of sense,
> albeit one that is equally inconvenient to implement (if you have to
> write a special function to implement "caseless comparison" anyway
> and then build tests to ensure that it works correctly and
> performantly). But it still makes for crummy authoring experience.
> You have to keep in mind all the time the difference between ASCII
> keys on your keyboard and "the other keys". And software generated
> pages can be harder to manage too.

I think "crummy authoring experience" is hyperbole here.  ASCII case
insensitivity is useful when the design language is composed
entirely of ASCII-only keywords, as is the case in HTML and CSS.  I
doubt Turkish authors are in any way disrupted by not being able to
use dotted I's in their tag names.

Having CSS user-defined identifiers follow this was simply a matter of
consistency with what's been done in the past.  But all user agents
currently consistently implement case-sensitive matching for counters,
the one case of user-defined identifiers in CSS 2.1.  So maybe the
most sensible solution here is simply to continue existing behavior
and make all user-defined identifiers match case-sensitively,
defined in the same loosey-goosey, hand-wavy way it's been defined up
until now.

Regards,

John Daggett

[1] http://lists.w3.org/Archives/Public/www-style/2012Dec/0149.html
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=820909
[3] http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf
Received on Friday, 14 December 2012 07:29:58 UTC