- From: John Daggett <jdaggett@mozilla.com>
- Date: Thu, 13 Dec 2012 23:29:27 -0800 (PST)
- To: WWW International <www-international@w3.org>
Addison Phillips wrote:
> Anne van Kesteren wrote:
>
> > I think this is overkill. The whole platform apart from JavaScript
> > uses ASCII case-insensitivity, largely for historical reasons
> > (code-point-for-code-point would be preferable). There's no reason
> > to invoke complex algorithms to compare CSS identifiers as long as
> > they are not necessitated elsewhere.
>
> "The whole platform" is an overstatement. HTML5 carefully avoids the
> issues: ASCII case insensitivity is used extensively where the
> namespace is already limited to (ASCII-only) identifiers. Where
> non-ASCII tokens can appear, you invariably find that HTML5
> specifies "case-sensitive" comparison (which I may as well point out
> is also a "normalization sensitive" comparison, code point by code
> point).
>
> The "complex algorithms" you mention might actually be easier to
> implement than you think, though. Most case-insensitive comparison
> functions in standard libraries are actually internationalized and
> already do the approximately right thing. A quick survey of browsers
> on my desktop computer using the following page shows that IE9,
> Opera, Safari, and Chrome are already non-ASCII case-insensitive
> (only FF seems to be ASCII-only case-insensitive):
>
> http://www.inter-locale.com/test/css-case-sensitive-test.html
>
> So a case could even be made that non-ASCII caseless comparison is
> actually "what browsers do".

As others have pointed out, there are a myriad of ways browsers treat
the notion of a case-insensitive comparison (*sigh*). See my previous
post which included a set of testcases [1] and some of the weird code
oddities it turned up [2]. But to be clear, there is no situation
where full case mappings (C + F) are used by a user agent today.

I think Anne is fully justified in calling full Unicode case matching
"complex" and I'm mystified why you would suggest otherwise.
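
To make the distinction concrete, here is a minimal Python sketch
contrasting ASCII-only caseless matching with matching based on full
case folding (C + F). It is an illustration only, not code from any
user agent; the function names are hypothetical, and it relies on
Python's str.casefold(), which applies Unicode full case folding.

```python
def ascii_lower(s: str) -> str:
    """Lowercase only A-Z; every other code point is left untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def ascii_caseless_equal(a: str, b: str) -> bool:
    """ASCII case-insensitive match (hypothetical helper name)."""
    return ascii_lower(a) == ascii_lower(b)


def full_caseless_equal(a: str, b: str) -> bool:
    """Caseless match using full case folding via str.casefold()."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    pairs = [
        ("Counter", "COUNTER"),
        ("\u00c9tat", "\u00e9tat"),     # É vs é
        ("stra\u00dfe", "STRASSE"),     # ß folds to "ss" under full folding
    ]
    for a, b in pairs:
        print(a, b, ascii_caseless_equal(a, b), full_caseless_equal(a, b))
```

Note that under full folding a single character can fold to several
(ß becomes "ss"), so the folded string may differ in length from the
input. An ASCII-only comparison never has to deal with that.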

Unicode already has extensive documentation on case algorithms [3]
and an entire annex describing precisely the problem faced in CSS,
namely case-insensitive matching of identifiers:

  Unicode Identifier And Pattern Syntax
  http://www.unicode.org/reports/tr31/

From Unicode 6.2, Chapter 3, pg. 117:

  A modified form of Default Case Folding is designed for best
  behavior when doing caseless matching of strings interpreted as
  identifiers. This folding is based on Case_Folding(C), but also
  removes any characters which have the Unicode property value
  Default_Ignorable_Code_Point=True. It also maps characters to their
  NFKC equivalent sequences. Once the mapping for a string is
  complete, the resulting string is then normalized to NFC. That last
  normalization step simplifies the statement of the use of this
  folding for caseless matching.

It seems like what is being suggested for CSS is equivalent to the use
of the toNFKC_Casefold(X) function described on pg. 118. This
involves two normalization steps and the C + F case mapping step. UAX
31 also suggests several modifications to the normal NFKC procedure:

  http://www.unicode.org/reports/tr31/#NFKC_Modifications

There may be code already that accurately implements these exact steps
(ICU?) but calling them "complex" is entirely appropriate.

To be entirely fair, there are similar issues already with
case-sensitive matching, since currently no user agent attempts to do
normalization as part of the comparison, which means that strings
containing combining diacritics won't match equivalent strings
containing precomposed characters. But I think these problems need to
be addressed consistently across the platform and not just in
particular cases such as CSS.
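
The steps in the quoted passage can be roughly sketched in Python
using only the standard unicodedata module. This is an approximation
under stated assumptions, not the exact NFKC_Casefold mapping: the
function name is hypothetical, the set of ignorable characters is a
small hand-picked subset of Default_Ignorable_Code_Point (the standard
library does not expose that property), and the UAX 31 NFKC
modifications are not applied.

```python
import unicodedata

# Assumption: illustrative subset only. The real
# Default_Ignorable_Code_Point set is considerably larger.
IGNORABLE_SUBSET = {
    "\u00ad",  # SOFT HYPHEN
    "\u034f",  # COMBINING GRAPHEME JOINER
    "\u200b",  # ZERO WIDTH SPACE
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE
}


def approx_identifier_key(s: str) -> str:
    """Approximate identifier caseless key (hypothetical helper)."""
    # Fold case on the canonically decomposed string.
    folded = unicodedata.normalize("NFD", s).casefold()
    # Apply compatibility decomposition, then fold again, because
    # decomposition can expose characters that still need folding.
    folded = unicodedata.normalize("NFKD", folded).casefold()
    # Drop the (subset of) default-ignorable characters.
    stripped = "".join(c for c in folded if c not in IGNORABLE_SUBSET)
    # NFKC output is also in NFC, so this covers the final NFC step.
    return unicodedata.normalize("NFKC", stripped)


if __name__ == "__main__":
    # Code-point comparison without normalization: these differ.
    print("\u00e9" == "e\u0301")
    # With the approximate key they compare equal.
    print(approx_identifier_key("\u00c9") == approx_identifier_key("e\u0301"))
    print(approx_identifier_key("\ufb01le") == approx_identifier_key("FILE"))
```

Even this simplified version needs two normalization passes, two
folding passes and a property lookup, compared with a single
byte-range check for ASCII-only matching.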

> If case-insensitivity is not a feature but is merely an historical
> artifact, then ASCII-only insensitivity makes some kind of sense,
> albeit one that is equally inconvenient to implement (if you have to
> write a special function to implement "caseless comparison" anyway
> and then build tests to ensure that it works correctly and
> performantly). But it still makes for crummy authoring experience.
> You have to keep in mind all the time the difference between ASCII
> keys on your keyboard and "the other keys". And software generated
> pages can be harder to manage too.

I think "crummy authoring experience" is hyperbole here. ASCII case
insensitivity is useful when the design language is entirely composed
of ASCII-only keywords, as is the case in HTML and CSS. I doubt
Turkish authors are in any way disrupted by not being able to use
dotted I's in their tag names.

Having CSS user-defined identifiers follow this was simply a matter of
consistency with what's been done in the past. But all user agents
currently implement case-sensitive matching for counters, the one case
of user-defined identifiers in CSS 2.1. So maybe the most sensible
solution here is simply to continue existing behavior and make all
user-defined identifiers be matched case-sensitively, defined in the
same loosey-goosey, hand-wavy way it's been defined up until now.

Regards,

John Daggett

[1] http://lists.w3.org/Archives/Public/www-style/2012Dec/0149.html
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=820909
[3] http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf
Received on Friday, 14 December 2012 07:29:58 UTC