Re: [css3-fonts][css-variables][css-counter-styles-3][css3-values] Case sensitivity of user-defined identifiers from John Daggett on 2012-10-01 (www-style@w3.org from October 2012)

From: John Daggett <jdaggett@mozilla.com>
Date: Sun, 30 Sep 2012 19:59:12 -0700 (PDT)
To: www-style list <www-style@w3.org>
Message-ID: <1338754109.1272691.1349060352798.JavaMail.root@mozilla.com>

Tab Atkins wrote:

> At the last F2F we discussed the issue of the case of user-defined
> identifiers.  CSS-defined idents are all case-insensitive; however,
> they are also all within the ASCII range, so it's very trivial to
> define what that means.  User-defined idents can contain all of
> Unicode, which potentially makes the problem much larger. 
> Currently, we define that user-defined idents are case-sensitive,
> but that's not easily compatible with several things we're doing
> where user-defined idents mix with CSS-defined idents, or where
> something that was previously CSS-defined is reinterpreted into a
> user-defined one in the UA stylesheet.
> 
> So, at the F2F we agreed to make user-defined idents
> case-insensitive, but deferred defining exactly what that means
> until we get some more research on what's necessary/possible.
> 
> There are a few reasonable options:
> 
> 1. ASCII case-insensitivity.  This is the minimal change to be
> compatible with CSS-defined idents migrating to being user-defined.
> However, it's very limited - if you write an ident in a language
> other than English, you may very well run up against casing issues
> that should be "obvious" to solve.
> 
> 2. Latin1 case-insensitivity.  This allows case-insensitivity within
> all the "English-like" languages, so that, for example, an accented
> lowercase letter is equal to its uppercase version.
> 
> 3. Unicode case-insensitivity.  I'm not very familiar with the
> possibility-space here, but I'm told there are algorithms for doing
> case-folding in Unicode.  We specifically need one that can
> *canonicalize* a string into a comparable form, so we can turn it
> into an atomic string and do fast comparisons later, rather than one
> that can only be used to compare two strings.
>
> jdaggett was supposed to look into option 3 and report back.  John,
> any progress?  This is a blocking issue for both Variables and
> Counter Styles.

To start with, Unicode already provides a document describing how to properly
define identifier syntax for Unicode strings [1].  However, this effectively
requires support for full case mapping plus normalization.
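
As a rough illustration of why normalization comes into it, the Python sketch
below compares two spellings of the same identifier, one using a precomposed
"é" and one using "e" plus a combining accent.  The helper name
normalized_ident is made up for this example and the choice of NFC is an
assumption; it is not a form that CSS specifies.

```python
import unicodedata

composed = "caf\u00e9"     # "é" as the single code point U+00E9
decomposed = "cafe\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT

def normalized_ident(ident: str) -> str:
    """Hypothetical helper: bring an identifier into NFC before comparing."""
    return unicodedata.normalize("NFC", ident)

# The two strings render identically but differ code point by code point.
raw_equal = composed == decomposed
# After normalization they compare equal.
normalized_equal = normalized_ident(composed) == normalized_ident(decomposed)
```

Without a normalization step, these two identifiers would be treated as
different names even under a case-sensitive comparison.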

The case mapping rules defined in the CaseFolding.txt file of the Unicode
database provide for two basic ways of producing "canonical" strings: simple
case mapping, where all mappings are 1-to-1 replacements and nothing is
language-specific, and full case mapping, where 1-to-n replacements are
possible and some of the mappings can be language-sensitive.  When people say
"Unicode case mapping is hard" they're really referring to *full* case
mapping, not the simple variant.
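
The 1-to-n behavior of full case folding can be seen with Python's
str.casefold(), which applies the default (not language-specific) full case
folding.  This is only a sketch of the concept: the helper name ident_key is
made up here, and as far as I know the Python standard library has no
built-in equivalent for the simple 1-to-1 variant.

```python
def ident_key(ident: str) -> str:
    """Hypothetical helper: canonical key for comparing identifiers,
    using default full case folding (no language tailoring)."""
    return ident.casefold()

# Full case folding is 1-to-n: the single character "ß" folds to "ss".
sharp_s_folded = "\u00df".casefold()

# As a result, these two spellings collapse to the same canonical key.
same_key = ident_key("Stra\u00dfe") == ident_key("STRASSE")

# The default folding is not language sensitive: "I" always folds to "i",
# whereas a Turkish-tailored folding would map it to dotless "ı" (U+0131).
default_i = "I".casefold()
```

Because the folded key is an ordinary string, it can be interned once and
compared cheaply afterwards, which is the "canonicalize" property asked for
in option 3.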

My thought originally was that (3) could be implemented using simple case
mapping, rather than doing full case mapping.  But after doing some testing of
how case insensitivity is currently implemented, I think the existing behavior
is enough of a hodgepodge that it doesn't make sense to create a differentiation
just for user-defined identifiers.  For example, existing element IDs and
class attributes *are* case sensitive but counters are not.  However, lookups
done using document.getElementById are case *insensitive*.  Oy, my head hurts...

So I think we should stick with (1) and not try to create additional casing
rules.  I'm not suggesting this is ideal, but I think the "ideal" way of using
normalization and full case mapping needs to first be addressed in a web-wide way
rather than just within CSS.
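
For reference, option (1) amounts to something like the following Python
sketch: only A-Z are mapped to a-z and every other code point is left alone.
The names ascii_fold and _ASCII_LOWER are made up for this example.

```python
import string

# Translation table that maps only the 26 ASCII uppercase letters.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

def ascii_fold(ident: str) -> str:
    """Hypothetical helper: ASCII case-insensitive canonical form.
    Code points outside A-Z, including all non-ASCII letters, are unchanged."""
    return ident.translate(_ASCII_LOWER)

ascii_only = ascii_fold("My-Counter")  # every ASCII letter is lowercased
mixed = ascii_fold("\u00c9CLAIR")      # the leading "É" (U+00C9) is left as-is
```

A translation table is used rather than str.lower() because str.lower() also
changes non-ASCII characters (for example, it maps U+212A KELVIN SIGN to
"k"), which would go beyond ASCII case-insensitivity.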

> However, it's very limited - if you write an ident in a language
> other than English, you may very well run up against casing issues
> that should be "obvious" to solve.

I don't understand what this sentence implies.  The existing rule for CSS is
case-sensitive matching outside the ASCII range.  What are the "casing issues"
here?  Yes, it's simple and crude and by no means ideal, but it is what it is.
I'm not sure I see "issues" here.

Cheers,

John Daggett

[1] http://www.unicode.org/reports/tr31/
Received on Monday, 1 October 2012 02:59:40 UTC