Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization from Henri Sivonen on 2009-02-02 (www-style@w3.org from February 2009)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Mon, 2 Feb 2009 14:18:44 +0200
To: Jonathan Kew <jonathan@jfkew.plus.com>
Cc: public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-Id: <BA4C173F-8755-4E3C-A21A-D70FB553FDC0@iki.fi>

On Jan 30, 2009, at 17:02, Jonathan Kew wrote:

> On 30 Jan 2009, at 14:24, Anne van Kesteren wrote:
>> I may be biased,
>
> We all are, in various ways!
>
> I'd guess that almost all of us here are pretty comfortable using  
> English (otherwise how would we be having this discussion?), and the  
> expectation that programming and markup languages are English-based  
> is deeply ingrained. Some of us, perhaps, like to include comments  
> in another language, or even use variable names in another (normally  
> Western European) language, but that's as far as it goes.
[...]
> It's supposed to be the World Wide Web, not the Western World's  
> Web. :)

The written forms of non-English "Western" languages are not invariant  
under Unicode normalization. If one is of the opinion that it doesn't  
make sense to perform Unicode normalization of identifiers on the Web  
consumer side, it is not an issue of Western bias.

In my opinion, identifier comparison in Web languages should be done
on a code point for code point basis. The only exception is where
backward compatibility requires treating the Basic Latin characters
a–z as equivalent to A–Z; in that case those two ranges should be
considered equivalent and everything else should still be compared
code point for code point. This approach is good for performance and
backward compatibility.
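
The comparison rule described above can be sketched as follows. This is
only an illustration under stated assumptions: the function names
ascii_lower and identifiers_match and the ascii_case_insensitive flag
are hypothetical, not taken from any specification or browser.

```python
def ascii_lower(s: str) -> str:
    """Fold only Basic Latin A-Z to a-z; leave every other code point as is."""
    return "".join(
        chr(ord(ch) + 0x20) if "A" <= ch <= "Z" else ch
        for ch in s
    )


def identifiers_match(a: str, b: str, ascii_case_insensitive: bool = False) -> bool:
    """Compare two identifiers code point for code point.

    No Unicode normalization is applied. When ascii_case_insensitive is
    set (the backward compatibility case), only A-Z and a-z are treated
    as equivalent.
    """
    if ascii_case_insensitive:
        return ascii_lower(a) == ascii_lower(b)
    # Python compares str values code point by code point.
    return a == b


precomposed = "caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

print(identifiers_match(precomposed, decomposed))                    # False
print(identifiers_match("DIV", "div", ascii_case_insensitive=True))  # True
```

Under this rule the precomposed and decomposed spellings are distinct
identifiers, and a non-ASCII pair such as U+00C9 and U+00E9 is not
folded together even in the case-insensitive mode.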

I think the right place to do normalization for Web formats is in the  
text editor used to write the code, and the normalization form should  
be NFC.
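
As a rough sketch of what normalizing in the editor could look like,
the snippet below converts text to NFC before it is saved. It uses
Python's standard unicodedata module; the function name
normalize_for_save is a hypothetical stand-in for whatever hook a real
editor would provide.

```python
import unicodedata


def normalize_for_save(text: str) -> str:
    """Return the text in Normalization Form C (NFC)."""
    return unicodedata.normalize("NFC", text)


typed = "cafe\u0301"               # "e" + U+0301, as some input methods produce
saved = normalize_for_save(typed)

print(len(typed), len(saved))      # 5 4
print(saved == "caf\u00e9")        # True
```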

> An alternative would be to significantly restrict the set of  
> characters that are legal in names/identifiers. However, this tends  
> to also restrict the set of languages that can be used for such  
> names, which I don't think is a good thing.

In the context of text/html and CSS, that doesn't really solve the  
processing issue, since it would still be necessary to define behavior  
for non-conforming content.

If one is only concerned with addressing the issue for conforming  
content or interested in making problems detectable by authors, I  
think it makes sense to stipulate as an authoring requirement that both the
unparsed source text and the parsed identifiers be in NFC and make  
validators check this (but not make non-validator consumers do  
anything about it). Validator.nu already does this for HTML5, so if  
someone writes a class name with a broken text editor (i.e. one that  
doesn't normalize keyboard input to NFC), the validator can be used to  
detect the problem.
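
A check of this kind might look roughly like the sketch below. It is
not Validator.nu's actual implementation; the function name
nfc_problems and the message strings are assumptions made for
illustration. The source text and the parsed identifiers are checked
separately because the two can differ, for example when an escape
sequence in the source yields a combining character after parsing.

```python
import unicodedata
from typing import List


def nfc_problems(source: str, identifiers: List[str]) -> List[str]:
    """Report which inputs are not in NFC. Nothing is repaired here."""
    problems = []
    if unicodedata.normalize("NFC", source) != source:
        problems.append("source text is not in NFC")
    for ident in identifiers:
        if unicodedata.normalize("NFC", ident) != ident:
            problems.append("identifier %r is not in NFC" % ident)
    return problems


# A class name typed with a decomposed e-acute is flagged twice:
# once in the unparsed source and once as a parsed identifier.
for message in nfc_problems(".cafe\u0301 {}", ["cafe\u0301"]):
    print(message)
```

The function only reports problems, matching the idea that validators
check for NFC while other consumers do nothing about it.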

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Monday, 2 February 2009 12:19:27 UTC