Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization

On 30 Jan 2009, at 14:24, Anne van Kesteren wrote:

>
> On Fri, 30 Jan 2009 15:12:35 +0100, Richard Ishida <ishida@w3.org> wrote:
>> [This thread grew out of one that didn't include www-style, and has
>> since forked a little.  I am therefore pointing to a couple of emails
>> (on the i18n public list) that didn't reach www-style but that I think
>> are relevant.  I suggest that we henceforth keep both public-i18n
>> and www-style copied on all emails related to this topic. ]
>>
>> See Martin's email at
>> http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0039.html
>>
>> See my response at
>> http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0041.html
>
> So
>
> 1) Do browsers normalize currently?
>
> 2) Assuming they do not, who has complained?
>
> I may be biased,

We all are, in various ways!

I'd guess that almost all of us here are pretty comfortable using  
English (otherwise how would we be having this discussion?), and the  
expectation that programming and markup languages are English-based is  
deeply ingrained. Some of us, perhaps, like to include comments in  
another language, or even use variable names in another (normally  
Western European) language, but that's as far as it goes.

With standards such as HTML and CSS, however, we should be building a  
Web that is equally welcoming to people of all cultures and languages,  
including those who would struggle to come up with variable names in  
any Latin-script language, even if they have learned to recognize  
basic tags like <html>. Where there are technical hurdles -- such as  
multiple binary representations of the same text elements -- we should  
use the tools available to us (in this case, canonical equivalence as  
defined by Unicode) to minimize the barriers these will present to the  
"have-nots" of the digital world, not just ignore them because they  
don't significantly impact the "haves".
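
To make "multiple binary representations" concrete, here is a minimal Python sketch. The class name "café" is made up for illustration, and the helper canonically_equal is hypothetical; only unicodedata.normalize comes from the standard library. It shows two canonically equivalent spellings of one name that differ code point by code point, and compare equal only after normalization:

```python
import unicodedata

# Two spellings of the same (made-up) class name "café".
precomposed = "caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A plain code point comparison treats the two spellings as different names.
print(precomposed == decomposed)                    # False
# Their encoded bytes differ too, although they render identically.
print(precomposed.encode("utf-8"), decomposed.encode("utf-8"))
# After normalization they match.
print(canonically_equal(precomposed, decomposed))   # True
```

Which form an author ends up typing usually depends on the keyboard or input method, not on any deliberate choice.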

It's supposed to be the World Wide Web, not the Western World's Web. :)

> but I have the feeling that performing Unicode Normalization on code  
> snippets is overkill.

An alternative would be to significantly restrict the set of  
characters that are legal in names/identifiers. However, this tends to  
also restrict the set of languages that can be used for such names,  
which I don't think is a good thing.

It seems to me that this issue is similar to that of Internationalized  
Domain Names, where it certainly isn't considered acceptable for there  
to be canonically-equivalent names that are treated as distinct.

> It could potentially also make certain class names and IDs identical  
> that are now different/unique. Seems like a bad idea to me.

If anyone has created content that relies on a distinction between  
names/IDs that would be erased by normalization -- i.e., names that  
are canonically equivalent, as defined by Unicode -- then that data  
and/or the processes using it are not compliant with the Unicode  
standard, and they're liable to break at some undefined point in the  
future when they attempt to interoperate with products or data that  
*are* Unicode-compliant. That's really a bad idea. Better to face the
issue now, define appropriate and robust standards, and encourage
anyone who currently has such ill-designed data to fix it. (I
doubt it actually exists, though.)
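
For anyone who wants to check their own content, a rough sketch along these lines would list names that are distinct today but would become identical under normalization. The function name normalization_collisions and the sample names are assumptions for illustration, and NFC is just one possible choice of form:

```python
import unicodedata
from collections import defaultdict


def normalization_collisions(names):
    """Group names by their NFC form; return only groups with more than one spelling."""
    groups = defaultdict(set)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].add(name)
    return {nfc: spellings for nfc, spellings in groups.items() if len(spellings) > 1}


# Made-up class names / IDs, as they might be collected from a document.
sample = [
    "caf\u00e9",    # precomposed e-acute
    "cafe\u0301",   # e + combining acute accent
    "menu",         # plain ASCII, unaffected
    "\u212b",       # ANGSTROM SIGN
    "\u00c5",       # LATIN CAPITAL LETTER A WITH RING ABOVE
]

for nfc, spellings in normalization_collisions(sample).items():
    print(repr(nfc), "<-", sorted(repr(s) for s in spellings))
```

An empty result would mean that normalizing changes nothing about which names are distinct in that content.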

JK

Received on Friday, 30 January 2009 15:03:09 UTC