Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization from Henri Sivonen on 2009-02-05 (www-style@w3.org from February 2009)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Thu, 5 Feb 2009 17:52:12 +0200
To: "Philip TAYLOR (Ret'd)" <P.Taylor@Rhul.Ac.Uk>
Cc: Jonathan Kew <jonathan@jfkew.plus.com>, Andrew Cunningham <andrewc@vicnet.net.au>, public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-Id: <3349F58D-CBA8-40BA-91A0-A135BF47A8CE@iki.fi>

On Feb 5, 2009, at 17:31, Philip TAYLOR (Ret'd) wrote:

> Henri Sivonen wrote:
>
>> My point is that it's generally not helpful to bring out the  
>> Western bias[1] thing in discussions of using Unicode in computer  
>> languages. Previously, too, performance has been preferred over  
>> full natural language complexity for computer language identifier  
>> equality comparison and in that instance clearly it could not have  
>> been an issue of Western bias. The thing is that comparing computer  
>> language identifiers code point for code point is the common-sense  
>> thing to do.
>
> With respect, it is the /simplest/ thing to do.  For those
> who work in anything more complex than English, it is
> probably anything /but/ "common sense".

You do realize that the language I speak natively isn't invariant
under Unicode normalization when written? Yet, I don't insist that
e.g. XML parsers consult the Unicode database when doing string
equality matching.

(Yeah, the input methods for my native language are pretty consistent  
in producing NFC, so I don't actively feel the pain of normalization- 
inconsistent input methods, but still occasionally I need to explain  
NFD to people--mostly when Mac OS X happens to leak its internals into  
interchange.)
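
To make the NFC/NFD point concrete, here is a minimal Python sketch. The
Finnish word "ääni" is only an example picked for illustration, and the
helper name code_points is hypothetical. It shows that the same written
word has two different code point sequences, and that they only compare
equal once one side is normalized with the standard unicodedata module.

```python
import unicodedata

# "ääni" written with precomposed U+00E4 (what NFC-producing input
# methods typically emit).
word_nfc = unicodedata.normalize("NFC", "\u00e4\u00e4ni")

# The same word decomposed: each "ä" becomes "a" followed by U+0308
# COMBINING DIAERESIS.
word_nfd = unicodedata.normalize("NFD", word_nfc)


def code_points(s):
    """Return the code points of s in U+XXXX notation (illustrative helper)."""
    return [f"U+{ord(c):04X}" for c in s]


# Code point for code point comparison treats the two forms as different.
same_code_points = word_nfc == word_nfd

# Normalizing both sides to the same form makes them compare equal,
# at the cost of consulting the Unicode database.
same_after_nfc = unicodedata.normalize("NFC", word_nfd) == word_nfc

print(code_points(word_nfc))
print(code_points(word_nfd))
print(same_code_points, same_after_nfc)
```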

>> If you consider the lack of case-insensitivity, some languages are  
>> not perfectly convenienced. If you consider the lack of normalization,
>> another (overlapping) set of languages is not perfectly  
>> convenienced. If you consider the sensitivity to diacritics, yet  
>> another set of languages is not perfectly convenienced. No language  
>> is prohibited by code point for code point comparison, though.
>
> Yet for many (perhaps most) of the world's languages, comparison by  
> code-point is noticeably sub-optimal.

Sure. However, easy equality checking is a more important  
characteristic of computer language identifiers than natural language  
optimality. (The content carried by XML and HTML is a different  
story.) That identifiers aren't just binary numbers but have some  
mnemonic textual interpretation is just a bonus for convenience. We  
shouldn't get carried away thinking that natural language expression  
is the primary point of having e.g. HTML ids.
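
As a rough sketch of the trade-off, assuming Python and its standard
unicodedata module, the following compares code point equality with the
three looser notions of equality mentioned in this thread (case,
normalization, diacritics). All function names and the sample identifier
are hypothetical, chosen here for illustration. Only the first check is a
plain string comparison; the others need Unicode data lookups first.

```python
import unicodedata


def eq_code_points(a, b):
    # Plain comparison: no Unicode database involved.
    return a == b


def eq_case_insensitive(a, b):
    # Case folding needs Unicode case mapping data.
    return a.casefold() == b.casefold()


def eq_normalized(a, b):
    # Normalization needs Unicode decomposition/composition data.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def strip_marks(s):
    # Decompose, then drop characters with a non-zero combining class.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def eq_ignoring_diacritics(a, b):
    return strip_marks(a) == strip_marks(b)


ident = "p\u00e4\u00e4"            # precomposed (NFC) form
upper = "P\u00c4\u00c4"            # differs only in case
decomposed = "pa\u0308a\u0308"     # same text in NFD form
bare = "paa"                       # diacritics dropped

print(eq_code_points(ident, ident))
print(eq_code_points(ident, upper), eq_case_insensitive(ident, upper))
print(eq_code_points(ident, decomposed), eq_normalized(ident, decomposed))
print(eq_code_points(ident, bare), eq_ignoring_diacritics(ident, bare))
```

Each looser comparison accepts a pair that code point comparison rejects,
and none of them prevents any language from being used in an identifier.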

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Thursday, 5 February 2009 15:52:55 UTC