
Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization

From: David Clarke <w3@dragonthoughts.co.uk>
Date: Fri, 06 Feb 2009 10:34:28 +0000
Message-ID: <498C1234.8080005@dragonthoughts.co.uk>
To: Henri Sivonen <hsivonen@iki.fi>
CC: public-i18n-core@w3.org, "'W3C Style List'" <www-style@w3.org>


In an ideal world, fixing all the IME systems to produce normalised 
results would be great, but it is highly impractical. Decisions would also 
need to be made regarding which normalized form is the "correct one", 
and those decisions would need to be complied with.

I was always taught that in software development you should be very 
tolerant of external data (e.g. look at how well browsers deal with 
broken HTML), and strict and consistent in your output. In short, don't 
rely on other developers doing the right thing. Your proposal, by 
contrast, would require us to fix all the IMEs that already exist.

If on the other hand we propose standards where the lack of 
normalisation is tolerated, but which require late normalisation, we can 
produce a functional result. As they stand, the normalization 
algorithms, and their checks, are fast to execute if the input is already 
normalised to the target form. With this in mind, the majority of the 
performance hit would only come when non-normalised data is presented.
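
To illustrate what I mean by late normalisation, here is a minimal 
sketch in Python. It is not taken from any specification or browser. 
The function names late_normalize and same_identifier are made up for 
this example, and the choice of NFC as the target form is purely an 
assumption, since which form is "correct" is exactly the decision that 
would still need to be made. It relies on unicodedata.is_normalized, 
which needs Python 3.8 or later.

```python
import unicodedata


def late_normalize(text: str, form: str = "NFC") -> str:
    """Normalise only when needed (hypothetical helper; NFC is an assumed choice)."""
    # Check first: input that is already in the target form is returned as-is.
    if unicodedata.is_normalized(form, text):
        return text
    # Only non-normalised input pays for the full normalisation pass.
    return unicodedata.normalize(form, text)


def same_identifier(a: str, b: str) -> bool:
    """Compare two names (e.g. a class name and a selector) after late normalisation."""
    return late_normalize(a) == late_normalize(b)


# The same word typed by two different input methods:
precomposed = "caf\u00e9"   # e-acute as a single code point
decomposed = "cafe\u0301"   # plain e followed by a combining acute accent

print(precomposed == decomposed)                 # codepoint-for-codepoint comparison
print(same_identifier(precomposed, decomposed))  # comparison after late normalisation
```

The plain comparison reports the two strings as different, while the 
late-normalised comparison treats them as the same name. Input that is 
already normalised takes only the check, not the conversion.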

I generally prefer to have my software work well and consistently 
without surprises, and performance has to be secondary to that. Of 
course I come from the school of defensive coding.

Maybe Moore's law will solve the performance issue, but only tolerant 
coding and late normalisation can ensure that the software is functional 
and reliable.

Henri Sivonen wrote:
> On Feb 5, 2009, at 19:23, Richard Ishida wrote:
>> Well, if you speak and think in excellent English there's no big deal 
>> with codepoint for codepoint comparison.  But if you speak and think 
>> in Vietnamese, Burmese, Khmer, Tamil, Malayalam, Kannada, Telugu, 
>> Sinhala, Tlįchǫ Yatìi, Dënesųłįne, Dene Zhatié–Shihgot’ine, Gwich’in, 
>> Dɛnɛsųłįnɛ, Igbo, Yoruba, Arabic, Urdu, Azeri, Tibetan, Japanese, 
>> Chinese, Russian, Serbian, etc. etc. and especially if your content 
>> is in that language, then it wouldn't be so surprising that you would 
>> want to write class names and ids in that language too, and I think 
>> we need to investigate what is needed to support that.
> Using class names or ids made of words in those languages is enabled. 
> It's just that inconsistent defects in text input software may lead to 
> surprises in some cases. However, to get rid of the surprises, the 
> text input methods should be fixed instead of complicating other 
> software.
David Clarke
Received on Friday, 6 February 2009 10:35:18 UTC
