Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization from Henri Sivonen on 2009-02-05 (public-i18n-core@w3.org from January to March 2009)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Thu, 5 Feb 2009 12:39:08 +0200
To: Andrew Cunningham <andrewc@vicnet.net.au>
Cc: Jonathan Kew <jonathan@jfkew.plus.com>, public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-Id: <9BA5EEA9-76CF-4B1D-9B30-B92E019F9ACA@iki.fi>
On Feb 5, 2009, at 01:02, Andrew Cunningham wrote:

>>> if a browser can't render combining diacritics, then it will not  
>>> be able to render NFC data when the NFC data uses combining  
>>> diacritics.
>>
>> Right. However, that does not make those browsers useless, because  
>> there are a lot of things that can be communicated with precomposed  
>> characters.
>
> Depends on the language.

That's my point.

>>> So for a "legacy" browser when a document contains combining  
>>> diacritics it doesn't matter if the text is NFC or NFD, it will  
>>> not correctly render it.
>>>
>>> For legacy browsers, Unicode will always be a barrier regardless  
>>> of normalisation form.
>>
>> Only for cases where the mapping from characters to graphemes is  
>> not one-to-one. In a lot of cases that have utility, the mapping is  
>> one-to-one.
>
> Most writing scripts aren't simple one-to-one mappings

My point is that saying that Unicode is *always* a barrier for  
software that does one-to-one rendering is hyperbole. There's a  
barrier for scripts where even the NFC form involves combining  
characters. And there's a lot of utility in cases where the barrier  
doesn't apply.
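
To make the distinction concrete, here is a minimal Python sketch (an  
illustration only, not part of the original exchange). It uses the  
standard unicodedata module; the helper name nfc_still_has_combining  
is hypothetical. It reports whether a string still contains combining  
characters after NFC normalization, which is the case where one-to-one  
rendering breaks down.

```python
import unicodedata


def nfc_still_has_combining(text: str) -> bool:
    """Return True if the NFC form of text still contains combining marks."""
    nfc = unicodedata.normalize("NFC", text)
    # unicodedata.combining() returns the canonical combining class,
    # which is non-zero for combining marks such as U+0300 or U+0301.
    return any(unicodedata.combining(ch) != 0 for ch in nfc)


# "e" + COMBINING ACUTE ACCENT composes to the single character U+00E9.
print(nfc_still_has_combining("e\u0301"))
# "m" + COMBINING GRAVE ACCENT has no precomposed form, so the mark stays.
print(nfc_still_has_combining("m\u0300"))
```

The first string is within reach of one-to-one rendering once it is in  
NFC; the second is not, whatever the normalization form.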

> yep, personally i think default wholesale normalisation would be  
> interesting, defaulting to NFC. But I'd want a mechanism in CSS and  
> in the browsers for the web developer to specify alternative  
> behaviour when required. I think normalisation is required. But I'd  
> also like to have the flexibility of using the normalisation form  
> appropriate to the web development project at hand.

Isn't the simple way for getting author-controlled normalization to  
let authors normalize at their end? It's something that an author can  
deploy without waiting for a browser upgrade cycle.
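
As a rough sketch of what normalizing at the author's end could look  
like (an assumption about one possible workflow, not a description of  
any particular tool), a publishing step might pass every text resource  
through a function like the one below. The function name  
normalize_for_publishing and its default form are hypothetical.

```python
import unicodedata


def normalize_for_publishing(text: str, form: str = "NFC") -> str:
    """Normalize text to the form the project has chosen before serving it."""
    if form not in ("NFC", "NFD", "NFKC", "NFKD"):
        raise ValueError(f"unknown normalization form: {form}")
    return unicodedata.normalize(form, text)


# A project that prefers precomposed characters keeps the default.
composed = normalize_for_publishing("e\u0301")
# A project that prefers decomposed data asks for NFD instead.
decomposed = normalize_for_publishing("\u00e9", form="NFD")
print(len(composed), len(decomposed))
```

The choice of form stays with the author, and nothing here depends on  
what the browser does.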

>>>>> esp if you also want to comply with certain AAA checkpoints in  
>>>>> WCAG 2.0.
>>>>
>>>> Hold on. What WCAG 2.0 checkpoints require content *not* to be in  
>>>> NFC? If that's the case, there is a pretty serious defect  
>>>> *somewhere*.
>>>>
>>> As far as I know WCAG 2.0 is normalisation form agnostic, it  
>>> doesn't require any particular normalisation form. But there is  
>>> stuff about guidance for pronunciation, and for tonal African  
>>> languages this means dealing with tone marking (where in day to  
>>> day usage it isn't included) - partly for language learners,  
>>> students and in some cases to aid in disambiguating ideas or words.  
>>> It could be handled at the server end or at the client end. To  
>>> handle it at the client end, it is easier to use NFD data and, for  
>>> languages like Igbo, etc., run a simple regex to toggle between tonal  
>>> versions and standard versions.
>>
>> I see. This doesn't mean that serving content in NFD is *required*  
>> only that one implementation strategy for a case that is unusual on  
>> a global scale becomes *easier* if the DOM data is in NFD.
>>
> yes, nor is it an argument against normalisation, rather a  
> recommendation for some control of normalisation forms by the web  
> developer.

To me, it seems like the ability to normalize in one's content  
creation workflow before the data travels over HTTP to a client and  
having JavaScript functions for normalizing strings would give the  
appropriate level of control to Web developers who want it.
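
For illustration, the tone-mark toggle discussed above could be done  
with a string normalization function plus a regex, regardless of the  
form in which the data was served. The sketch below is in Python rather  
than JavaScript and rests on a loud assumption: that the tone marks to  
remove are the combining grave, acute, and macron (U+0300, U+0301,  
U+0304), and that other marks such as the dot below are part of the  
standard orthography. The names TONE_MARKS and strip_tone_marks are  
hypothetical.

```python
import re
import unicodedata

# Assumed set of tone marks: combining grave, acute, and macron.
# Marks such as COMBINING DOT BELOW (U+0323) are deliberately kept.
TONE_MARKS = re.compile("[\u0300\u0301\u0304]")


def strip_tone_marks(text: str) -> str:
    """Return text without tone marks, recomposed to NFC."""
    # Decompose first so every tone mark is a separate code point.
    decomposed = unicodedata.normalize("NFD", text)
    without_tones = TONE_MARKS.sub("", decomposed)
    # Recompose so the remaining marks use precomposed characters
    # where those exist.
    return unicodedata.normalize("NFC", without_tones)


# Precomposed input is decomposed internally before the regex runs.
print(strip_tone_marks("\u00e0kw\u00e1"))
```

Because the function decomposes its input itself, the served data can  
stay in NFC and the toggle still works.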

> For some of the core Vista fonts I get better typographic display  
> using combining diacritics.


Seems like a font bug if the pre-composed glyphs are worse.

>> I see. Still, I think it's more reasonable that teams whose  
>> multipart graphemes don't have an obvious order for the subparts of  
>> the grapheme bear the cost of dealing with this complex feature of  
>> their writing system and for the sake of performance every browser,  
>> XML parser, etc. around the world on all kinds of devices doesn't  
>> burn cycles (time/electricity/CO₂) just *in case* there happens to  
>> be a string compare where combining characters might have been  
>> inconsistently ordered.
>>
> That assumes that the development team is even aware of the issue.  
> I wonder how many non-Vietnamese web developers know or understand  
> the impact different input systems will have on a Vietnamese project  
> they may be working on.

Do non-Vietnamese Web developers working on Vietnamese content use  
fully diacritical Vietnamese words as computer language identifiers  
such as HTML class names and ids? Is the case of non-Vietnamese  
developers working on a Vietnamese project without proper  
understanding of Vietnamese issues globally important enough a case  
that all software everywhere should burn more cycles when interning  
strings instead of the authoring software used by such teams burning  
more cycles?
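
To show the kind of mismatch at stake, here is a small Python sketch  
(illustrative only; the function name same_identifier is hypothetical).  
Three different code point sequences for the same Vietnamese letter  
compare unequal as raw strings but equal once both sides are  
normalized, which is the extra work the question above is about.

```python
import unicodedata


def same_identifier(a: str, b: str) -> bool:
    """Compare two identifiers after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Three spellings of "e with circumflex and dot below" (U+1EC7).
dot_then_circumflex = "e\u0323\u0302"
circumflex_then_dot = "e\u0302\u0323"
precomposed_plus_dot = "\u00ea\u0323"

# A plain code point comparison treats them as different strings.
print(dot_then_circumflex == circumflex_then_dot)
# Normalizing both sides first makes them match.
print(same_identifier(dot_then_circumflex, circumflex_then_dot))
print(same_identifier(circumflex_then_dot, precomposed_plus_dot))
```

Running the normalization once in the authoring tool gives the same  
result as running it on every comparison in every consumer.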

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Thursday, 5 February 2009 10:39:51 UTC