Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization from Henri Sivonen on 2009-02-04 (public-i18n-core@w3.org from January to March 2009)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Wed, 4 Feb 2009 14:36:16 +0200
To: Andrew Cunningham <andrewc@vicnet.net.au>
Cc: Jonathan Kew <jonathan@jfkew.plus.com>, public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-Id: <A760E8BC-75CE-4EAC-A146-FC0D0ABEA7F7@iki.fi>
On Feb 4, 2009, at 00:44, Andrew Cunningham wrote:

> Henri Sivonen wrote:
>
>> The central reason for using NFC for interchange (i.e. what goes  
>> over HTTP) is that legacy software (including the text rendering  
>> code in legacy browsers) works better with NFC.
>>
> I'd be interested in knowing what you'd define as legacy browsers,  
> and which operating systems you have in mind when you mention it.

For the purpose of the quoted sentence, I'd define versions of the  
major browsers other than the most recent release version of each one  
as "legacy". (For the purpose of specifying new behaviors in specs,  
even the now-current versions should be considered legacy from the  
point of view of the potential future spec feature.)

> if a browser can't render combining diacritics, then it will not be  
> able to render NFC data when the NFC data uses combining diacritics.

Right. However, that does not make those browsers useless, because  
there are a lot of things that can be communicated with precomposed  
characters.

> So for a "legacy" browser when a document contains combining  
> diacritics it doesn't matter if the text is NFC or NFD, it will not  
> correctly render it.
>
> For legacy browsers, Unicode will always be a barrier regardless of  
> normalisation form.

Only for cases where the mapping from characters to graphemes is not  
one-to-one. In a lot of cases that have utility, the mapping is one-to- 
one.
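
As a rough illustration of that distinction, here is a minimal Python
sketch using the standard unicodedata module. The helper name
nfc_code_points is made up for this example. NFC yields a single code
point where a precomposed character exists and leaves a combining mark
in place where none does:

```python
import unicodedata

def nfc_code_points(text):
    """Return the code points of text after NFC normalization as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFC", text)]

# e + COMBINING ACUTE ACCENT: a precomposed character exists (U+00E9).
e_acute = nfc_code_points("e\u0301")

# o + COMBINING DOT BELOW + COMBINING ACUTE ACCENT: only o-with-dot-below
# is precomposed (U+1ECD), so the acute stays a combining mark even in NFC.
o_dot_acute = nfc_code_points("o\u0323\u0301")

print(e_acute)      # ['U+00E9']
print(o_dot_acute)  # ['U+1ECD', 'U+0301']
```

The first string is the one-to-one case; the second is the kind of
text that needs combining-mark support regardless of normalization
form.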

>>>> If one is only concerned with addressing the issue for conforming
>>>> content or interested in making problems detectable by authors, I
>>>> think it makes sense to stipulate as an authoring requirement that
>>>> both the
>>>> unparsed source text and the parsed identifiers be in NFC and make
>>>> validators check this (but not make non-validator consumers do
>>>> anything about it).
>>>
>>> Until UTN 11 v 3 is published I wouldn't normalise text in the  
>>> Myanmar
>>> script.
>>
>> A situation where normalization would break text seems like a  
>> pretty big defect somewhere. Could you please elaborate?
>>
> There are discrepancies between canonical ordering in normalisation  
> for some Myanmar characters compared to the data storage order  
> recommended in UTN11. Current Unicode 5.1 fonts for the Myanmar  
> block are based on UTN11. I believe Martin H is working on a draft of  
> version 3 of UTN11 (esp. since UTN11 was Burmese centric and also  
> needs to address a range of issues with ethnic minority languages,  
> Pali and Sanskrit). Very few if any web sites actually normalise  
> content, Wikipedia and the Mediawiki platform being one of those.  
> From memory the problem came to light when trying to work out  
> rendering problems in the Burmese version of Wikipedia. Haven't  
> followed the discussion in any detail and have only had second hand  
> reports on the meetings in Yangon last year.

So this is not a fundamental objection to normalization but a bug in  
coordinating between UTNs and the bug is expected to be transient?

>>> In a number of African languages it is useful to work with NFD data,
>>
>> Even if it is useful to perform in-RAM editing operations on NFD in  
>> a text editor, it doesn't follow that NFD should be used for  
>> interchange.
>>
> except where it is useful to process NFD data in a client side script.

That's indeed an interesting case. Considering the subject matter of  
this thread, it isn't an argument *for* browsers normalizing without  
the author asking them to but *against* browsers normalizing in CSS/ 
HTML/JS parsers without the author asking them to.

>>> esp if you also want to comply with certain AAA checkpoints in  
>>> WCAG 2.0.
>>
>> Hold on. What WCAG 2.0 checkpoints require content *not* to be in  
>> NFC? If that's the case, there is a pretty serious defect *somewhere*.
>>
> As far as I know WCAG 2.0 is normalisation form agnostic, it doesn't  
> require any particular normalisation form. But there is stuff  
> about guidance for pronunciation, and for tonal African languages  
> this means dealing with tone marking (where in day to day usage it  
> isn't included) - partly for language learners, students and in some  
> cases to aid in disambiguating ideas or words. It could be handled at  
> the server end or at the client end. To handle at the client end,  
> easier to use NFD data, and for languages like Igbo, etc run simple  
> regex to toggle between tonal versions and standard versions.

I see. This doesn't mean that serving content in NFD is *required*,  
only that one implementation strategy for a case that is unusual on a  
global scale becomes *easier* if the DOM data is in NFD.
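
For concreteness, the regex approach described above might look
roughly like the following sketch. It is written in Python rather than
a client-side script purely for illustration. The function name and
the choice of grave, acute and macron as "the tone marks" are
assumptions of this example, not anything stated in the thread:

```python
import re
import unicodedata

# Assumption for this sketch: tone is written with COMBINING GRAVE (U+0300),
# COMBINING ACUTE (U+0301) and COMBINING MACRON (U+0304). The dot below
# (U+0323) is part of the everyday orthography and must be kept.
TONE_MARKS = re.compile("[\u0300\u0301\u0304]")

def strip_tone_marks(text):
    """Return text without tone marks, recomposed to NFC."""
    # In NFD every tone mark is a separate code point, so a plain
    # character-class regex can remove it.
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", TONE_MARKS.sub("", decomposed))

# o-with-dot-below plus acute loses only the acute.
print(strip_tone_marks("\u1ecd\u0301") == "\u1ecd")  # True
```

Note that the sketch converts to NFD itself, so it works the same
whether the input arrives in NFC or NFD. Serving NFD only saves the
first normalize call.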

>> I can see how the editing buffer in RAM would need to be in a form  
>> other than NFC and perhaps in UTF-16 or UTF-32, but why is it  
>> desirable to write something other than NFC-normalized UTF-8 to  
>> persistent storage or to a network socket?
>>
> So you are suggesting all files should use NFC when transmitted to  
> the browser, and at the client end convert to NFD when needed to be  
> processed in that form?

Yes, that's what I was suggesting. Your client side scripting point  
is, however, somewhat persuasive against that suggestion. Still, I  
think your client side scripting point supports the suggestion I had  
for how browsers should deal with this (i.e. not normalizing on the  
HTML/XML/DOM/JS/CSS layers and only performing (or delegating to  
system) normalization-related operations when deciding which glyphs to  
*render* and when cooking user text input for placing it into the DOM).

On Feb 4, 2009, at 00:57, Andrew Cunningham wrote:

> Henri Sivonen wrote:
>
>> Surely a MARC21-based system should then use this data  
>> representation internally but not leak it to a Web UI for the system.
>>
> I agree whole heartedly
>
> But the reality is very few ILMSs normalise


This line of argument is very similar to the argument that screen  
readers are updated so rarely that everything around them has to  
accommodate their quirks. It leads to a situation where browsers in  
particular and Web client software in general pile on heaps and heaps  
of complexity in order to deal with all the bugs/quirks/peculiarities  
with all the other software in the ecosystem whose developers just  
don't bother.

(FWIW, my main interest in this thread is avoiding a situation where  
I, as an HTML parser implementor, would one day find a requirement to  
perform normalization-sensitive operations in an HTML parser.)

On Feb 4, 2009, at 01:09, Andrew Cunningham wrote:

> Henri Sivonen wrote:
>
>> On Feb 2, 2009, at 21:02, Phillips, Addison wrote:
>>
>>> Because browsers are NOT the primary creator of the content. Early  
>>> uniform normalization refers to every process that creates an XML/ 
>>> HTML/CSS/etc etc. document. The browser reads those documents and  
>>> must still deal with normalization issues.
>>
>>
>> To me, it seems unreasonable to introduce serious performance- 
>> sensitive complexity into Web content consumers to address the case  
>> that a Web developer fails to supply HTML, CSS and JS in a  
>> *consistent* form in terms of combining characters. (I think even  
>> normalization in the HTML parser post-entity expansion would be  
>> undesirable.) How big a problem is it in practice that an author  
>> fails to be self-consistent when writing class names to .html and  
>> when writing them to .css or .js?
>>
> You are assuming that only one developer is working on the project.
>
> When there is a team of developers it gets more murky for some  
> languages. The operating systems being used and the keyboard layouts  
> they are using will affect the sequence of codepoints being  
> generated. The reality is that for some languages there are multiple  
> layouts and input mechanisms. For a language like Vietnamese most  
> input systems fall into one of two categories, NFC or the Microsoft  
> format. Although if you really want to, some of those input tools  
> could give you NFD if you wanted. Last time I tested a range of  
> Yoruba keyboard layouts and input mechanisms, I got at least four  
> different approaches to codepoint generation. There is no guarantee  
> that everyone on the team is using the same input software or  
> operating system.

I see. Still, I think it's more reasonable that teams whose multipart  
graphemes don't have an obvious order for the subparts of the grapheme  
bear the cost of dealing with this complex feature of their writing  
system and that, for the sake of performance, every browser, XML  
parser, etc. around the world on all kinds of devices doesn't burn  
cycles (time/electricity/CO₂) just *in case* there happens to be a  
string compare where combining characters might have been  
inconsistently ordered.
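
To make the trade-off concrete, here is a minimal Python sketch of the
two kinds of string compare. The function names are made up for this
example. It assumes the "Microsoft format" mentioned above produces a
precomposed vowel followed by a combining tone mark, which is my
reading of the thread rather than something it spells out:

```python
import unicodedata

def code_point_equal(a, b):
    """Cheap comparison: equal only if the code point sequences are identical."""
    return a == b

def canonically_equal(a, b):
    """Costlier comparison: normalize both sides to NFC before comparing."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The same Vietnamese letter typed two ways:
precomposed = "\u1ec7"     # fully precomposed (NFC)
mixed = "\u00ea\u0323"     # e-circumflex followed by COMBINING DOT BELOW

print(code_point_equal(precomposed, mixed))   # False
print(canonically_equal(precomposed, mixed))  # True
```

The second function is the work a selector matcher or parser would
have to do on every identifier compare if consumers were required to
be normalization-sensitive.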

Before anyone accuses me of “Western bias”, I'd like to point out  
that there is precedent for preferring (in my opinion quite  
reasonably) a faster kind of Unicode string equality relation over one  
that could be claimed to make more sense to users of Western  
languages: XML deliberately uses code point for code point string  
comparison for start/end tag names instead of Unicode-wise correct  
case-insensitive comparison for performance reasons (and to avoid  
having to pick a collation locale). (I'm assuming here that we can  
agree that bicameral scripts are big in the “West” and case- 
insensitive equality makes some intuitive sense for the users of  
bicameral scripts.)
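
The same contrast, sketched in Python with made-up function names.
The second function uses Python's locale-independent str.casefold(),
which is only one possible definition of case-insensitive equality and
is not something XML specifies:

```python
def xml_names_match(start_name, end_name):
    """Code point for code point comparison, as XML uses for tag names."""
    return start_name == end_name

def case_insensitive_match(a, b):
    """Locale-independent caseless comparison via Unicode case folding."""
    return a.casefold() == b.casefold()

print(xml_names_match("Title", "title"))         # False
print(case_insensitive_match("Title", "title"))  # True
```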

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Wednesday, 4 February 2009 12:37:00 UTC