Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization

Hi Henri,

Henri Sivonen wrote:

>
> The central reason for using NFC for interchange (i.e. what goes over 
> HTTP) is that legacy software (including the text rendering code in 
> legacy browsers) works better with NFC.
>
I'd be interested in knowing what you'd define as legacy browsers, and 
which operating systems you have in mind when you mention them.

If a browser can't render combining diacritics, then it will not be able 
to render NFC data when that NFC data uses combining diacritics. So for a 
"legacy" browser, when a document contains combining diacritics it 
doesn't matter whether the text is NFC or NFD; it will not render 
correctly either way.

For legacy browsers, Unicode will always be a barrier regardless of 
normalisation form.
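To make that concrete, here is a rough sketch (assuming a JavaScript 
engine with String.prototype.normalize(), which the legacy browsers 
under discussion obviously don't have): a Yoruba-style letter such as 
e with dot below and acute has no fully precomposed code point, so 
even its NFC form keeps a combining mark that the renderer has to stack.

    // NFC composes e + dot below into U+1EB9, but there is no precomposed
    // character for e with dot below *and* acute, so a combining acute remains.
    var decomposed = "e\u0323\u0301";        // e + combining dot below + combining acute
    var nfc = decomposed.normalize("NFC");   // "\u1EB9\u0301"
    console.log(nfc.length);                 // 2 - still one combining mark to render
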
> If a given piece of software has a reason to perform operations on NFD 
> internally, in the Web context, the burden is on that piece of 
> software to normalize to NFD on input and to NFC on output. Just like 
> if a piece of software prefers UTF-32 in RAM, it still should do its 
> IO in UTF-8.
>
>> although if normalisation is done at the editing level, then the basic
>> skills and knowledge required for a web developer need to be more
>> sophisticated than presently available.
>
> If the Web developer writes HTML, CSS and JS in an editor that is 
> consistent in the normalization of its output and the author doesn't 
> poke pathological corner cases like starting an HTML or XML text node 
> with a combining solidus, what sophistication does the Web developer 
> need and why?
>
Beyond normalisation?  I can think of lots of things that I expect the 
web developers I work with to know.

I don't expect all editing tools to normalise, and if they do, I'd 
expect the editing tool to give me a choice of normalisation forms.

In terms of normalisation, I'd expect a web developer to know what his 
editing tools do: whether they normalise or not and, if they do, what 
form they use; when a task requires NFC (many cases) and when a task 
could use NFD (some cases if client-side scripting is used, no cases if 
everything uses server-side scripting).

>>> If one is only concerned with addressing the issue for conforming
>>> content or interested in making problems detectable by authors, I
>>> think it makes sense to stipulate as an authoring requirement that both the
>>> unparsed source text and the parsed identifiers be in NFC and make
>>> validators check this (but not make non-validator consumers do
>>> anything about it).
>>
>> Until UTN 11 v 3 is published I wouldn't normalise text in the Myanmar
>> script.
>
> A situation where normalization would break text seems like a pretty 
> big defect somewhere. Could you please elaborate?
>
There are discrepancies between the canonical ordering used in 
normalisation for some Myanmar characters and the data storage order 
recommended in UTN11. Current Unicode 5.1 fonts for the Myanmar block 
are based on UTN11. I believe Martin H is working on a draft of version 
3 of UTN11 (esp. since UTN11 was Burmese-centric and also needs to 
address a range of issues with ethnic minority languages, Pali and 
Sanskrit). Very few if any web sites actually normalise content, 
Wikipedia and the MediaWiki platform being among the few that do. From 
memory, the problem came to light when trying to work out rendering 
problems in the Burmese version of Wikipedia. I haven't followed the 
discussion in any detail and have only had second-hand reports on the 
meetings in Yangon last year.
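For anyone who wants to check whether their own content is affected, a 
minimal sketch (assuming String.prototype.normalize(); this is not the 
MediaWiki code) is simply to compare a string with its NFC form:

    // Returns true if normalising to NFC would change the stored code
    // point sequence - the kind of check that surfaces ordering
    // discrepancies like the Myanmar one.
    function changedByNFC(text) {
        return text.normalize("NFC") !== text;
    }
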
>> In a number of African languages it is useful to work with NFD data,
>
> Even if it is useful to perform in-RAM editing operations on NFD in a 
> text editor, it doesn't follow that NFD should be used for interchange.
>
except where it is useful to process NFD data in a client-side script.
>> esp if you also want to comply with certain AAA checkpoints in WCAG 2.0.
>
> Hold on. What WCAG 2.0 checkpoints require content *not* to be in NFC? 
> If that's the case, there's a pretty serious defect *somewhere*.
>
As far as I know WCAG 2.0 is normalisation-form agnostic; it doesn't 
require any particular normalisation form. But there is guidance about 
pronunciation, and for tonal African languages this means dealing with 
tone marking (which isn't included in day-to-day usage) - partly for 
language learners and students, and in some cases to aid in 
disambiguating ideas or words. It could be handled at the server end or 
at the client end. To handle it at the client end, it is easier to use 
NFD data and, for languages like Igbo, etc., to run a simple regex to 
toggle between tonal and standard versions (see the sketch below).
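A rough sketch of what I mean, assuming NFD input and an engine with 
String.prototype.normalize() (the function name and sample string are 
mine, purely for illustration):

    // Strip combining grave (U+0300) and acute (U+0301) tone marks from
    // NFD text. Dot-below vowels keep their U+0323, so re-normalising to
    // NFC afterwards gives the everyday untoned spelling for display.
    function withoutToneMarks(nfdText) {
        return nfdText.replace(/[\u0300\u0301]/g, "");
    }

    var toneMarked = "a\u0301ka\u0300 \u1ECB\u0300gb\u00F2".normalize("NFD"); // illustrative only
    console.log(withoutToneMarks(toneMarked).normalize("NFC")); // tone marks gone, dot-below kept
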
> In any case, WCAG 2.0 deals with the content perceived by human users. 
> It doesn't deal with the internal identifiers of the technologies used 
> to deliver the content, so WCAG 2.0 isn't relevant to how Selectors or 
> the DOM deal with identifier equality.
>
Yes, I agree, but those comments were not about selectors. They were 
about the use of NFD in content intended to be read, as distinct from 
markup or selectors.

>
> I can see how the editing buffer in RAM would need to be in a form 
> other than NFC and perhaps in UTF-16 or UTF-32, but why is it 
> desirable to write something other than NFC-normalized UTF-8 to 
> persistent storage or to a network socket?
>
So you are suggesting that all files should use NFC when transmitted to 
the browser, and that at the client end they be converted to NFD when 
they need to be processed in that form?
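If so, the client-end flow would presumably look something like this 
rough sketch (assuming String.prototype.normalize(); I'm not claiming 
this is what you have in mind):

    // Content arrives as NFC; a script that wants decomposed data
    // converts locally, works on it, and converts back to NFC before
    // writing anything out.
    var received = document.body.textContent;   // NFC as transmitted
    var working = received.normalize("NFD");    // decomposed copy for processing
    // ... process `working`, e.g. with the tone-mark regex above ...
    var output = working.normalize("NFC");      // back to NFC for output
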

-- 
Andrew Cunningham
Senior Manager, Research and Development
Vicnet
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000

Ph: +61-3-8664-7430
Fax: +61-3-9639-2175

Email: andrewc@vicnet.net.au
Alt email: lang.support@gmail.com

http://home.vicnet.net.au/~andrewc/
http://www.openroad.net.au
http://www.vicnet.net.au
http://www.slv.vic.gov.au

Received on Tuesday, 3 February 2009 22:45:38 UTC