Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization

On Feb 2, 2009, at 14:54, Andrew Cunningham wrote:

> On Mon, February 2, 2009 11:18 pm, Henri Sivonen wrote:
>> I think the right place to do normalization for Web formats is in the
>> text editor used to write the code, and the normalization form should
>> be NFC.
>>
>
> The normalisation form should be whatever is most appropriate for the
> task at hand. There are reasons for using NFC, and there are reasons
> for using NFD.

The central reason for using NFC for interchange (i.e. what goes over  
HTTP) is that legacy software (including the text rendering code in  
legacy browsers) works better with NFC.

If a given piece of software has a reason to perform operations on NFD
internally, in the Web context, the burden is on that piece of software
to normalize to NFD on input and to NFC on output, just as a piece of
software that prefers UTF-32 in RAM should still do its I/O in UTF-8.
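
A minimal sketch of that boundary (Python here purely for illustration;
the function names are mine, not from any particular tool):

    import unicodedata

    def read_text(raw: bytes) -> str:
        # Decode the interchange form (UTF-8) and move to the form the
        # program prefers to operate on internally (NFD in this example).
        return unicodedata.normalize("NFD", raw.decode("utf-8"))

    def write_text(text: str) -> bytes:
        # Go back to NFC and UTF-8 at the output boundary, so only the
        # legacy-friendly form reaches the file or the HTTP response.
        return unicodedata.normalize("NFC", text).encode("utf-8")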

> although if normalisation is done at the editing level, then the basic
> skills and knowledge required of a web developer need to be more
> sophisticated than they presently are.

If the Web developer writes HTML, CSS and JS in an editor that is  
consistent in the normalization of its output and the author doesn't  
poke pathological corner cases like starting an HTML or XML text node  
with a combining solidus, what sophistication does the Web developer  
need and why?

>> If one is only concerned with addressing the issue for conforming
>> content or interested in making problems detectable by authors, I
>> think it makes sense to stipulate as an authoring requirement that both the
>> unparsed source text and the parsed identifiers be in NFC and make
>> validators check this (but not make non-validator consumers do
>> anything about it).
>
> Until UTN 11 v 3 is published I wouldn't normalise text in the Myanmar
> script.

A situation where normalization would break text seems like a pretty  
big defect somewhere. Could you please elaborate?

> In a number of African languages it is useful to work with NFD data,

Even if it is useful to perform in-RAM editing operations on NFD in a
text editor, it doesn't follow that NFD should be used for interchange.

> especially if you also want to comply with certain AAA checkpoints in
> WCAG 2.0.

Hold on. What WCAG 2.0 checkpoints require content *not* to be in NFC?  
If that's the case, there's a pretty serious defect *somewhere*.

In any case, WCAG 2.0 deals with the content perceived by human users.  
It doesn't deal with the internal identifiers of the technologies used  
to deliver the content, so WCAG 2.0 isn't relevant to how Selectors or  
the DOM deal with identifier equality.
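
To make the identifier point concrete: absent normalization, the
comparison is code point by code point, so two visually identical class
names in different normalization forms simply don't match. A quick
illustration (Python again, only to show the string comparison):

    import unicodedata

    composed   = "caf\u00E9"    # "café" with precomposed U+00E9
    decomposed = "cafe\u0301"   # "café" as "e" + COMBINING ACUTE ACCENT

    print(composed == decomposed)                    # False
    print(unicodedata.normalize("NFC", composed) ==
          unicodedata.normalize("NFC", decomposed))  # True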

> Normalisation is critical to web content in a number of languages, not
> just the CSS or HTML markup, but the content as well. And some content
> and some tools benefit from NFC, some from NFD. I believe that
> normalisation should be supported, but forcing it to only one
> normalisation form isn't optimal. Excluding normalisation also isn't
> optimal.

This assertion bears a strong resemblance to arguments that some
languages benefit from UTF-8 while others benefit from UTF-16 and,  
therefore, both should be supported for interchange. Yet, UTF-8 and  
UTF-16 are able to represent the same content and empirically UTF-16  
is not actually a notable win for markup documents in the languages  
that allegedly benefit from UTF-16 (because there's so much markup and  
the markup uses the Basic Latin range). For practical purposes, it
would make sense to use UTF-8 for interchange and put the onus of
dealing with UTF-16 in RAM on those tools that want to do it.
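
As a toy illustration of why the markup density matters (the snippet is
mine, and the byte counts hold only for this one string):

    markup = '<p class="intro">半角カナは使わない</p>'

    print(len(markup.encode("utf-8")))      # 48 bytes
    print(len(markup.encode("utf-16-le")))  # 60 bytes

Even with every character of the element content outside Basic Latin,
the tag and attribute syntax alone keeps the UTF-8 serialization the
smaller of the two.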

There's a strong backward compatibility reason to prefer NFC for  
interchange. If a tool benefits from NFD, using NFD privately in RAM  
is fine, but leaking it to the Web seems like a bad idea. Leaking it
to the Web *inconsistently* and asking consumers to take on extra
complexity to deal with it seems unreasonable.

>> Validator.nu already does this for HTML5, so if
>> someone writes a class name with a broken text editor (i.e. one that
>> doesn't normalize keyboard input to NFC), the validator can be used  
>> to
>> detect the problem.
>
> A text editor that doesn't normalise to NFC isn't broken. An ideal text
> editor gives the user the choice of which normalisation form to use.

I can see how the editing buffer in RAM would need to be in a form  
other than NFC and perhaps in UTF-16 or UTF-32, but why is it  
desirable to write something other than NFC-normalized UTF-8 to  
persistent storage or to a network socket?
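
For what it's worth, the check a validator (or an editor's save path)
needs for this is tiny; a sketch of the idea (not Validator.nu's actual
code, and the function name is made up):

    import unicodedata

    def is_nfc(text: str) -> bool:
        # Content that is already in NFC normalizes to itself, so this
        # is the whole check a tool needs before warning the author.
        return text == unicodedata.normalize("NFC", text)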

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Tuesday, 3 February 2009 08:42:47 UTC