Unicode Normalization

Hi Henri,

On Mon, February 2, 2009 11:18 pm, Henri Sivonen wrote:
> On Feb 2, 2009, at 14:54, Andrew Cunningham wrote:
> >> I think the right place to do normalization for Web formats is in the
> >> text editor used to write the code, and the normalization form should
> >> be NFC.
> >
> > Normalisation form should be what is most appropriate for the task at
> > hand. There are reasons for using NFC, there are reasons for using
> > NFD.
>
> The central reason for using NFC for interchange (i.e. what goes over
> HTTP) is that legacy software (including the text rendering code in
> legacy browsers) works better with NFC.
>
> If a given piece of software has a reason to perform operations on NFD
> internally, in the Web context, the burden is on that piece of
> software to normalize to NFD on input and to NFC on output. Just like
> if a piece of software prefers UTF-32 in RAM, it still should do its
> IO in UTF-8.

The problem with this is that there would have to be a prior agreement
so that a Unicode processing application could count on everything it
receives already being NFC, and that is simply not the case. If a
Unicode UA is incapable of processing NFD (which also implies it cannot
process the combining characters that remain even in NFC) then it is up
to that application to convert internally to something it can handle
(just what conversion it would do, I don't know).
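
To make that concrete, here is a minimal Python sketch (the helper name
is mine, purely for illustration) of a consumer that cannot count on
NFC input and so normalizes for itself on input, whatever internal form
it picks:

    import unicodedata

    def to_internal(text, form="NFC"):
        """Normalize incoming text to the form this application handles."""
        return unicodedata.normalize(form, text)

    # An e-acute arriving as NFC (one code point) and as NFD (e + combining
    # acute) must end up identical inside the application:
    assert to_internal("\u00e9") == to_internal("e\u0301")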

> > although if normalisation is done at the editing level, then the basic
> > skills and knowledge required for a web developer need to be more
> > sophisticated than presently available.
>
> If the Web developer writes HTML, CSS and JS in an editor that is
> consistent in the normalization of its output and the author doesn't
> poke pathological corner cases like starting an HTML or XML text node
> with a combining solidus, what sophistication does the Web developer
> need and why?
>
> >> If one is only concerned with addressing the issue for conforming
> >> content or interested in making problems detectable by authors, I
> >> think it makes sense to stipulate as an authoring requirement that
> >> both the unparsed source text and the parsed identifiers be in NFC
> >> and make validators check this (but not make non-validator consumers
> >> do anything about it).
> >
> > Until UTN 11 v 3 is published I wouldn't normalise text in the Myanmar
> > script.
>
> A situation where normalization would break text seems like a pretty
> big defect somewhere. Could you please elaborate?
>
> > In a number of African languages it is useful to work with NFD data,
>
> Even if it is useful to perform in-RAM editing operations on NFD in a
> text editor, it doesn't follow that NFD should be used for interchange.

I think you're making many incorrect assumptions about the superiority
of NFC. There are not many simplifications to be gained from processing
NFC, since NFC does not eliminate combining marks.
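
To illustrate: sequences with no precomposed form keep their combining
marks even after NFC normalization. A quick Python check (the sample
characters are my own, just illustrative):

    import unicodedata

    # "s" + COMBINING DOT BELOW has a precomposed form, so NFC composes it,
    # but "q" + COMBINING DOT BELOW has none, so NFC must keep the mark:
    print(len(unicodedata.normalize("NFC", "s\u0323")))  # 1 -> U+1E63
    print(len(unicodedata.normalize("NFC", "q\u0323")))  # 2 -> stays decomposed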

> [snip]
> > Normalisation is critical to web content in a number of languages, not
> > just the CSS or HTML markup, but the content as well. And some content
> > and some tools benefit from NFC, some from NFD. I believe that
> > normalisation should be supported, but forcing it to only one
> > normalisation form isn't optimal. Excluding normalisation also isn't
> > optimal.
>
> This assertion bears strong resemblance to arguments that some
> languages benefit from UTF-8 while others benefit from UTF-16 and,
> therefore, both should be supported for interchange. Yet, UTF-8 and
> UTF-16 are able to represent the same content and empirically UTF-16
> is not actually a notable win for markup documents in the languages
> that allegedly benefit from UTF-16 (because there's so much markup and
> the markup uses the Basic Latin range). For practical purposes, it
> would make sense to use UTF-8 for interchange and put the onus of
> dealing with UTF-16 in-RAM to those tools that want to do it.

Again, you're making assumptions that simply don't hold water. For
documents in languages where UTF-8 requires three octets per code
point, there can certainly be enough one-octet Latin-script markup to
offset the three-octet natural-language element content (averaging out
to the two octets per code point of UTF-16), but those would likely be
rare documents. Regardless of the statistical distribution of document
makeup, there are certainly documents of this type where more than half
of the characters are non-Latin-script content requiring three octets
in UTF-8, and such documents are therefore leaner in UTF-16. And since
UAs need to support UTF-16 anyway, we're not saving any implementation
headaches by discouraging UTF-16 for such documents. For non-HTML
documents even the markup might consist of three-octet UTF-8
characters. (Cuneiform, on the other hand, lies outside the BMP, so it
takes four octets per code point in UTF-8 and in UTF-16 alike; the
UTF-16 size advantage is confined to the three-octet BMP range.)
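
Here is a rough Python sketch of that octet arithmetic (the sample
strings are my own, purely illustrative):

    samples = {
        "Latin-script markup": '<p class="note">',
        "Devanagari content (3 octets each in UTF-8)":
            "\u0928\u092e\u0938\u094d\u0924\u0947",
        "Cuneiform (4 octets each in UTF-8 and in UTF-16)":
            "\U00012000\U00012001\U00012002",
    }
    for label, text in samples.items():
        print(label,
              len(text.encode("utf-8")), "octets in UTF-8 vs",
              len(text.encode("utf-16-le")), "in UTF-16")  # -le: no BOM counted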

> There's a strong backward compatibility reason to prefer NFC for
> interchange.

You keep saying that but I cannot imagine what it could be. The only
thing NFC would do for backwards compatibility is allow buggy Unicode
implementations to mask their deficiencies by reducing the number of
combining characters they needed to deal with.

> If a tool benefits from NFD, using NFD privately in RAM
> is fine, but leaking it to the Web seems like a bad idea. Leaking it
> to the Web *inconsistently* and asking consumers to gain complexity to
> deal with it seems unreasonable.

I don't see how this could be correct. Can you provide some reasons
why you think this?

> >> Validator.nu already does this for HTML5, so if
> >> someone writes a class name with a broken text editor (i.e. one that
> >> doesn't normalize keyboard input to NFC), the validator can be used
> >> to detect the problem.
> >
> > A text editor that doesn't normalise to NFC isn't broken. An ideal
> > text editor gives the user the choice of what normalisation form to
> > use.
>
> I can see how the editing buffer in RAM would need to be in a form
> other than NFC and perhaps in UTF-16 or UTF-32, but why is it
> desirable to write something other than NFC-normalized UTF-8 to
> persistent storage or to a network socket?

Any BMP character beyond U+07FF takes more octets to encode in UTF-8
than in UTF-16. Regardless, both encodings need to be supported by
implementations, and the savings from using UTF-8 for some documents
and UTF-16 for others aren't worth the trouble of choosing per
document. For example, a Chinese website can probably count on most of
its documents being encoded more efficiently as UTF-16 than as UTF-8,
and the rare exceptions aren't worth looking out for.

As for normalization forms, I said in an earlier message that it would
be best to recommend encoding in NFC and to avoid compatibility
characters (not just canonical equivalents but, in most cases,
compatibility equivalents as well) in markup and content, perhaps even
prohibiting them in markup. However, it is not something
implementations can count on anyway. So Unicode implementations need to
support both forms and be able to compare canonically equivalent
strings correctly regardless of normalization form. Unicode
implementations also need to be able to handle combining characters
even if they first convert everything to NFC, so conversion doesn't
save them from combining-character handling either.
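
For instance, a canonical-equivalence-aware comparison is just a
comparison over normalized forms, while compatibility characters are a
separate axis that NFC leaves untouched. A Python sketch (the helper
name is mine):

    import unicodedata

    def canonically_equal(a, b):
        """Compare strings per C6: canonical equivalents compare equal."""
        return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

    assert canonically_equal("\u00e9", "e\u0301")  # both spellings of e-acute
    assert unicodedata.normalize("NFC", "\ufb01") == "\ufb01"  # fi ligature kept
    assert unicodedata.normalize("NFKC", "\ufb01") == "fi"     # only NFKC folds it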

Anne van Kesteren wrote:
> I never pointed to XML 1.1. I did point out that the above section  
> was non-normative and for some reason had a normative reference to  
> Unicode Normalization, which seems like a bug. I don't really care  
> whether it's a bad idea or not, it would be a bug in our software if we
> normalized on input unless XML was somehow changed.

Unicode is cited normatively as the text model for XML, so it follows
that comparing two canonically equivalent strings and identifying them
as distinct goes against the norms of Unicode and therefore of XML.
Earlier, Martin Duerst cited (C6) [1]:

> o The implications of this conformance clause are twofold. First, a
> process is never required to give different interpretations to two
> different, but canonical-equivalent character sequences. Second, no
> process can assume that another process will make a distinction
> between two different, but canonical-equivalent character sequences.
> o Ideally, an implementation would always interpret two canonical-
> equivalent character sequences identically. There are practical
> circumstances under which implementations may reasonably distinguish
> them.

While this leaves open the possibility of a UA needing to accommodate
author input errors, it's not clear to me why you would interpret XML
as one of the applications needing that leeway. This is something HTML5
should also be considering, since the parsing algorithm is probably
where this NFC normalization belongs, and XML and HTML5 are the two W3C
recommendations most involved here.
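
If it helps, here is a purely hypothetical Python sketch of where such
a step could sit; nothing below reflects any actual HTML5 or XML
processor, and the names are invented:

    import unicodedata

    def normalize_then_tokenize(decoded_text, tokenize):
        # Normalizing the whole decoded input at once avoids the boundary
        # problem of a combining mark starting the next chunk.
        tokenize(unicodedata.normalize("NFC", decoded_text))

    # Stand-in "tokenizer" for demonstration:
    normalize_then_tokenize("e\u0301lan", lambda text: print(text == "\u00e9lan"))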

Earlier you raised font matching as a possible issue, but a font whose
glyphs aren't assigned to both canonical equivalents is a font with
bugs in it. So performing canonical-equivalence normalization fixes
font issues rather than causing them (especially since such a buggy
font could otherwise reinforce misuse of canonically equivalent
characters).

Take care,
Rob

[1]: <http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0071.html>
