- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Tue, 3 Feb 2009 10:42:00 +0200
- To: Andrew Cunningham <andrewc@vicnet.net.au>
- Cc: "Jonathan Kew" <jonathan@jfkew.plus.com>, public-i18n-core@w3.org, "W3C Style List" <www-style@w3.org>
On Feb 2, 2009, at 14:54, Andrew Cunningham wrote:

> On Mon, February 2, 2009 11:18 pm, Henri Sivonen wrote:
>> I think the right place to do normalization for Web formats is in the
>> text editor used to write the code, and the normalization form should
>> be NFC.
>
> Normalisation form should be what is most appropriate for the task at
> hand. There are reasons for using NFC, there are reasons for using NFD.

The central reason for using NFC for interchange (i.e. what goes over
HTTP) is that legacy software (including the text rendering code in
legacy browsers) works better with NFC.

If a given piece of software has a reason to perform operations on NFD
internally, then in the Web context the burden is on that piece of
software to normalize to NFD on input and back to NFC on output. Just
like if a piece of software prefers UTF-32 in RAM, it should still do
its IO in UTF-8.

> although if normalisation is done at the editing level, then the basic
> skills and knowledge required for a web developer need to be more
> sophisticated than presently available.

If the Web developer writes HTML, CSS and JS in an editor that is
consistent in the normalization of its output, and the author doesn't
poke pathological corner cases like starting an HTML or XML text node
with a combining solidus, what sophistication does the Web developer
need and why?

>> If one is only concerned with addressing the issue for conforming
>> content or interested in making problems detectable by authors, I
>> think it makes sense to stipulate as an authoring requirement that
>> both the unparsed source text and the parsed identifiers be in NFC
>> and make validators check this (but not make non-validator consumers
>> do anything about it).
>
> Until UTN 11 v 3 is published I wouldn't normalise text in the Myanmar
> script.

A situation where normalization would break text seems like a pretty
big defect somewhere. Could you please elaborate?

> In a number of African languages it is useful to work with NFD data,

Even if it is useful to perform in-RAM editing operations on NFD in a
text editor, it doesn't follow that NFD should be used for interchange.

> esp if you also want to comply with certain AAA checkpoints in WCAG
> 2.0.

Hold on. What WCAG 2.0 checkpoints require content *not* to be in NFC?
If that's the case, there's a pretty serious defect *somewhere*.

In any case, WCAG 2.0 deals with the content perceived by human users.
It doesn't deal with the internal identifiers of the technologies used
to deliver the content, so WCAG 2.0 isn't relevant to how Selectors or
the DOM deal with identifier equality.

> Normalisation is critical to web content in a number of languages, not
> just the CSS or HTML markup, but the content as well. And some content
> and some tools benefit from NFC, some from NFD. I believe that
> normalisation should be supported, but forcing it to only one
> normalisation form isn't optimal. Excluding normalisation also isn't
> optimal.

This assertion bears a strong resemblance to arguments that some
languages benefit from UTF-8 while others benefit from UTF-16 and that,
therefore, both should be supported for interchange. Yet UTF-8 and
UTF-16 are able to represent the same content, and empirically UTF-16
is not actually a notable win for markup documents in the languages
that allegedly benefit from UTF-16 (because there's so much markup and
the markup uses the Basic Latin range).
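To make the identifier-equality point concrete, here is a minimal
sketch in Python (the class name and the NFC/NFD split between the CSS
and the markup are made-up examples): the same name written
precomposed by an NFC-normalizing editor and decomposed by some other
tool does not compare equal code point for code point until both sides
are brought to one form.

    import unicodedata

    # Hypothetical class name "café": precomposed U+00E9 vs. decomposed
    # U+0065 U+0301.
    css_class = "caf\u00e9"    # as written by an editor that normalizes to NFC
    html_class = "cafe\u0301"  # as emitted by a tool that writes NFD

    print(css_class == html_class)  # False: plain code point comparison fails
    print(unicodedata.normalize("NFC", css_class) ==
          unicodedata.normalize("NFC", html_class))  # True once both are NFC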
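Likewise, a minimal sketch of the "NFD privately, NFC for interchange"
division of labour (hypothetical helper names, again in Python): the
tool can keep whatever form it likes in RAM as long as it normalizes at
its input and output boundaries.

    import unicodedata

    def read_for_editing(raw_bytes):
        # Decode the interchange form (UTF-8) and move to the tool's
        # preferred internal form (NFD here) at the input boundary.
        return unicodedata.normalize("NFD", raw_bytes.decode("utf-8"))

    def write_for_interchange(text):
        # Normalize back to NFC and encode as UTF-8 before anything
        # reaches persistent storage or a network socket.
        return unicodedata.normalize("NFC", text).encode("utf-8")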
For practical purposes, it would make sense to use UTF-8 for
interchange and to put the onus of dealing with UTF-16 in RAM on those
tools that want to do it.

There's a strong backward compatibility reason to prefer NFC for
interchange. If a tool benefits from NFD, using NFD privately in RAM is
fine, but leaking it to the Web seems like a bad idea. Leaking it to
the Web *inconsistently* and asking consumers to gain complexity to
deal with it seems unreasonable.

>> Validator.nu already does this for HTML5, so if someone writes a
>> class name with a broken text editor (i.e. one that doesn't normalize
>> keyboard input to NFC), the validator can be used to detect the
>> problem.
>
> A text editor that doesn't normalise to NFC isn't broken. An ideal
> text editor gives the user the choice on what normalisation form to
> use.

I can see how the editing buffer in RAM would need to be in a form
other than NFC and perhaps in UTF-16 or UTF-32, but why is it desirable
to write something other than NFC-normalized UTF-8 to persistent
storage or to a network socket?

--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Tuesday, 3 February 2009 08:42:47 UTC