- From: Robert J Burns <rob@robburns.com>
- Date: Tue, 3 Feb 2009 04:04:07 -0600
- To: public-i18n-core@w3.org, jonathan@jfkew.plus.com
- Cc: W3C Style List <www-style@w3.org>
- Message-Id: <34A6FB79-87DB-4F39-99CC-587F4799FB41@robburns.com>
Hi Henri,

On Mon, February 2, 2009 11:18 pm, Henri Sivonen wrote:
> On Feb 2, 2009, at 14:54, Andrew Cunningham wrote:
>>> I think the right place to do normalization for Web formats is in
>>> the text editor used to write the code, and the normalization form
>>> should be NFC.
>>
>> Normalisation form should be what is most appropriate for the task
>> at hand. There are reasons for using NFC, there are reasons for
>> using NFD.
>
> The central reason for using NFC for interchange (i.e. what goes over
> HTTP) is that legacy software (including the text rendering code in
> legacy browsers) works better with NFC.
>
> If a given piece of software has a reason to perform operations on
> NFD internally, in the Web context, the burden is on that piece of
> software to normalize to NFD on input and to NFC on output. Just like
> if a piece of software prefers UTF-32 in RAM, it still should do its
> IO in UTF-8.

The problem with this is that there would have to be a prior agreement
so that a Unicode processing application could count on everything it
receives already being NFC, and that's simply not the case. If a
Unicode UA is incapable of processing NFD (which also implies it
cannot process NFC text containing combining characters), then it
would be up to that application to convert internally to something it
could handle (just what conversion that would be, I don't know).

>> although if normalisation is done at the editing level, then the
>> basic skills and knowledge required for a web developer need to be
>> more sophisticated than presently available.
>
> If the Web developer writes HTML, CSS and JS in an editor that is
> consistent in the normalization of its output and the author doesn't
> poke pathological corner cases like starting an HTML or XML text node
> with a combining solidus, what sophistication does the Web developer
> need and why?
>
>>> If one is only concerned with addressing the issue for conforming
>>> content or interested in making problems detectable by authors, I
>>> think it makes sense to stipulate as an authoring requirement that
>>> both the unparsed source text and the parsed identifiers be in NFC
>>> and make validators check this (but not make non-validator
>>> consumers do anything about it).
>>
>> Until UTN 11 v 3 is published I wouldn't normalise text in the
>> Myanmar script.
>
> A situation where normalization would break text seems like a pretty
> big defect somewhere. Could you please elaborate?
>
>> In a number of African languages it is useful to work with NFD data,
>
> Even if it is useful to perform in-RAM editing operations on NFD in a
> text editor, it doesn't follow that NFD should be used for
> interchange.

I think you're making many incorrect assumptions about the superiority
of NFC. NFC brings few processing simplifications, because NFC does
not eliminate combining marks.
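To make that last point concrete, here is a minimal Python sketch
(only the standard unicodedata module; the sample strings are my own,
purely illustrative) showing that combining marks survive NFC whenever
Unicode defines no precomposed character for the sequence:

    import unicodedata

    # "q" + U+0323 COMBINING DOT BELOW: Unicode defines no precomposed
    # "q with dot below", so the combining mark survives NFC.
    s = "q\u0323"
    nfc = unicodedata.normalize("NFC", s)
    print(len(nfc))                                    # 2: base + combining mark
    print(any(unicodedata.combining(c) for c in nfc))  # True

    # By contrast, "e" + U+0301 COMBINING ACUTE ACCENT does compose:
    print(unicodedata.normalize("NFC", "e\u0301"))     # 'é', single code point U+00E9

So an NFC-only consumer still has to handle combining characters; it
just sees fewer of them.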
> [snip]
>
>> Normalisation is critical to web content in a number of languages,
>> not just the CSS or HTML markup, but the content as well. And some
>> content, and some tools benefit from NFC, some from NFD. I believe
>> that normalisation should be supported, but forcing it to only one
>> normalisation form isn't optimal. Excluding normalisation also isn't
>> optimal.
>
> This assertion bears strong resemblance to arguments that some
> languages benefit from UTF-8 while others benefit from UTF-16 and,
> therefore, both should be supported for interchange. Yet, UTF-8 and
> UTF-16 are able to represent the same content, and empirically UTF-16
> is not actually a notable win for markup documents in the languages
> that allegedly benefit from UTF-16 (because there's so much markup
> and the markup uses the Basic Latin range). For practical purposes,
> it would make sense to use UTF-8 for interchange and put the onus of
> dealing with UTF-16 in RAM on those tools that want to do it.

Again, you're making assumptions that simply don't hold water. For
documents in languages where UTF-8 requires three octets per code
point, there can certainly be enough one-octet Latin-script markup to
offset the three-octet natural-language element content (averaging out
to the two octets per code point of UTF-16), but such documents are
likely rare. Whatever the statistical distribution of document makeup,
there are certainly documents in which more than half of the
characters are non-Latin-script content requiring three octets in
UTF-8, and those documents are leaner in UTF-16. Since UAs need to
support UTF-16 anyway, we're not saving any implementation headaches
by discouraging UTF-16 for such documents. And for non-HTML documents
even the markup might consist of characters needing three octets in
UTF-8. (For Cuneiform documents, on the other hand, we're talking four
octets per character in UTF-8 and four in UTF-16 as well, via
surrogate pairs, so outside the BMP the two encodings come out even.)
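To put rough numbers on the octet counts, here is a small Python
sketch; the sample strings are of my own choosing and purely
illustrative, since real documents mix markup and content:

    # Encoded sizes for short strings in different scripts.
    samples = {
        "Latin markup":          'class="example"',
        "Greek (U+0370-)":       "\u03ba\u03cc\u03c3\u03bc\u03b5",
        "Chinese (U+4E00-)":     "\u4e2d\u6587\u7f51\u9875\u5185\u5bb9",
        "Cuneiform (U+12000-)":  "\U00012000\U00012001\U00012002",
    }

    for label, text in samples.items():
        u8 = len(text.encode("utf-8"))
        u16 = len(text.encode("utf-16-le"))  # -le variant: no BOM octets
        print(f"{label:22} UTF-8: {u8:3} octets   UTF-16: {u16:3} octets")

The Chinese sample comes out a third smaller in UTF-16, the Latin
markup doubles, and the Greek and Cuneiform samples tie: exactly the
trade-off at issue.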
> There's a strong backward compatibility reason to prefer NFC for
> interchange.

You keep saying that, but I cannot imagine what it could be. The only
thing NFC would do for backwards compatibility is allow buggy Unicode
implementations to mask their deficiencies by reducing the number of
combining characters they need to deal with.

> If a tool benefits from NFD, using NFD privately in RAM is fine, but
> leaking it to the Web seems like a bad idea. Leaking it to the Web
> *inconsistently* and asking consumers to gain complexity to deal with
> it seems unreasonable.

I don't see how this could be correct. Can you provide some reasons
why you think this?

>>> Validator.nu already does this for HTML5, so if someone writes a
>>> class name with a broken text editor (i.e. one that doesn't
>>> normalize keyboard input to NFC), the validator can be used to
>>> detect the problem.
>>
>> A text editor that doesn't normalise to NFC isn't broken. An ideal
>> text editor gives the user the choice on what normalisation form to
>> use.
>
> I can see how the editing buffer in RAM would need to be in a form
> other than NFC and perhaps in UTF-16 or UTF-32, but why is it
> desirable to write something other than NFC-normalized UTF-8 to
> persistent storage or to a network socket?

Any script beyond U+07FF (up to the end of the BMP) takes more octets
to encode in UTF-8 than in UTF-16. Regardless, implementations need to
support both encodings, and the savings from using UTF-8 for some
documents and UTF-16 for the others isn't worth the trouble. A Chinese
website, for example, can probably count on most of its documents
being encoded more efficiently as UTF-16 than as UTF-8, and the rare
exceptions aren't worth looking out for.

As for normalization forms, I said in an earlier message that it would
be best to recommend encoding in NFC and to avoid compatibility
characters (not just canonical equivalents, but in most cases
compatibility equivalents too) in markup and content, perhaps even
prohibiting them in markup. However, it is not something
implementations can count on anyway. So Unicode implementations need
to support both forms and be able to compare canonically equivalent
strings correctly regardless of normalization form. Likewise, Unicode
implementations need to handle combining characters even if they first
convert everything to NFC, so NFC doesn't save them from
combining-character handling either.

Anne van Kesteren wrote:
> I never pointed to XML 1.1. I did point out that the above section
> was non-normative and for some reason had a normative reference to
> Unicode Normalization, which seems like a bug. I don't really care
> whether it's a bad idea or not, it would be a bug in our software if
> we normalized on input unless XML was somehow changed.

Unicode is cited normatively as the text model for XML, so it follows
that comparing two canonically equivalent strings and treating them as
distinct goes against the norms of Unicode and therefore of XML.

Earlier, Martin Duerst cited conformance clause (C6) [1]:
> o The implications of this conformance clause are twofold. First, a
> process is never required to give different interpretations to two
> different, but canonical-equivalent character sequences. Second, no
> process can assume that another process will make a distinction
> between two different, but canonical-equivalent character sequences.
> o Ideally, an implementation would always interpret two
> canonical-equivalent character sequences identically. There are
> practical circumstances under which implementations may reasonably
> distinguish them.

While this leaves open the possibility of a UA needing to accommodate
author input errors, it's not clear to me why you would interpret XML
as one of the applications needing that leeway. This is something
HTML5 should also be considering, since the parsing algorithm is
probably where this NFC normalization belongs, and the two W3C
recommendations most involved here are XML and HTML5.

Earlier you raised font matching as a possible issue, but a font whose
glyphs aren't assigned to both canonical equivalents is a font with
bugs in it. So performing canonical-equivalence normalization fixes
font issues (especially since such a buggy font could otherwise
reinforce misuse of canonically equivalent Unicode characters).
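Since this keeps coming back to comparison behavior, here is a minimal
Python sketch of the form-insensitive comparison (C6) asks of
implementations; the function name is my own, and NFD would serve
equally well as the pivot form:

    import unicodedata

    def canonically_equal(a: str, b: str) -> bool:
        """Compare two strings under canonical equivalence: normalize
        both to one form (NFC here; NFD works just as well), then
        compare code point by code point."""
        return (unicodedata.normalize("NFC", a)
                == unicodedata.normalize("NFC", b))

    # U+00E9 (precomposed é) vs "e" + U+0301 (combining acute):
    # distinct code point sequences, yet canonically equivalent.
    assert "\u00e9" != "e\u0301"
    assert canonically_equal("\u00e9", "e\u0301")

An implementation that compares identifiers this way treats authors'
input uniformly no matter which form their editor emitted.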
Take care,
Rob

[1]: <http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0071.html>

Received on Tuesday, 3 February 2009 10:04:49 UTC