Fwd: Unicode normalization issue summary from Arthur Barstow on 2009-02-15 (public-webapps@w3.org from January to March 2009)

From: Arthur Barstow <art.barstow@nokia.com>
Date: Sun, 15 Feb 2009 10:01:41 -0500
To: public-webapps <public-webapps@w3.org>
Message-Id: <E51D086A-D891-4996-BA7F-745750183EC6@nokia.com>
FYI. A summary of the Unicode Normalization issue by Addison Phillips.

Begin forwarded message:

> From: "ext Phillips, Addison" <addison@amazon.com>
> Date: February 12, 2009 6:47:14 PM EST
> Cc: "public-i18n-core@w3.org" <public-i18n-core@w3.org>
> Subject: normalization issue summary for HCG
> Archived-At: <http://www.w3.org/mid/4D25F22093241741BC1D0EEBC2DBB1DA0181C82C13@EX-SEA5-D.ant.amazon.com>
>
> All,
>
> Recently, the I18N Core WG raised an issue with Selectors-API and  
> consequently with CSS Selectors related to Unicode Normalization.  
> This note is an attempt to summarize the issue and its attendant  
> history at W3C, as well as to explain the current I18N Core  
> approach to this topic. I am bringing this issue to HTML-CG so that  
> all the affected WGs can be aware of the debate. I realize that  
> this email is very lengthy, but I think it is necessary to fully  
> summarize the issue. Some have wondered about what, exactly, "those  
> I18N guys want from us" and I think it is important to provide a  
> clear position.
>
> At present, I don't have a fixed WG position, since the I18N WG is  
> carefully reviewing its stance. However, this is an issue of  
> sufficient importance and potential impact that we felt it  
> important to reach out to the broader community now. A summary  
> follows:
>
> ---
>
> First, some background about Unicode:
>
> Unicode encodes a number of compatibility characters whose purpose  
> is to enable interchange with legacy encodings and systems. In  
> addition, Unicode uses characters called "combining marks" in a  
> number of scripts to compose glyphs (visual textual units) that  
> have more than one logical "piece" to them. Both of these cases  
> mean that the same semantically identical "character" can be  
> encoded by more than one Unicode character sequence.
>
> A trivial example would be the character 'é' (a Latin small letter  
> 'e' with an acute accent). It can be encoded as U+00E9 or as the  
> sequence U+0065 U+0301. Unicode defines a concept called 'canonical  
> equivalence' in which two strings may be said to be equal  
> semantically even if they do not use the same Unicode code points  
> (characters) in the same order to encode the text. It also defines  
> a canonical decomposition and several normalization forms so that  
> software can transform any canonically equivalent strings into the  
> same code point sequences for comparison and processing purposes.
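
A minimal Python sketch of the 'é' example above, using the standard library's unicodedata module. The helper name canonically_equal is hypothetical, chosen only for illustration; it is not from the original message.

```python
import unicodedata

precomposed = "\u00e9"   # U+00E9, 'é' as a single code point
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# A plain code point comparison treats the two strings as different.
print(precomposed == decomposed)  # False


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# After normalization, both strings have the same code point sequence.
print(canonically_equal(precomposed, decomposed))  # True

# NFC goes the other way and composes the sequence into U+00E9.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

Comparing in NFD or in NFC gives the same answer; what matters is that both sides use the same form.
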
>
> Unicode also defines an additional level of decomposition, called  
> 'compatibility decomposition', for which additional normalization  
> forms exist. Many compatibility characters, unsurprisingly, have  
> compatibility decompositions. However, characters that share a  
> compatibility decomposition are not considered canonically  
> equivalent. An example of this is the character U+2460 (CIRCLED  
> DIGIT ONE). It has a compatibility decomposition to U+0031 (the  
> digit '1'), but it is not considered canonically equivalent to the  
> number '1' in a string of text.
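
The difference between canonical and compatibility decomposition can be illustrated with a short Python sketch, again assuming the standard unicodedata module. The variable names are illustrative only.

```python
import unicodedata

circled_one = "\u2460"  # CIRCLED DIGIT ONE

# The canonical forms (NFC, NFD) leave the character untouched,
# because it has no canonical decomposition.
print(unicodedata.normalize("NFC", circled_one) == circled_one)  # True
print(unicodedata.normalize("NFD", circled_one) == circled_one)  # True

# The compatibility forms (NFKC, NFKD) fold it to the digit '1'.
print(unicodedata.normalize("NFKC", circled_one) == "1")  # True

# Compatibility mappings carry a formatting tag in angle brackets;
# canonical mappings (such as the one for U+00E9) do not.
print(unicodedata.decomposition(circled_one))
print(unicodedata.decomposition("\u00e9"))
```
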
>
> For the purposes of this document, we are discussing only canonical  
> equivalence and canonical decompositions.
>
> Unicode has two particular rules about the handling of canonical  
> equivalence that concern us here: C6 and C7. C6 says that  
> implementations shall not assume that two canonically equivalent  
> sequences are distinct [there exist reasons why one might not  
> conform to C6]. C7 says that *any* process may replace a character  
> sequence with its canonically equivalent sequence.
>
> Normalization @ W3C:
>
> Which brings us to XML, HTML, CSS, and so forth. I will use XML as  
> the main example here, since it is the REC that most directly  
> addresses these issues. The other standards mainly ignore the issue  
> (they don't require normalization, but they mostly don't address it  
> either).
>
> XML, like many other W3C RECs, uses the Universal Character Set  
> (Unicode) as one of its basic foundations. An XML document, in  
> fact, is a sequence of Unicode code points, even though the actual  
> bits and bytes of a serialized XML file might use some other  
> character encoding. Processing is always specified in terms of  
> Unicode code points (logical characters). Like other W3C RECs, XML  
> says nothing about canonical equivalence in Unicode. There exist  
> recommendations to avoid compatibility characters (which includes  
> some canonical equivalences) and 'Name' prevents starting the  
> various named tokens with certain characters which include  
> combining marks.
>
> Most implementations of XML assume that distinct code point  
> sequences are actually distinct, which is not in keeping with  
> Unicode C6 [1]. That is, if I define one element <!ELEMENT &#xE9;  
> EMPTY> and another element <!ELEMENT &#x65;&#x301; EMPTY>, they are  
> usually considered to be separate elements, even though both define  
> an element that looks like <é/> in a document and even though any  
> text process is allowed to convert one sequence into the other,  
> according to Unicode. For that matter, a transcoder might produce  
> either of those sequences when converting a file from a non-Unicode  
> legacy encoding.
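
To make the element-name case concrete, here is a rough Python sketch using the standard xml.etree.ElementTree parser. The document string and the helper find_by_name are assumptions made up for this illustration; the normalize flag shows what normalization-aware name matching could look like and is not behavior that any XML specification requires.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Two child elements whose names look identical but use different
# code point sequences: U+00E9 versus U+0065 U+0301.
doc = "<root><\u00e9/><e\u0301/></root>"
root = ET.fromstring(doc)


def find_by_name(parent, name, normalize=False):
    """Hypothetical helper: return children whose tag matches name.

    With normalize=True both sides are converted to NFC first, so
    canonically equivalent names compare equal.
    """
    def key(s):
        return unicodedata.normalize("NFC", s) if normalize else s

    return [child for child in parent if key(child.tag) == key(name)]


# Code point comparison: only the precomposed element matches.
print(len(find_by_name(root, "\u00e9")))                  # 1

# Normalization-aware comparison: both elements match.
print(len(find_by_name(root, "\u00e9", normalize=True)))  # 2
```
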
>
> One might think that this would be a serious problem. However, most  
> software systems consistently use a single Unicode representation  
> to represent most languages/scripts, even though multiple  
> representations are theoretically possible in Unicode. This form is  
> typically very similar to Unicode Normalization Form C (or "NFC"),  
> in which as many combining marks as possible are combined with base  
> characters to form a single code point (NFC also specifies the  
> order in which combining marks that cannot be combined appear; no  
> Unicode normalization form guarantees that *no* combining marks  
> will appear, as some languages cannot be encoded at all except via  
> the use of combining characters). As a result, few users encounter  
> issues with Unicode canonical equivalence.
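
The remark that NFC also fixes the order of marks that cannot be combined can be sketched in Python as follows. The letter 'q' is chosen here only because it has no precomposed form with either mark; the example is illustrative and not from the original message.

```python
import unicodedata

dot_above = "\u0307"  # COMBINING DOT ABOVE
dot_below = "\u0323"  # COMBINING DOT BELOW

a = "q" + dot_above + dot_below
b = "q" + dot_below + dot_above

# The two sequences differ code point by code point...
print(a == b)  # False

# ...but NFC puts the marks into a single canonical order, sorted by
# canonical combining class (lower class first).
print(unicodedata.combining(dot_below), unicodedata.combining(dot_above))
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True

# Nothing was composed: the NFC result still contains combining marks.
print(len(unicodedata.normalize("NFC", a)))  # 3
```
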
>
> However, some languages and their writing systems have features  
> that expose or rely on canonical equivalence. For example, some  
> languages make use of combining marks and the order of the  
> combining marks can vary. Other languages use multiple accent marks  
> and their input systems may pre-compose or not compose characters  
> depending on the keyboard layout, operating system, fonts, or the  
> software used to edit text. Vietnamese is an example of this. Since  
> canonically equivalent text is (supposed to be) visually  
> indistinguishable, users typically don't care that (for example)  
> their Mac uses a different code point sequence than their  
> neighbor's Windows computer. These languages are sensitive to  
> canonical equivalence and rely on consistent normalization in order  
> to be used with a technology such as XML. Further, many of the  
> Ur-technologies are now used in combination. For example, a site might  
> use XML for data interchange, XSLT to extract the data into an HTML  
> page for presentation, CSS to style that page, and AJAX for the  
> user to interact with the page.
>
> With this potential problem in mind, eleven (!) years ago the I18N  
> WG started to work on a Specification to address the use of Unicode  
> in W3C specs. This work is collectively called the "Character Model  
> for the World Wide Web" or 'CharMod' for short [2]. Initially, the  
> WG approach was to recommend what was termed "early uniform  
> normalization" (EUN). In EUN, virtually all content and markup was  
> supposed to be in a single normalization form, specifically NFC.  
> The WG identified the need for both document formats (e.g. a file  
> called 'something.xml') to be in NFC as well as the parsed contents  
> of the document (individual elements, attributes, or content within  
> the document). This was called "fully normalized" content.
>
> For this recommendation to work, tools, keyboards, operating  
> environments, text editors, and so on would need to provide for  
> normalizing data either on input (as with most European languages)  
> or when processing, saving, or interacting with the resulting data.  
> Specifications were expected to require normalization whenever  
> possible. In cases where normalization wasn't required at the  
> document format level, de-normalized documents would be 'valid',  
> but could cause warnings to be issued in a validator, or  
> de-normalized content could be normalized by tools at creation time.  
> The benefit to this approach was that specifications and certain  
> classes of implementation could mostly assume that users had  
> avoided the problems with canonical equivalence by always authoring  
> documents in the same way. Using this approach, there would be less  
> need, for example, for CSS Selectors to consider normalization,  
> since both the style sheet (or other source of the selector) and  
> the document tree being matched would use the same code point  
> sequence for the same canonically equivalent text. The user-agent  
> could just compare strings. In the few cases where this wasn't the  
> case, the user would be responsible for fixing it, but generally  
> the user was responsible for carefully having constructed the issue  
> in the first place (since their tools and formats would normally  
> have used NFC).
>
> I18N WG worked on this approach for a number of years while  
> languages with normalization risks began to develop appreciable  
> computing support and a nascent Web presence. CharMod contained  
> strong recommendations but not requirements (with a notable  
> exception that we'll cover in a moment) towards normalization.  
> Since it didn't matter so long as content remained normalized and  
> since the most common languages were normalized by default,  
> specifications generally didn't require any level of normalization  
> (although they "should" do so), implementations generally ignored  
> normalization, tools did not implement it, and so forth.
>
> There was one interesting exception to the recommendations in  
> CharMod. String identity matching *required* (MUST) the use of  
> normalization (requirement C312). This nod to canonical equivalence  
> was also ignored by most specs, implementations, and thus content.  
> It should be noted that CSS Selectors is a string identity matching  
> case and not merely one of the "SHOULD" cases.
>
> From being a mostly theoretical problem, normalization has become  
> something that can be demonstrated in real usage scenarios in real  
> languages. While only a quite small percentage of total content is  
> affected, it quite directly impacts specific languages [3]. The  
> I18N WG is engaged in finding out exactly how prevalent this  
> problem is. It is possible that, despite having become a real  
> problem, it is still so strictly limited that it can be dealt with  
> best via other means than spec-level requirements.
>
> In early 2005, the I18N WG decided that EUN as an approach was not  
> tenable because so many different system components,  
> technologies, and so forth would be affected by a "requirement" to  
> normalize; that some technologies (such as keyboarding systems)  
> were not up to the task; and that, as a result, content would not  
> be normalized uniformly. The decision was made to change from  
> focusing on EUN towards a policy something like:
>
> 1. Recommend the use of normalized content ("EUN") as a "best  
> practice" for content authors and implementers.
> 2. Since content might not be normalized, require specifications  
> affected by normalization to address normalization explicitly.
>
> Surprisingly, none of the requirements are actually changed by this  
> difference in focus. Note that this did not mean that normalization  
> would be required universally; it only meant that WGs would be  
> asked to consider the impact or, in some cases, to change their  
> specification.
>
> In 2008, at the TPAC, I18N WG reviewed the current WD with an eye  
> towards finally completing the normalization portion of the work (a  
> separate working group in the Internationalization Activity had  
> been chartered between 2005 and 2008 to do the work; this working  
> group expired with no progress and "I18N Core" inherited the  
> unfinished work). I18N's review revealed that the current document  
> state was not sufficient for advising spec, content, or  
> implementation authors about when and how to handle the new  
> "late(r)" normalization. The same review produced general  
> acknowledgement that there now existed significant need based on  
> real content for normalization to be handled by W3C Specs.
>
> At the very end of 2008, I18N WG also reviewed the Selectors-API  
> draft produced by WebApps. In reviewing this document, the WG noted  
> that Selectors, upon which the API is based, did not address  
> normalization. Other recent REC-track documents had also been  
> advised about normalization and had ended up requiring the use of  
> NFC internally. However, in the case of Selectors-API, the  
> selectors in question were in CSS3 and were in a late working draft  
> state. CSS WG responded to this issue and a long thread has  
> developed on our combined mail lists, in a wiki, and elsewhere.
>
> Over the past two-plus weeks, the I18N WG has solicited advice and  
> comments from within its own community, from Unicode, and from the  
> various CSS (style), XML, and HTML communities. We have embarked on  
> a full-scale review of what position makes the most sense for the  
> W3C to hold. In our most recent conference call (11 February), we  
> asked members and our Unicode liaison to gather information on the  
> overall scope of the problem on the Web today. We also are  
> gathering information on the impact of different kinds of  
> normalization recommendation. We had expected to complete our  
> review at the 11 February concall, but feel we need an additional  
> week.
>
> There are a few points of emerging consensus within I18N. In  
> particular, if normalization is required, such a requirement  
> probably could be limited to identifier tokens, markup, and other  
> formal parts of document formats. Content itself should not  
> generally be required to be normalized (a recommendation should  
> certainly be made and normalization, of course, is always permitted  
> by users or some process--see Unicode C7), in part because there  
> exist use cases for de-normalized content.
>
> The other emerging consensus is that canonical equivalence needs to  
> be dealt with once and for all. WGs should not have the CharMod  
> sword hanging over them and implementers and content authors should  
> get clear guidance. During this review, I18N is considering all  
> possible positions, from "merely" making normalization a best  
> practice to advocating the retrofitting of normalization to our  
> core standards (as appropriate, see above).
>
> One of the oft-cited reasons why normalization should not be  
> introduced is implementation performance. Assuming, for a moment,  
> that documents are allowed to be canonicalized for processing  
> purposes, our experience suggests that overall performance impact  
> can be limited. There exist strategies for checking and normalizing  
> data that are very efficient, in part owing to the relative rarity  
> of denormalized data, even in the affected languages. This document  
> will not attempt to outline the performance cases for or against  
> normalization, except to note that performance *is* an important  
> consideration and *must* be addressed.
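
One commonly used strategy of the "check first, normalize only if needed" kind can be sketched in Python as below. This is an assumption about what such a strategy might look like, not a description of any particular implementation; the function name to_nfc is hypothetical, and unicodedata.is_normalized requires Python 3.8 or later.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Hypothetical helper: return text in NFC, doing as little work as possible."""
    # Pure ASCII text is already in NFC, so it can be returned unchanged.
    if text.isascii():
        return text
    # Check whether the text is already normalized; this avoids building
    # a new string in the common case where nothing needs to change.
    if unicodedata.is_normalized("NFC", text):
        return text
    # Only de-normalized input pays the cost of a full normalization pass.
    return unicodedata.normalize("NFC", text)


already_nfc = "caf\u00e9"
denormalized = "cafe\u0301"

print(to_nfc(already_nfc) is already_nfc)    # True: same object, no copy made
print(to_nfc(denormalized) == already_nfc)   # True: composed on demand
```

Whether this is fast enough in a given user agent is a measurement question that the sketch does not answer.
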
>
> I hope this summary is helpful in discussing normalization. I want  
> to raise this issue now so that all of the affected parts of the  
> W3C community can consider this issue and how it affects their  
> specifications/implementations/tests/tools/etc. As a consensus  
> (hopefully) emerges (not just within I18N), we should be in a  
> position to finally resolve the normalization conundrum and  
> proceed to create a global Web that works well for all the world's  
> users.
>
> Kind Regards,
>
> Addison (for I18N)
>
> [1] http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf Unicode  
> Conformance
> [2] http://www.w3.org/TR/charmod-norm/ CharMod-Normalization part
> [3] http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0128.html Ishida example
>   [a] http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0182.html and another list
>
> Addison Phillips
> Globalization Architect -- Lab126
> Chair -- W3C Internationalization WG
>
> Internationalization is not a feature.
> It is an architecture.
>
>
Received on Sunday, 15 February 2009 15:02:35 UTC