- From: Arthur Barstow <art.barstow@nokia.com>
- Date: Sun, 15 Feb 2009 10:01:41 -0500
- To: public-webapps <public-webapps@w3.org>
FYI. A summary of the Unicode Normalization issue by Addison Phillips. Begin forwarded message:

> From: "ext Phillips, Addison" <addison@amazon.com>
> Date: February 12, 2009 6:47:14 PM EST
> Cc: "public-i18n-core@w3.org" <public-i18n-core@w3.org>
> Subject: normalization issue summary for HCG
> Archived-At: <http://www.w3.org/mid/4D25F22093241741BC1D0EEBC2DBB1DA0181C82C13@EX-SEA5-D.ant.amazon.com>
>
> All,
>
> Recently, the I18N Core WG raised an issue with Selectors-API and consequently with CSS Selectors related to Unicode Normalization. This note is an attempt to summarize the issue and its attendant history at W3C, as well as to explain the current I18N Core approach to this topic. I am bringing this issue to HTML-CG so that all the affected WGs can be aware of the debate. I realize that this email is very lengthy, but I think it is necessary to fully summarize the issue. Some have wondered what, exactly, "those I18N guys" want from us, and I think it is important to provide a clear position.
>
> At present, I don't have a fixed WG position, since the I18N WG is carefully reviewing its stance. However, this is an issue of sufficient importance and potential impact that we felt it important to reach out to the broader community now. A summary follows:
>
> ---
>
> First, some background about Unicode:
>
> Unicode encodes a number of compatibility characters whose purpose is to enable interchange with legacy encodings and systems. In addition, Unicode uses characters called "combining marks" in a number of scripts to compose glyphs (visual textual units) that have more than one logical "piece" to them. Both of these cases mean that the same semantically identical "character" can be encoded by more than one Unicode character sequence.
>
> A trivial example would be the character 'é' (a Latin small letter 'e' with an acute accent). It can be encoded as U+00E9 or as the sequence U+0065 U+0301. Unicode defines a concept called 'canonical equivalence' in which two strings may be said to be equal semantically even if they do not use the same Unicode code points (characters) in the same order to encode the text. It also defines a canonical decomposition and several normalization forms so that software can transform any canonically equivalent strings into the same code point sequences for comparison and processing purposes.
>
> Unicode also defines an additional level of decomposition, called 'compatibility decomposition', for which additional normalization forms exist. Many compatibility characters, unsurprisingly, have compatibility decompositions. However, characters that share a compatibility decomposition are not considered canonically equivalent. An example of this is the character U+2460 (CIRCLED DIGIT ONE). It has a compatibility decomposition to U+0031 (the digit '1'), but it is not considered canonically equivalent to the number '1' in a string of text.
>
> For the purposes of this document, we are discussing only canonical equivalence and canonical decompositions.
>
> Unicode has two particular rules about the handling of canonical equivalence that concern us here: C6 and C7. C6 says that implementations shall not assume that two canonically equivalent sequences are distinct [there exist reasons why one might not conform to C6]. C7 says that *any* process may replace a character sequence with its canonically equivalent sequence.
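The equivalences described above can be checked directly; the following is a minimal sketch using Python's standard unicodedata module, with the specific code points taken from the examples in the message:

    import unicodedata

    precomposed = "\u00E9"         # 'é' as a single code point
    decomposed  = "\u0065\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

    # Canonically equivalent, but unequal as raw code point sequences...
    assert precomposed != decomposed

    # ...and equal once both sides are put into the same normalization form.
    assert unicodedata.normalize("NFC", precomposed) == unicodedata.normalize("NFC", decomposed)
    assert unicodedata.normalize("NFD", precomposed) == unicodedata.normalize("NFD", decomposed)

    # Compatibility decomposition is weaker: U+2460 (CIRCLED DIGIT ONE) maps to
    # '1' under NFKC, but the canonical forms (NFC/NFD) leave it alone, because
    # U+2460 and '1' are not canonically equivalent.
    assert unicodedata.normalize("NFKC", "\u2460") == "1"
    assert unicodedata.normalize("NFC", "\u2460") == "\u2460"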
>
> Normalization @ W3C:
>
> Which brings us to XML, HTML, CSS, and so forth. I will use XML as the main example here, since it is the REC that most directly addresses these issues. The other standards mainly ignore the issue (they don't require normalization, but they mostly don't address it either).
>
> XML, like many other W3C RECs, uses the Universal Character Set (Unicode) as one of its basic foundations. An XML document, in fact, is a sequence of Unicode code points, even though the actual bits and bytes of a serialized XML file might use some other character encoding. Processing is always specified in terms of Unicode code points (logical characters). Like other W3C RECs, XML says nothing about canonical equivalence in Unicode. There exist recommendations to avoid compatibility characters (which include some canonical equivalences), and the 'Name' production prevents the various named tokens from starting with certain characters, which include combining marks.
>
> Most implementations of XML assume that distinct code point sequences are actually distinct, which is not in keeping with Unicode C6 [1]. That is, if I define one element <!ELEMENT é EMPTY> and another element <!ELEMENT é EMPTY> (one encoding the 'é' as U+00E9, the other as U+0065 U+0301), they are usually considered to be separate elements, even though both define an element that looks like <é/> in a document and even though any text process is allowed to convert one sequence into the other--according to Unicode. For that matter, a transcoder might produce either of those sequences when converting a file from a non-Unicode legacy encoding.
>
> One might think that this would be a serious problem. However, most software systems consistently use a single Unicode representation for most languages/scripts, even though multiple representations are theoretically possible in Unicode. This form is typically very similar to Unicode Normalization Form C (or "NFC"), in which as many combining marks as possible are combined with base characters to form a single code point. (NFC also specifies the order in which combining marks that cannot be combined appear; no Unicode normalization form guarantees that *no* combining marks will appear, as some languages cannot be encoded at all except via the use of combining characters.) As a result, few users encounter issues with Unicode canonical equivalence.
>
> However, some languages and their writing systems have features that expose or rely on canonical equivalence. For example, some languages make use of combining marks, and the order of the combining marks can vary. Other languages use multiple accent marks, and their input systems may pre-compose or not compose characters depending on the keyboard layout, operating system, fonts, or the software used to edit text. Vietnamese is an example of this. Since canonically equivalent text is (supposed to be) visually indistinguishable, users typically don't care that (for example) their Mac uses a different code point sequence than their neighbor's Windows computer. These languages are sensitive to canonical equivalence and rely on consistent normalization in order to be used with a technology such as XML. Further, many of the Ur-technologies are now used in combination. For example, a site might use XML for data interchange, XSLT to extract the data into an HTML page for presentation, CSS to style that page, and AJAX for the user to interact with the page.
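The XML example above can be sketched in the same terms: a processor that compares names as raw code point sequences keeps the two canonically equivalent 'é' declarations apart, while a normalizing comparison collapses them. The dictionary below merely stands in for a parser's symbol table; it is not how any particular XML library behaves:

    import unicodedata

    # Two <!ELEMENT ... EMPTY> declarations whose names differ only in how 'é' is encoded.
    elements = {}
    elements["\u00E9"] = "EMPTY"      # name declared with U+00E9
    elements["e\u0301"] = "EMPTY"     # name declared with U+0065 U+0301

    # A typical processor treats them as two distinct element names:
    print(len(elements))              # 2

    # Normalizing the names first collapses them to one, per canonical equivalence:
    normalized = {unicodedata.normalize("NFC", name) for name in elements}
    print(len(normalized))            # 1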
> With this potential problem in mind, eleven (!) years ago the I18N WG started to work on a specification to address the use of Unicode in W3C specs. This work is collectively called the "Character Model for the World Wide Web", or 'CharMod' for short [2]. Initially, the WG approach was to recommend what was termed "early uniform normalization" (EUN). In EUN, virtually all content and markup was supposed to be in a single normalization form, specifically NFC. The WG identified the need for both the document format (e.g. a file called 'something.xml') and the parsed contents of the document (individual elements, attributes, or content within the document) to be in NFC. This was called "fully normalized" content.
>
> For this recommendation to work, tools, keyboards, operating environments, text editors, and so on would need to provide for normalizing data either on input (as with most European languages) or when processing, saving, or interacting with the resulting data. Specifications were expected to require normalization whenever possible. In cases where normalization wasn't required at the document format level, de-normalized documents would be 'valid', but could cause warnings to be issued in a validator, or de-normalized content could be normalized by tools at creation time. The benefit of this approach was that specifications and certain classes of implementation could mostly assume that users had avoided the problems with canonical equivalence by always authoring documents in the same way. Using this approach, there would be less need, for example, for CSS Selectors to consider normalization, since both the style sheet (or other source of the selector) and the document tree being matched would use the same code point sequence for the same canonically equivalent text. The user-agent could just compare strings. In the few cases where this wasn't the case, the user would be responsible for fixing it, but generally the user would have had to construct the issue deliberately in the first place (since their tools and formats would normally have used NFC).
>
> The I18N WG worked on this approach for a number of years while languages with normalization risks began to develop appreciable computing support and a nascent Web presence. CharMod contained strong recommendations towards normalization, but not requirements (with a notable exception that we'll cover in a moment). Since it didn't matter so long as content remained normalized, and since the most common languages were normalized by default, specifications generally didn't require any level of normalization (although they "should" do so), implementations generally ignored normalization, tools did not implement it, and so forth.
>
> There was one interesting exception to the recommendations in CharMod. String identity matching *required* (MUST) the use of normalization (requirement C312). This nod to canonical equivalence was also ignored by most specs, implementations, and thus content. It should be noted that CSS Selectors is a string identity matching case and not merely one of the "SHOULD" cases.
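The C312-style "string identity matching" mentioned above amounts to normalizing both sides before comparison. A minimal sketch follows; the function name and the sample class values are illustrative assumptions, not taken from any spec or implementation:

    import unicodedata

    def identity_match(a: str, b: str) -> bool:
        """Compare two strings under Unicode canonical equivalence (NFC on both sides)."""
        return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

    # A class name as typed in a style sheet (precomposed)...
    selector_class = "r\u00E9sum\u00E9"
    # ...matched against the same name as produced by another editor (decomposed).
    document_class = "re\u0301sume\u0301"

    print(selector_class == document_class)                # False: raw string comparison
    print(identity_match(selector_class, document_class))  # True: identity matching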
> From being a mostly theoretical problem, normalization has become something that can be demonstrated in real usage scenarios in real languages. While only a quite small percentage of total content is affected, it quite directly impacts specific languages [3]. The I18N WG is engaged in finding out exactly how prevalent this problem is. It is possible that, despite having become a real problem, it is still so strictly limited that it can be dealt with best via other means than spec-level requirements.
>
> In early 2005, the I18N WG decided that EUN as an approach was not tenable because so many different system components, technologies, and so forth would be affected by a "requirement" to normalize; that some technologies (such as keyboarding systems) were not up to the task; and that, as a result, content would not be normalized uniformly. The decision was made to change from focusing on EUN towards a policy something like:
>
> 1. Recommend the use of normalized content ("EUN") as a "best practice" for content authors and implementers.
> 2. Since content might not be normalized, require specifications affected by normalization to address normalization explicitly.
>
> Surprisingly, none of the requirements are actually changed by this difference in focus. Note that this did not mean that normalization would be required universally; it only meant that WGs would be asked to consider the impact or, in some cases, to change their specification.
>
> In 2008, at the TPAC, the I18N WG reviewed the current WD with an eye towards finally completing the normalization portion of the work (a separate working group in the Internationalization Activity had been chartered between 2005 and 2008 to do the work; this working group expired with no progress and "I18N Core" inherited the unfinished work). I18N's review revealed that the current document state was not sufficient for advising spec, content, or implementation authors about when and how to handle the new "late(r)" normalization. The same review produced general acknowledgement that there now existed significant need, based on real content, for normalization to be handled by W3C specs.
>
> At the very end of 2008, the I18N WG also reviewed the Selectors-API draft produced by WebApps. In reviewing this document, the WG noted that Selectors, upon which the API is based, did not address normalization. Other recent REC-track documents had also been advised about normalization and had ended up requiring the use of NFC internally. However, in the case of Selectors-API, the selectors in question were defined in CSS3 and were in a late working draft state. The CSS WG responded to this issue and a long thread has developed on our combined mail lists, in a wiki, and elsewhere.
>
> Over the past two-plus weeks, the I18N WG has solicited advice and comments from within its own community, from Unicode, and from the various CSS (style), XML, and HTML communities. We have embarked on a full-scale review of what position makes the most sense for the W3C to hold. In our most recent conference call (11 February), we asked members and our Unicode liaison to gather information on the overall scope of the problem on the Web today. We are also gathering information on the impact of different kinds of normalization recommendation. We had expected to complete our review at the 11 February concall, but feel we need an additional week.
> There are a few points of emerging consensus within I18N. In particular, if normalization is required, such a requirement probably could be limited to identifier tokens, markup, and other formal parts of document formats. Content itself should not generally be required to be normalized (a recommendation should certainly be made, and normalization, of course, is always permitted by users or some process--see Unicode C7), in part because there exist use cases for de-normalized content.
>
> The other emerging consensus is that canonical equivalence needs to be dealt with once and for all. WGs should not have the CharMod sword hanging over them, and implementers and content authors should get clear guidance. During this review, I18N is considering all possible positions, from "merely" making normalization a best practice to advocating the retrofitting of normalization to our core standards (as appropriate, see above).
>
> One of the oft-cited reasons why normalization should not be introduced is implementation performance. Assuming, for a moment, that documents are allowed to be canonicalized for processing purposes, our experience suggests that the overall performance impact can be limited. There exist strategies for checking and normalizing data that are very efficient, in part owing to the relative rarity of de-normalized data, even in the affected languages. This document will not attempt to outline the performance cases for or against normalization, except to note that performance *is* an important consideration and *must* be addressed.
>
> I hope this summary is helpful in discussing normalization. I want to raise this issue now so that all of the affected parts of the W3C community can consider this issue and how it affects their specifications/implementations/tests/tools/etc. As a consensus (hopefully) emerges (not just within I18N), we should be in a position to finally resolve the normalization conundrum and proceed to create a global Web that works well for all the world's users.
>
> Kind Regards,
>
> Addison (for I18N)
>
> [1] http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf Unicode Conformance
> [2] http://www.w3.org/TR/charmod-norm/ CharMod-Normalization part
> [3] http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0128.html Ishida example
> [a] http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0182.html and another list
>
> Addison Phillips
> Globalization Architect -- Lab126
> Chair -- W3C Internationalization WG
>
> Internationalization is not a feature.
> It is an architecture.
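On the performance point in the message: one common strategy is a cheap "is it already normalized?" check, falling back to full normalization only when needed, so that only the (rare) de-normalized inputs bear the cost. A minimal sketch, assuming Python 3.8+ for unicodedata.is_normalized():

    import unicodedata

    def to_nfc(text: str) -> str:
        """Return text in NFC, doing the full conversion only when necessary."""
        # The verification pass is cheap; because de-normalized data is rare in
        # practice, most inputs never pay for a full normalization.
        if unicodedata.is_normalized("NFC", text):
            return text
        return unicodedata.normalize("NFC", text)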
Received on Sunday, 15 February 2009 15:02:35 UTC