- From: Arthur Barstow <art.barstow@nokia.com>
- Date: Sun, 15 Feb 2009 10:01:41 -0500
- To: public-webapps <public-webapps@w3.org>
FYI. A summary of the Unicode Normalization issue by Addison Phillips.
Begin forwarded message:
> From: "ext Phillips, Addison" <addison@amazon.com>
> Date: February 12, 2009 6:47:14 PM EST
> Cc: "public-i18n-core@w3.org" <public-i18n-core@w3.org>
> Subject: normalization issue summary for HCG
> Archived-At: <http://www.w3.org/mid/4D25F22093241741BC1D0EEBC2DBB1DA0181C82C13@EX-SEA5-D.ant.amazon.com>
>
> All,
>
> Recently, the I18N Core WG raised an issue with Selectors-API and
> consequently with CSS Selectors related to Unicode Normalization.
> This note is an attempt to summarize the issue and its attendant
> history at W3C, as well as to explain the current I18N Core
> approach to this topic. I am bringing this issue to HTML-CG so that
> all the affected WGs can be aware of the debate. I realize that
> this email is very lengthy, but I think it is necessary to fully
> summarize the issue. Some have wondered about what, exactly, "those
> I18N guys want from us" and I think it is important to provide a
> clear position.
>
> At present, I don't have a fixed WG position, since the I18N WG is
> carefully reviewing its stance. However, this is an issue of
> sufficient importance and potential impact, that we felt it
> important to reach out to the broader community now. A summary
> follows:
>
> ---
>
> First, some background about Unicode:
>
> Unicode encodes a number of compatibility characters whose purpose
> is to enable interchange with legacy encodings and systems. In
> addition, Unicode uses characters called "combining marks" in a
> number of scripts to compose glyphs (visual textual units) that
> have more than one logical "piece" to them. Both of these cases
> mean that the same semantic "character" can be encoded by more
> than one Unicode code point sequence.
>
> A trivial example would be the character 'é' (a latin small letter
> 'e' with an acute accent). It can be encoded as U+00E9 or as the
> sequence U+0065 U+0301. Unicode defines a concept called 'canonical
> equivalence' in which two strings may be said to be equal
> semantically even if they do not use the same Unicode code points
> (characters) in the same order to encode the text. It also defines
> a canonical decomposition and several normalization forms so that
> software can transform any canonically equivalent strings into the
> same code point sequences for comparison and processing purposes.
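>
> To make this concrete, here is a small illustration using Python's
> standard unicodedata module (just a sketch of the Unicode behaviour
> described above, not something any W3C spec requires):
>
>     import unicodedata
>
>     precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
>     decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
>
>     # A naive code point comparison treats the two spellings as different.
>     print(precomposed == decomposed)   # False
>
>     # Normalizing both to the same form (NFC here) makes them compare equal.
>     print(unicodedata.normalize("NFC", precomposed) ==
>           unicodedata.normalize("NFC", decomposed))   # True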
>
> Unicode also defines an additional level of decomposition, called
> 'compatibility decomposition', for which additional normalization
> forms exist. Many compatibility characters, unsurprisingly, have
> compatibility decompositions. However, characters that share a
> compatibility decomposition are not considered canonically
> equivalent. An example of this is the character U+2460 (CIRCLED
> DIGIT ONE). It has a compatibility decomposition to U+0031 (the
> digit '1'), but it is not considered canonically equivalent to the
> number '1' in a string of text.
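>
> Again a small Python sketch, showing that U+2460 is only affected by
> the compatibility ("K") normalization forms and not by canonical
> normalization:
>
>     import unicodedata
>
>     circled_one = "\u2460"   # CIRCLED DIGIT ONE
>
>     # Canonical normalization (NFC/NFD) leaves the character alone...
>     print(unicodedata.normalize("NFC", circled_one) == circled_one)   # True
>
>     # ...but compatibility normalization (NFKC/NFKD) folds it to '1'.
>     print(unicodedata.normalize("NFKC", circled_one))   # 1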
>
> For the purposes of this document, we are discussing only canonical
> equivalence and canonical decompositions.
>
> Unicode has two particular rules about the handling of canonical
> equivalence that concern us here: C6 and C7. C6 says that
> implementations shall not assume that two canonically equivalent
> sequences are distinct [there exist reasons why one might not
> conform to C6]. C7 says that *any* process may replace a character
> sequence with its canonically equivalent sequence.
>
> Normalization @ W3C:
>
> Which brings us to XML, HTML, CSS, and so forth. I will use XML as
> the main example here, since it is the REC that most directly
> addresses these issues. The other standards mainly ignore the issue
> (they don't require normalization, but they mostly don't address it
> either).
>
> XML, like many other W3C RECs, uses the Universal Character Set
> (Unicode) as one of its basic foundations. An XML document, in
> fact, is a sequence of Unicode code points, even though the actual
> bits and bytes of a serialized XML file might use some other
> character encoding. Processing is always specified in terms of
> Unicode code points (logical characters). Like other W3C RECs, XML
> says nothing about canonical equivalence in Unicode. There exist
> recommendations to avoid compatibility characters (which include
> some canonical equivalences), and the 'Name' production prevents
> the various named tokens from starting with certain characters,
> which include combining marks.
>
> Most implementations of XML assume that distinct code point
> sequences are actually distinct, which is not in keeping with
> Unicode C6 [1]. That is, if I define one element <!ELEMENT é
> EMPTY> and another element <!ELEMENT é EMPTY>, where one name uses
> the precomposed U+00E9 and the other the decomposed U+0065 U+0301,
> they are
> usually considered to be separate elements, even though both define
> an element that looks like <é/> in a document and even though any
> text process is allowed to convert one sequence into the other--
> according to Unicode. For that matter, a transcoder might produce
> either of those sequences when converting a file from a non-Unicode
> legacy encoding.
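>
> A rough sketch of how this plays out in an implementation (Python,
> with a purely hypothetical element table; no real XML parser is
> implied):
>
>     import unicodedata
>
>     # Hypothetical parser state: element declarations keyed by the exact
>     # code points that appeared in the DTD (here the precomposed spelling).
>     declared_elements = {"\u00e9": "EMPTY"}
>
>     # A document (or a transcoder) happens to produce the decomposed spelling.
>     name_in_document = "e\u0301"
>
>     # Raw lookup fails: the two spellings are distinct code point sequences.
>     print(name_in_document in declared_elements)   # False
>
>     # Lookup succeeds only if both sides are normalized to the same form.
>     normalized_table = {unicodedata.normalize("NFC", k): v
>                         for k, v in declared_elements.items()}
>     print(unicodedata.normalize("NFC", name_in_document) in normalized_table)   # True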
>
> One might think that this would be a serious problem. However, most
> software systems consistently use a single Unicode representation
> to represent most languages/scripts, even though multiple
> representations are theoretically possible in Unicode. This form is
> typically very similar to Unicode Normalization Form C (or "NFC"),
> in which as many combining marks as possible are combined with base
> characters to form a single code point (NFC also specifies the
> order in which combining marks that cannot be combined appear; no
> Unicode normalization form guarantees that *no* combining marks
> will appear, as some languages cannot be encoded at all except via
> the use of combining characters). As a result, few users encounter
> issues with Unicode canonical equivalence.
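>
> As a small aside, the ordering behaviour can be seen with a made-up
> letter-plus-two-marks sequence (Python sketch; the particular marks
> are chosen only for illustration):
>
>     import unicodedata
>
>     # 'q' with COMBINING DOT ABOVE (U+0307) and COMBINING DOT BELOW (U+0323),
>     # typed in two different orders. Neither mark composes with 'q'.
>     order_a = "q\u0307\u0323"
>     order_b = "q\u0323\u0307"
>
>     print(order_a == order_b)   # False: different code point sequences
>
>     # NFC puts the marks into canonical order, so both inputs normalize
>     # to the identical sequence, combining marks and all.
>     print(unicodedata.normalize("NFC", order_a) ==
>           unicodedata.normalize("NFC", order_b))   # True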
>
> However, some languages and their writing systems have features
> that expose or rely on canonical equivalence. For example, some
> languages make use of combining marks and the order of the
> combining marks can vary. Other languages use multiple accent marks
> and their input systems may pre-compose or not compose characters
> depending on the keyboard layout, operating system, fonts, or the
> software used to edit text. Vietnamese is an example of this. Since
> canonically equivalent text is (supposed to be) visually
> indistinguishable, users typically don't care that (for example)
> their Mac uses a different code point sequence than their
> neighbor's Windows computer. These languages are sensitive to
> canonical equivalence and rely on consistent normalization in order
> to be used with a technology such as XML. Further, many of the Ur-
> technologies are now used in combination. For example, a site might
> use XML for data interchange, XSLT to extract the data into an HTML
> page for presentation, CSS to style that page, and AJAX for the
> user to interact with the page.
>
> With this potential problem in mind, eleven (!) years ago the I18N
> WG started to work on a Specification to address the use of Unicode
> in W3C specs. This work is collectively called the "Character Model
> for the World Wide Web" or 'CharMod' for short [2]. Initially, the
> WG approach was to recommend what was termed "early uniform
> normalization" (EUN). In EUN, virtually all content and markup was
> supposed to be in a single normalization form, specifically NFC.
> The WG identified the need for both the document format (e.g. a
> file called 'something.xml') and the parsed contents of the
> document (individual elements, attributes, or content within the
> document) to be in NFC. This was called "fully normalized" content.
>
> For this recommendation to work, tools, keyboards, operating
> environments, text editors, and so on would need to provide for
> normalizing data either on input (as with most European languages)
> or when processing, saving, or interacting with the resulting data.
> Specifications were expected to require normalization whenever
> possible. In cases where normalization wasn't required at the
> document format level, de-normalized documents would be 'valid',
> but a validator could issue warnings, or tools could normalize
> de-normalized content at creation time.
> The benefit of this approach was that specifications and certain
> classes of implementation could largely assume that users had
> avoided the problems with canonical equivalence by always authoring
> documents in the same way. Using this approach, there would be less
> need, for example, for CSS Selectors to consider normalization,
> since both the style sheet (or other source of the selector) and
> the document tree being matched would use the same code point
> sequence for the same canonically equivalent text. The user-agent
> could just compare strings. In the few cases where this wasn't the
> case, the user would be responsible for fixing it; in practice, the
> user would have had to construct such a mismatch deliberately in
> the first place (since their tools and formats would normally have
> used NFC).
>
> I18N WG worked on this approach for a number of years while
> languages with normalization risks began to develop appreciable
> computing support and a nascent Web presence. CharMod contained
> strong recommendations but not requirements (with a notable
> exception that we'll cover in a moment) towards normalization.
> Since it didn't matter so long as content remained normalized and
> since the most common languages were normalized by default,
> specifications generally didn't require any level of normalization
> (although they "should" do so), implementations generally ignored
> normalization, tools did not implement it, and so forth.
>
> There was one interesting exception to the recommendations in
> CharMod. String identity matching *required* (MUST) the use of
> normalization (requirement C312). This nod to canonical equivalence
> was also ignored by most specs, implementations, and thus content.
> It should be noted that CSS Selectors is a string identity matching
> case and not merely one of the "SHOULD" cases.
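>
> To make the Selectors case concrete, here is a tiny sketch (Python;
> the class name and the matching logic are purely hypothetical, not
> how any particular user-agent is implemented) of what happens when
> selector matching is done by raw code point comparison:
>
>     import unicodedata
>
>     # Hypothetical example: a rule ".café { ... }" authored with the
>     # precomposed é, matched against a document whose class attribute
>     # was saved with the decomposed spelling.
>     selector_class = "caf\u00e9"
>     document_class = "cafe\u0301"
>
>     # A selector engine that treats matching as raw string identity
>     # fails to match, even though the strings are canonically equivalent.
>     print(selector_class == document_class)   # False
>
>     # Normalizing both sides first (as CharMod's string identity
>     # matching requirement, C312, would have it) makes the match succeed.
>     print(unicodedata.normalize("NFC", selector_class) ==
>           unicodedata.normalize("NFC", document_class))   # True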
>
> Normalization has gone from being a mostly theoretical problem to
> something that can be demonstrated in real usage scenarios in real
> languages. While only a small percentage of total content is
> affected, it quite directly impacts specific languages [3]. The
> I18N WG is engaged in finding out exactly how prevalent this
> problem is. It is possible that, despite having become a real
> problem, it is still so strictly limited that it can best be dealt
> with via means other than spec-level requirements.
>
> In early 2005, the I18N WG decided that EUN as an approach was not
> tenable because so many different system components,
> technologies, and so forth would be affected by a "requirement" to
> normalize; that some technologies (such as keyboarding systems)
> were not up to the task; and that, as a result, content would not
> be normalized uniformly. The decision was made to change from
> focusing on EUN towards a policy something like:
>
> 1. Recommend the use of normalized content ("EUN") as a "best
> practice" for content authors and implementers.
> 2. Since content might not be normalized, require specifications
> affected by normalization to address normalization explicitly.
>
> Surprisingly, none of the requirements are actually changed by this
> difference in focus. Note that this did not mean that normalization
> would be required universally; it only meant that WGs would be
> asked to consider the impact or, in some cases, to change their
> specification.
>
> In 2008, at the TPAC, I18N WG reviewed the current WD with an eye
> towards finally completing the normalization portion of the work (a
> separate working group in the Internationalization Activity had
> been chartered between 2005 and 2008 to do the work; this working
> group expired with no progress and "I18N Core" inherited the
> unfinished work). I18N's review revealed that the current document
> state was not sufficient for advising spec, content, or
> implementation authors about when and how to handle the new
> "late(r)" normalization. The same review produced general
> acknowledgement that there now existed significant need based on
> real content for normalization to be handled by W3C Specs.
>
> At the very end of 2008, I18N WG also reviewed the Selectors-API
> draft produced by WebApps. In reviewing this document, the WG noted
> that Selectors, upon which the API is based, did not address
> normalization. Other recent REC-track documents had also been
> advised about normalization and had ended up requiring the use of
> NFC internally. However, in the case of Selectors-API, the
> selectors in question were defined in CSS3 and were in a late
> working draft state. The CSS WG responded to this issue, and a long
> thread has developed on our combined mailing lists, in a wiki, and
> elsewhere.
>
> Over the past two-plus weeks, the I18N WG has solicited advice and
> comments from within its own community, from Unicode, and from the
> various CSS (style), XML, and HTML communities. We have embarked on
> a full-scale review of what position makes the most sense for the
> W3C to hold. In our most recent conference call (11 February), we
> asked members and our Unicode liaison to gather information on the
> overall scope of the problem on the Web today. We also are
> gathering information on the impact of different kinds of
> normalization recommendation. We had expected to complete our
> review at the 11 February concall, but feel we need an additional
> week.
>
> There are a few points of emerging consensus within I18N. In
> particular, if normalization is required, such a requirement
> probably could be limited to identifier tokens, markup, and other
> formal parts of document formats. Content itself should not
> generally be required to be normalized (a recommendation should
> certainly be made and normalization, of course, is always permitted
> by users or some process--see Unicode C7), in part because there
> exist use cases for de-normalized content.
>
> The other emerging consensus is that canonical equivalence needs to
> be dealt with once and for all. WGs should not have the CharMod
> sword hanging over them and implementers and content authors should
> get clear guidance. During this review, I18N is considering all
> possible positions, from "merely" making normalization a best
> practice to advocating the retrofitting of normalization to our
> core standards (as appropriate, see above).
>
> One of the oft-cited reasons why normalization should not be
> introduced is implementation performance. Assuming, for a moment,
> that documents are allowed to be canonicalized for processing
> purposes, our experience suggests that overall performance impact
> can be limited. There exist strategies for checking and normalizing
> data that are very efficient, in part owing to the relative rarity
> of denormalized data, even in the affected languages. This document
> will not attempt to outline the performance cases for or against
> normalization, except to note that performance *is* an important
> consideration and *must* be addressed.
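>
> As one sketch of such a strategy (Python again, assuming Python 3.8+
> for is_normalized; this illustrates the general idea and is not a
> statement about any particular implementation):
>
>     import unicodedata
>
>     def to_nfc(text):
>         # Fast path: most real-world text is already in NFC, and checking
>         # that is cheaper than rebuilding the string (Unicode's "quick
>         # check" properties exist for exactly this purpose).
>         if unicodedata.is_normalized("NFC", text):
>             return text
>         # Slow path, taken only for the relatively rare denormalized input.
>         return unicodedata.normalize("NFC", text)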
>
> I hope this summary is helpful in discussing normalization. I want
> to raise this issue now so that all of the affected parts of the
> W3C community can consider this issue and how it affects their
> specifications/implementations/tests/tools/etc. As a consensus
> (hopefully) emerges (not just within I18N), we should be in a
> position to finally resolve the normalization conundrum and best
> proceed to create a global Web that works well for all the world's
> users.
>
> Kind Regards,
>
> Addison (for I18N)
>
> [1] http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf Unicode Conformance
> [2] http://www.w3.org/TR/charmod-norm/ CharMod, Normalization part
> [3] http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0128.html Ishida example
> [a] http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0182.html and another list
>
> Addison Phillips
> Globalization Architect -- Lab126
> Chair -- W3C Internationalization WG
>
> Internationalization is not a feature.
> It is an architecture.
>
>
Received on Sunday, 15 February 2009 15:02:35 UTC