
Re: [draft] Unicode Normalization: requests for CSS-WG, HTML-CG agendum

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Sun, 08 Feb 2009 17:11:25 +0900
To: "Phillips, Addison" <addison@amazon.com>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>

I haven't had time to read all the messages in the recent threads,
but I have to disagree with your conclusions.

First, the I18N WG has worked on this issue for more than 10 years
(it started at the WWW Conference in Brisbane in 1998), and the WG
itself has meandered a lot: from a strong initial focus on early
normalization, to a stress on very complex definitions such as
"fully normalized" for top security, to the recent apparent trend
towards "late matching". All of this came without many concrete
results (in terms of convincing other WGs and implementers), but
also without much concrete damage. After all that, I don't think
we can suddenly claim that this is now urgently needed and get
everybody convinced.

Second, I think that if we want to send a message like the one
below, we have to be much more specific about actually affected
languages and other details.

Third, I don't believe that late matching as proposed below is
the right solution. It's much easier to say that content should
be in NFC. Where it's relevant, the right tools will emerge.
Editors that are currently not able to output NFC will be
changed to do so, validating and fixup tools (even Web-based)
will emerge if there is a need, and so on. It can be done
one by one, and has a better chance of success than
trying to convince every HTML, CSS, XML,... tool on the
market to do late matching.
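
To give an idea of how small such a validating/fixup tool can be,
here is a minimal sketch in Python using the standard unicodedata
module. The function names is_nfc and to_nfc and the sample string
are made up for illustration; a real tool would also have to deal
with file encodings, markup, and so on.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Validation step: True if the text is already in NFC."""
    return unicodedata.normalize("NFC", text) == text


def to_nfc(text: str) -> str:
    """Fixup step: return the NFC form of the text."""
    return unicodedata.normalize("NFC", text)


# Hypothetical sample: 'e' followed by U+0301 COMBINING ACUTE ACCENT,
# i.e. decomposed content as some editors might produce it.
decomposed = "re\u0301sume\u0301"
fixed = to_nfc(decomposed)

print(is_nfc(decomposed))  # validation fails on the decomposed input
print(is_nfc(fixed))       # and passes after the fixup
```

Once content has gone through something like this, selectors and
the documents they apply to can be compared code point by code
point without any further work.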

Regards,    Martin.

At 08:08 09/02/07, Phillips, Addison wrote:
>Draft of message for html-cg@ and www-style@. Comments please?
>(This message is on behalf of the Internationalization Core WG)
>Previously I was tasked with "pinging" the CSS-WG [1] about normalization 
>in CSS3-Selectors due to a comment we made about Selectors-API [2]. In our 
>last teleconference, my action was expanded to include requesting 
>coordination at HTML-CG. 
>Please read the whole email before falling out of your chair. Richard 
>Ishida produced a potentially useful summary of the discussion at [3].

[3] seems to be missing.

>The problem with normalization in Selectors as we see it is:
>1. Two canonically equivalent strings can be represented in Unicode by 
>different code point sequences. Canonical equivalence in Unicode means that 
>the strings are "equal" logically, even if they are encoded using different 
>character sequences.
>2. A number of languages and scripts are actually implemented in more than 
>one way,

Very confusing.

>resulting in there being more than one code point sequence used in 
>real data for the same canonically equivalent text. The text in these cases 
>is visually and semantically identical and is usually intended to be 
>identical. Some languages (such as most common European languages) are 
>difficult to produce in a non-normalized form and do not often exhibit 
>normalization problems, while other languages are difficult to produce 
>consistently in a normalized form. 
>3. Therefore normalization has become an emergent problem for string 
>matching in certain languages. As string matching is a key element of 
>Selectors, the CSS WG should address it in their spec in some fashion. I18N 
>would prefer that this take the form of requiring strings that are 
>canonically equivalent according to Unicode to match.
>4. This problem is not restricted to CSS Selectors but is a general problem. 
>5. Many implementers are concerned about the impact of normalization on 
>their existing implementation's performance and upon interoperability.
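
For readers who want to see what points 1 to 3 amount to in
practice, here is a rough sketch in Python of a match that treats
canonically equivalent strings as equal (what is called "late
matching" above). The function name canonically_equal and the
sample strings are assumptions for illustration only, not
something taken from the draft or from any Selectors
implementation.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings under Unicode canonical equivalence.

    Both sides are normalized at comparison time; the stored
    strings themselves are left untouched.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Hypothetical names as they might appear in a selector and a document.
selector_name = "caf\u00e9"   # precomposed U+00E9
document_name = "cafe\u0301"  # 'e' + U+0301 COMBINING ACUTE ACCENT

print(selector_name == document_name)                   # plain code point comparison
print(canonically_equal(selector_name, document_name))  # comparison under canonical equivalence

# Combining marks typed in a different order are also canonically equivalent.
print(canonically_equal("a\u0302\u0323", "a\u0323\u0302"))
```

The cost is that every comparison in every tool has to do this
work, which is the implementation concern raised in point 5.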
>Therefore, the Internationalization WG:
>  - hereby formally requests that CSS make Unicode normalization a 
>requirement for matching in CSS3 Selectors
>  - requests that this be a topic at the next HTML Coordination Group call
>  - requests the support of CSS, HTML, XML, and other WGs to ensure a 
>consistent and reasonable resolution to this problem is adopted
>Please note:
>  - We recognize that there are performance and implementation issues to be 
>  - It is possible that the resolution of this issue would be that content 
>authors, rather than implementations, need to be aware of and deal with it. 
>I18N feels that it is somewhat premature to jump to this conclusion because 
>of the difficulty users of many of these languages would have in 
>identifying and working around the problem.
>  - It is possible that normalization on read/parse of XML or other 
>document formats is the most desirable solution. Clarification is needed on 
>whether this is a valid thing to do and at what stage the normalization may 
>be applied. 
>  - There is disagreement about where Unicode normalization is permitted in 
>XML in particular. Normalization of documents as part of the parsing 
>process would reduce the impact and overhead on underlying specifications 
>such as selectors. Non-normalization of documents at this level suggests 
>that each specification will need to deal with normalization independently. 
>XML 1.0 5e, QT, C14N have different levels of support for Unicode 
>normalization and none directly address what is permitted.
>  - There are some potential concerns about how the interplay of HTML, CSS, 
>and JavaScript might be affected by parser or processor normalization. In 
>particular, JavaScript is not "normalization aware" and can be used to 
>write into the document's body directly.
>Best Regards,
>[1] http://www.w3.org/2009/01/28-core-minutes.html#ActionSummary
>[2] http://lists.w3.org/Archives/Public/public-i18n-core/2008OctDec/0140.html
>Addison Phillips
>Globalization Architect -- Lab126
>Chair -- W3C Internationalization WG
>Internationalization is not a feature.
>It is an architecture.

#-#-#  Martin J. Dürst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp     
Received on Sunday, 8 February 2009 09:18:27 UTC
