[draft] Unicode Normalization: requests for CSS-WG, HTML-CG agendum

Draft of message for html-cg@ and www-style@. Comments please?
================
(This message is on behalf of the Internationalization Core WG)

Previously I was tasked with "pinging" the CSS-WG [1] about normalization in CSS3-Selectors due to a comment we made about Selectors-API [2]. In our last teleconference, my action was expanded to include requesting coordination at HTML-CG. 

Please read the whole email before falling out of your chair. Richard Ishida produced a potentially useful summary of the discussion at [3].

The problem with normalization in Selectors as we see it is:

1. Two canonically equivalent strings can be represented in Unicode by different code point sequences. Canonical equivalence in Unicode means that the strings are logically "equal", even if they are encoded using different code point sequences.

2. Support for a number of languages and scripts is implemented in more than one way, so real data contains more than one code point sequence for the same canonically equivalent text. The text in these cases is visually and semantically identical and is usually intended to be identical. Some languages (such as most common European languages) are difficult to produce in a non-normalized form and do not often exhibit normalization problems, while other languages are difficult to produce consistently in a normalized form.

3. Therefore normalization has become an emergent problem for string matching in certain languages. As string matching is a key element of Selectors, the CSS WG should address it in their spec in some fashion. I18N would prefer that this take the form of requiring strings that are canonically equivalent according to Unicode to match.

4. This problem is not restricted to CSS Selectors but is a general problem. 

5. Many implementers are concerned about the impact of normalization on the performance of their existing implementations and on interoperability.
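
As a small illustration of points 1 to 3, here is a sketch in Python using the standard unicodedata module. The sample strings are illustrative choices of ours, not taken from the discussion, and the helper name canonically_equal is hypothetical. It shows three code point sequences for the same letter failing a raw comparison but matching once both sides are normalized to NFC:

```python
import unicodedata

# Three canonically equivalent spellings of the letter "ế" (illustrative sample data).
precomposed = "\u1ebf"           # single code point
partial = "\u00ea\u0301"         # ê followed by COMBINING ACUTE ACCENT
decomposed = "e\u0302\u0301"     # e + COMBINING CIRCUMFLEX + COMBINING ACUTE

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# A raw code point comparison treats the spellings as different strings.
raw_equal = precomposed == partial or partial == decomposed

# A normalization-aware comparison treats them as the same string.
norm_equal = (canonically_equal(precomposed, partial)
              and canonically_equal(partial, decomposed))
```

Here raw_equal is False and norm_equal is True, which is the gap between what an author sees and what a code point comparison reports.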

Therefore, the Internationalization WG:

  - hereby formally requests that CSS make Unicode normalization a requirement for matching in CSS3 Selectors
  - requests that this be a topic at the next HTML Coordination Group call
  - requests the support of CSS, HTML, XML, and other WGs to ensure a consistent and reasonable resolution to this problem is adopted
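
To make the first request concrete, the following is a rough sketch (in Python, with hypothetical function names; it is not how any existing selector engine is known to work) of a class selector match with and without normalization. NFC is assumed as the normalization form purely for illustration:

```python
import unicodedata

def nfc(text: str) -> str:
    """Normalize a string to NFC (assumed form for this sketch)."""
    return unicodedata.normalize("NFC", text)

def class_matches(selector_class: str, element_classes: list[str],
                  normalize: bool = True) -> bool:
    """Hypothetical matcher: does a class selector match an element's class list?"""
    if normalize:
        # Compare normalized forms so canonically equivalent names match.
        return nfc(selector_class) in {nfc(c) for c in element_classes}
    # Plain code point comparison, with no normalization applied.
    return selector_class in element_classes

# The style sheet uses the precomposed form; the document uses the decomposed form.
selector = "caf\u00e9"
element = ["menu", "cafe\u0301"]

plain_result = class_matches(selector, element, normalize=False)
normalized_result = class_matches(selector, element, normalize=True)
```

With these inputs plain_result is False and normalized_result is True. The per-match normalization shown here is also where the performance concern in point 5 would arise.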

Please note:

  - We recognize that there are performance and implementation issues to be confronted. 

  - It is possible that the resolution of this issue would be that content authors, rather than implementations, need to be aware of and deal with it. I18N feels that it is somewhat premature to jump to this conclusion because of the difficulty users of many of these languages would have in identifying and working around the problem.

  - It is possible that normalization on read/parse of XML or other document formats is the most desirable solution. Clarification is needed on whether this is a valid thing to do and at what stage the normalization may be applied. 

  - There is disagreement about where Unicode normalization is permitted in XML in particular. Normalization of documents as part of the parsing process would reduce the impact and overhead on underlying specifications such as Selectors. Non-normalization of documents at this level suggests that each specification will need to deal with normalization independently. XML 1.0 5e, QT, and C14N have different levels of support for Unicode normalization, and none directly addresses what is permitted.

  - There are some potential concerns about how the interplay of HTML, CSS, and JavaScript might be affected by parser or processor normalization. In particular, JavaScript is not "normalization aware" and can be used to write into the document's body directly.
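
The last concern can be sketched as follows. This is a toy model in Python with a hypothetical parse function standing in for a normalizing parser, not a description of any real browser. It shows that normalizing on parse helps only for content that passes through the parser, and that text written later by a script can bypass it:

```python
import unicodedata

def parse(ids: list[str]) -> set[str]:
    """Hypothetical parser that normalizes every id to NFC on read."""
    return {unicodedata.normalize("NFC", i) for i in ids}

# The source file uses decomposed text; the style sheet uses composed text.
doc_ids = parse(["re\u0301sume\u0301"])
found_after_parse = "r\u00e9sum\u00e9" in doc_ids

# A script later writes a decomposed id directly, bypassing the parser.
doc_ids.add("cafe\u0301")
found_after_script = "caf\u00e9" in doc_ids
```

In this model found_after_parse is True, but found_after_script is False because the script-written value never went through normalization.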

Best Regards,

Addison

[1] http://www.w3.org/2009/01/28-core-minutes.html#ActionSummary

[2] http://lists.w3.org/Archives/Public/public-i18n-core/2008OctDec/0140.html



Addison Phillips
Globalization Architect -- Lab126
Chair -- W3C Internationalization WG

Internationalization is not a feature.
It is an architecture.

Received on Friday, 6 February 2009 23:08:44 UTC