
Unicode normalization in DOM

From: Simon Montagu <smontagu@smontagu.org>
Date: Thu, 01 Jun 2006 07:57:32 +0200
Message-ID: <447E81CC.5080004@smontagu.org>
To: www-international@w3.org

I am trying to understand the practical implications of the "Character
Model Normalization" document, with particular reference to web browsers
and DOM interfaces.

http://www.w3.org/TR/2005/WD-charmod-norm-20051027/#C302 says:

|A text-processing component that receives suspect text MUST NOT
|perform any normalization-sensitive operations unless it has first
|either confirmed through inspection that the text is in normalized form
|or it has re-normalized the text itself. Private agreements MAY,
|however, be created within private systems which are not subject to
|these rules, but any externally observable results MUST be the same as
|if the rules had been obeyed.

I wrote a testcase based on my understanding of this paragraph:
http://smontagu.org/testcases/normalizationTest.html

The testcase uses 5 different forms of the text "ngữ", using different
combinations and orderings of "u", U+01B0 (LATIN SMALL LETTER U WITH
HORN), U+0303 (COMBINING TILDE), U+0169 (LATIN SMALL LETTER U WITH
TILDE), U+031B (COMBINING HORN), and U+1EEF (LATIN SMALL LETTER U WITH
HORN AND TILDE). Taking the examples of "normalization-sensitive
operations" from
http://www.w3.org/TR/2005/WD-charmod-norm-20051027/#def-normalization-sensitive,
I tested counting the number of characters, deleting the last character,
and comparing strings.

My understanding of C302 is that in all cases, the number of characters
should be 3, deleting the last character should give "ng", and comparing
any of the strings to any of the others should find them equal.
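
The expectations above can be checked outside a browser. What follows is
only a minimal sketch in Python, not the testcase itself: it assumes that
"normalized form" means NFC, that "characters" means code points, and that
the five forms are the ones listed here. The names FORMS and nfc are
illustrative.

```python
import unicodedata

# Five canonically equivalent spellings of "ngữ" (assumed to match the
# combinations described in the message; the list name is hypothetical).
FORMS = [
    "ng\u1EEF",         # precomposed u with horn and tilde
    "ng\u01B0\u0303",   # u with horn + combining tilde
    "ng\u0169\u031B",   # u with tilde + combining horn
    "ngu\u031B\u0303",  # u + combining horn + combining tilde
    "ngu\u0303\u031B",  # u + combining tilde + combining horn
]


def nfc(text):
    """Return text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# Code point counts before and after normalization.
raw_lengths = [len(s) for s in FORMS]
nfc_lengths = [len(nfc(s)) for s in FORMS]

# Deleting the last code point of each normalized form.
truncated = {nfc(s)[:-1] for s in FORMS}

# Comparing the normalized forms with each other.
all_equal = len({nfc(s) for s in FORMS}) == 1

print(raw_lengths, nfc_lengths, truncated, all_equal)
```

Without the nfc() step the lengths differ between forms and the strings
compare unequal, which is the behaviour the browsers showed.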

I also tested creating a URL query string from the different forms of
the text. Here, the browser is the producer of the query, so (by C312),
it MUST perform full normalization.
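
As a rough illustration of what normalizing before producing the query
would mean, here is a sketch in Python. It assumes NFC as the target form
and UTF-8 percent-encoding, and it uses a hypothetical parameter name "q";
none of these details come from the testcase.

```python
import unicodedata
from urllib.parse import urlencode

# The same five spellings of "ngữ" as in the testcase description
# (assumed; the list name is hypothetical).
FORMS = [
    "ng\u1EEF",
    "ng\u01B0\u0303",
    "ng\u0169\u031B",
    "ngu\u031B\u0303",
    "ngu\u0303\u031B",
]


def build_query(text, normalize=True):
    """Build a query string, optionally normalizing the value to NFC first."""
    if normalize:
        text = unicodedata.normalize("NFC", text)
    # urlencode percent-encodes the UTF-8 bytes of the value.
    return urlencode({"q": text})


raw_queries = [build_query(s, normalize=False) for s in FORMS]
nfc_queries = [build_query(s) for s in FORMS]

print(raw_queries)
print(nfc_queries)
```

With normalization every form yields the same query string; without it,
each form yields a different one.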

No browser that I tested (Firefox, IE6, Konqueror, Opera) performs
normalization in any of the testcases. I realize that "CharNorm" is a
Working Draft and it's early days to expect compliance, but are my
assumptions at least correct in theory?

Simon Montagu
Mozilla i18n
Received on Thursday, 1 June 2006 04:51:29 GMT
