Re: [WebCGM2.1][LC Review] i18n comment 6: Unicode normalization

Hello Lofton,

Thanks for the note on WebCGM 2.1. This response is on behalf of the Internationalization Core WG [0].

The Internationalization Core WG generally recommends using Unicode Normalization Form C (NFC) for normalization-sensitive operations such as string comparison. While this isn't always the right choice, it appears to us that it makes the most sense for font name matching for these reasons:

 - Most files and font names will probably already use NFC, so the need to actually normalize strings will be reduced. (Checking whether a string is already normalized is faster and easier than normalizing it.)
 - Any file that uses ISO 8859-1 (Latin-1) as its encoding, for example, is already in NFC.
 - NFC is generally considered a non-destructive normalization, unlike the compatibility forms NFKC and NFKD.
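
As a rough illustration of the points above, here is a minimal Python sketch of NFC-based font name matching. The function names `to_nfc` and `font_names_match` and the sample font names are hypothetical and appear in neither WebCGM nor the ACL format. The sketch assumes both names have already been decoded to Unicode strings, and it uses `unicodedata.is_normalized`, which needs Python 3.8 or later.

```python
import unicodedata


def to_nfc(name: str) -> str:
    """Return name in NFC, skipping the conversion when it is already NFC."""
    # Checking is cheaper than normalizing, and most names pass the check.
    if unicodedata.is_normalized("NFC", name):
        return name
    return unicodedata.normalize("NFC", name)


def font_names_match(a: str, b: str) -> bool:
    """Compare two font names after bringing both to NFC."""
    return to_nfc(a) == to_nfc(b)


# Hypothetical font names: precomposed U+00E9 vs. "e" + combining acute U+0301.
precomposed = "Caf\u00e9 Sans"
decomposed = "Cafe\u0301 Sans"

# Same base letter with two combining marks in different orders.
marks_a = "q\u0307\u0323"
marks_b = "q\u0323\u0307"

# Every character that ISO 8859-1 can encode, decoded to Unicode.
latin1_text = bytes(range(256)).decode("latin-1")
```

A raw code point comparison of `precomposed` and `decomposed` fails, while `font_names_match` treats them as equal; the same holds for `marks_a` and `marks_b`. `latin1_text` passes the NFC check without any conversion.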

Please note that case-insensitive comparison is not addressed by Unicode normalization.
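
To make that distinction concrete, here is a small Python sketch. It is only an illustration: nothing in this thread says that WebCGM font matching should ignore case. The helper name `caseless_key` is hypothetical; it follows the general shape of Unicode canonical caseless matching (decompose, case fold, decompose again) using the standard `unicodedata` module and `str.casefold`.

```python
import unicodedata


def caseless_key(name: str) -> str:
    """Build a comparison key that ignores case and canonical differences."""
    decomposed = unicodedata.normalize("NFD", name)
    # Case folding can produce sequences that are no longer normalized,
    # so the result is normalized again.
    return unicodedata.normalize("NFD", decomposed.casefold())


upper = unicodedata.normalize("NFC", "CAF\u00c9 SANS")
lower = unicodedata.normalize("NFC", "cafe\u0301 sans")

nfc_only_equal = upper == lower
caseless_equal = caseless_key(upper) == caseless_key(lower)
```

Here `nfc_only_equal` is false, because NFC leaves case untouched, while `caseless_equal` is true.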

For specific information on normalization, see both Unicode Standard Annex #15 [1] and the W3C Character Model, Part 2 (Normalization) [2]. The latter is still a working draft and is currently being revised. Please contact us on public-i18n-core@ if you have additional questions or concerns. We'd be happy to work with you to resolve this issue appropriately.

Best Regards (for I18N Core),

Addison

[0] http://www.w3.org/2008/12/17-core-minutes.html
[1] http://www.unicode.org/reports/tr15/
[2] http://www.w3.org/TR/charmod-norm/

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.

===
-----Original Message-----
From: public-i18n-core-request@w3.org [mailto:public-i18n-core-request@w3.org] On Behalf Of Lofton Henderson
Sent: Wednesday, December 03, 2008 10:58 AM
To: ishida@w3.org
Cc: public-webcgm-wg@w3.org; public-i18n-core@w3.org
Subject: Re: [WebCGM2.1][LC Review] i18n comment 6: Unicode normalization


Hello, and thanks for the helpful I18N comments on the WebCGM 2.1 Last Call 
review.

After some research into the details of Comment #6 -- that WebCGM should 
use a Unicode normalization form for font-name-string comparisons -- we see 
the wisdom of it for reliable matching.  But lacking deep expertise on the 
topic, we'd welcome further advice.

Question:  Do you have a recommendation for which of the four normalization 
forms would be best?

For background, recall that the subject string comparison is seeking a 
match between:  on the one hand, a font-name-string as extracted from a 
WebCGM instance; and on the other hand, a font-name-string from the ACL 
file (a separate XML file) that specifies the font-name to be matched.

We would expect Unicode normalization to potentially make a difference in 
those cases wherein the first string (font-name from WebCGM instance) is 
outside the well-defined core set of thirteen (13) fixed names that are 
required by the WebCGM standard.  The character encoding in the WebCGM 
instance will be either ISOLatin1, or Unicode UTF8 or UTF16.

If the answer is not simple enough for efficient email resolution, we would 
welcome your participation in our teleconference of Thursday, 04-dec, 11am 
EST.  (Or alternately two weeks later if you can't make tomorrow.)  Please 
let me know, and I will send telecon logistics.

Thanks,
-Lofton Henderson
(Chair WebCGM WG)


At 10:29 AM 11/11/2008 +0000, ishida@w3.org wrote:

>Comment from the i18n review of:
>http://www.w3.org/TR/2008/WD-webcgm21-20080917/WebCGM21-Config.html#ACI-fontmap
>
>Comment 6
>At http://www.w3.org/International/reviews/0811-webcgm/
>Editorial/substantive: S
>Tracked by: RI
>
>Location in reviewed document:
>9.3.2.2 
>[http://www.w3.org/TR/2008/WD-webcgm21-20080917/WebCGM21-Config.html#ACI-maplist]
>
>Comment:
>Normalization for string comparison should include conversion to a Unicode 
>normalization form, to eliminate issues related to precomposed vs. 
>decomposed characters and issues related to ordering of multiple combining 
>characters.
>
>

Received on Wednesday, 17 December 2008 23:27:19 UTC