RE: [draft] Unicode Normalization: requests for CSS-WG, HTML-CG agendum

… which is why NFC isn’t actually harmful/lossy from a Unicode perspective.

The two “lossy” cases cited were:

- Ideographic variations, where the user wants a specific ideographic variant.

- Singletons. These are (mostly) characters that exist for compatibility reasons (usually round-trip conversion from a legacy encoding) but for which use of another character is strongly recommended. (Both cases are illustrated in the sketch below.)
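
A quick illustration of both cases under NFC (a Python sketch using the standard unicodedata module; the example characters are my choice, not from the thread):

    import unicodedata

    # Singleton: U+212B ANGSTROM SIGN maps to U+00C5 LATIN CAPITAL
    # LETTER A WITH RING ABOVE, so the original code point is lost.
    print(unicodedata.normalize("NFC", "\u212B"))   # 'Å' (U+00C5)

    # CJK compatibility ideograph: U+FA1E maps to the "ordinary"
    # ideograph U+7FBD, erasing the author's choice of variant.
    print(unicodedata.normalize("NFC", "\uFA1E"))   # '羽' (U+7FBD)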

I’m fine with using these characters in content (the characters exist to be used), but document identifiers are a different story. Appendix J of XML 1.0 5e recommends a variety of things regarding the proper definition of “Name” tokens, one of which happens to be NFC (it also explicitly covers IVDs and a few other things). What we’re asking for here, in CSS Selectors and in selection generally, is to increase the strength of that warning by allowing normalization of Name tokens for matching purposes. This is in keeping with the idea that these documents are meant to be sequences of Unicode characters, subject to the Unicode model of text (where canonical equivalence is, well, “equivalent”).
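
Concretely, “normalization for matching purposes” can be as simple as comparing the NFC forms of two Name tokens while leaving the stored content untouched. A sketch (Python again; the function name is mine, purely illustrative):

    import unicodedata

    def names_match(a: str, b: str) -> bool:
        # Compare Name tokens under canonical equivalence without
        # rewriting either string as stored in the document.
        return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

    # U+00E9 (precomposed) vs. U+0065 U+0301 (e + combining acute):
    # canonically equivalent, so a name written either way matches.
    print(names_match("caf\u00E9", "cafe\u0301"))   # True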

If we don’t do this, then users and implementers must assume that normalization will not be done for them at any point. This is not the end of the world (actually, it is the state of the world). But it is the end of CharMod as we’ve known and, uh, er, “loved” it these many years.

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.

From: mark.edward.davis@gmail.com [mailto:mark.edward.davis@gmail.com] On Behalf Of Mark Davis
Sent: Tuesday, February 10, 2009 1:02 PM
To: fantasai
Cc: Martin Duerst; Phillips, Addison; public-i18n-core@w3.org
Subject: Re: [draft] Unicode Normalization: requests for CSS-WG, HTML-CG agendum

The CJK compatibility characters are also variants of the corresponding 'ordinary' character, in that either character could appear in either glyphic form. As a matter of fact, the glyphic shapes of the sources (e.g., JIS) have changed over time.

The Unicode Consortium does recognize that particular glyphic shapes are sometimes important, and has developed a much more comprehensive mechanism to deal with it. See http://unicode.org/reports/tr37/
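
For instance, a variation sequence survives normalization where a compatibility ideograph does not (a Python sketch; the particular sequence is illustrative only, and actual registrations live in the Ideographic Variation Database):

    import unicodedata

    # A variation sequence is a base ideograph plus a variation
    # selector (U+E0100..U+E01EF); normalization leaves it intact.
    ivs = "\u7FBD\U000E0100"
    print(unicodedata.normalize("NFC", ivs) == ivs)   # True: preserved

    # The compatibility ideograph, by contrast, is folded away by NFC.
    print(unicodedata.normalize("NFC", "\uFA1E"))     # '羽' (U+7FBD)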


Mark

On Mon, Feb 9, 2009 at 16:16, fantasai <fantasai.lists@inkedblade.net> wrote:

Martin Duerst wrote:
I haven't read everything, but if your claim ("overly-aggressive")
is true, then early normalization would be better than late matching,
because it would allow those producers that, for whatever reason,
insist that there is a difference to simply not do normalization
for these code points.

The argument is that certain normalization mappings in NFC/NFD
are more like the types of mappings that happen in NFKC/NFKD than
like the compose/decompose/ordering mappings. Therefore early
normalization would cause data loss in the content, whereas late
matching at, e.g., the Selectors level would avoid such data loss
while still allowing such strings to match.
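
To make the contrast concrete (a Python sketch; the characters are illustrative, not from the thread):

    import unicodedata

    doc_id = "\uFA1E"   # author deliberately chose the compat ideograph

    # Early normalization rewrites the content itself: the original
    # code point is gone from the stored document.
    print(unicodedata.normalize("NFC", doc_id) == doc_id)   # False

    # Late matching normalizes only the comparison, so the document
    # keeps its code points while an equivalent selector still matches.
    selector = "\u7FBD"
    print(unicodedata.normalize("NFC", selector)
          == unicodedata.normalize("NFC", doc_id))          # True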

See Ambrose Li's and Robert Burns's comments:
http://lists.w3.org/Archives/Public/www-style/2009Feb/0229.html


~fantasai

Received on Tuesday, 10 February 2009 21:21:31 UTC