Re: Unicode normalization in CSS from Mark Davis ☕ on 2011-04-08 (public-i18n-core@w3.org from April to June 2011)

From: Mark Davis ☕ <mark@macchiato.com>
Date: Fri, 8 Apr 2011 15:34:04 -0700
To: fantasai <fantasai.lists@inkedblade.net>
Cc: "public-i18n-core@w3.org" <public-i18n-core@w3.org>, "www-style@w3.org" <www-style@w3.org>, Unicode <unicode@unicode.org>
Message-ID: <BANLkTim3+DbGf=Fp4jzP8BEKYn+rkvJs7A@mail.gmail.com>

> without a non-lossy normalization scheme, which Unicode currently lacks

Depending on what is meant, this is either trivially true, trivially false,
or materially false. Any sense of "non-lossy" can only be measured against
what is expected to be maintained. A normalization scheme that preserves all
code points and their order is completely lossless under any measure; it is
the identity operation. And Unicode has that ;-).

Any other normalization scheme loses some information; that is the purpose,
after all, of normalization. The question is what that information is. NFC,
for example, maintains all Unicode canonical equivalences, since that is
what it is measured against. That is, two strings are canonically equivalent
iff their NFC forms are the same.
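
As a rough illustration of that last point, here is a minimal Python sketch
using the standard unicodedata module. The helper name
canonically_equivalent is made up for this example, and the sample strings
are my own choices rather than anything from the thread.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings by their NFC forms."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
ligature = "\uFB01"      # LATIN SMALL LIGATURE FI

# Canonically equivalent spellings collapse to the same NFC form.
assert canonically_equivalent(precomposed, decomposed)

# The ligature is only compatibility-equivalent to "fi", so NFC keeps
# the two distinct; that difference is not "lost".
assert not canonically_equivalent(ligature, "fi")
```

What NFC discards is only the distinction between canonically equivalent
spellings; compatibility distinctions such as the ligature survive.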

> (NFC having been hijacked by the anti-compatibility-character crusades)

That's a myth; there was no hijacking going on. What did happen was a mistake
early on: the CJK compatibility characters were categorized as canonical
equivalents of the corresponding unified ideographs. That was recognized
later on, but by then stability considerations prevented it from being fixed.
Excluding those, there are relatively few characters (currently) that are
not allowed in NFC.

However, the CJK compatibility characters were themselves a rather broken
approach, and a much better one has developed in the meantime: the IVD
(http://www.unicode.org/ivd/). And those sequences are maintained by NFC.
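
A small Python sketch of both behaviours, again using only the standard
unicodedata module. The specific code points are my own picks for
illustration, and I am assuming nothing about whether this particular
base-plus-selector pair is registered in the IVD; registration does not
affect how NFC treats it.

```python
import unicodedata


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


# A CJK compatibility ideograph has a canonical (singleton) decomposition,
# so NFC replaces it with the unified ideograph.
compat = "\uF900"    # CJK COMPATIBILITY IDEOGRAPH-F900
unified = "\u8C48"   # the unified ideograph it maps to
assert nfc(compat) == unified

# An ideographic variation sequence is a base ideograph followed by a
# variation selector (here VARIATION SELECTOR-17, U+E0100). Neither code
# point decomposes, so NFC leaves the sequence untouched.
ivs = "\u845B\U000E0100"
assert nfc(ivs) == ivs
assert len(nfc(ivs)) == 2
```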

Mark

*— Il meglio è l’inimico del bene —*

On Thu, Apr 7, 2011 at 17:11, fantasai <fantasai.lists@inkedblade.net> wrote:

> There was a very very very long thread on Unicode normalization in CSS
> back in January/February of 2009. IIRC the conclusion was that the
> problem is much bigger than CSS, and I18n had some work yet to do to
> figure it all out.
>
> Is that a correct recollection?
>
> Daniel Glazman has been collecting outstanding issues filed against
> CSS Namespaces since we now have the implementations to move to PR,
> and this was one of them. But I couldn't find any conclusions to the
> discussion.
>
> I think realistically we have two options here:
>  1. Nothing is normalized in CSS.
>  2. CSS-internal user-defined identifiers are normalized to NFC, i.e.
>       - counter names
>       - namespace prefixes
>       - etc.
>     We already make a distinction between user-defined and CSS-defined
>     names in that user-defined names are case-sensitive.
>     http://www.w3.org/blog/CSS/2007/12/12/case_sensitivity
>
> Within #2 we could
>  - Normalize at "parse" time, i.e. before exposing such identifiers
>    to the CSSOM.
>      - In this case we need to decide whether unquoted font names are
>        also affected. Probably yes.
>  - Normalize at "match" time, i.e. store and expose the identifiers
>    unnormalized, but define that they represent the same thing.
>
> The third option would be to normalize the whole CSS file, but from
> the discussions about interactions with XML, HTML, the DOM, etc. this
> did not seem feasible, at least not without a non-lossy normalization
> scheme, which Unicode currently lacks (NFC having been hijacked by the
> anti-compatibility-character crusades).
>
> So I guess the question is, what's the right way forward here?
>
> ~fantasai
>
>

Received on Friday, 8 April 2011 22:36:51 UTC