- From: Mark Davis ☕ <mark@macchiato.com>
- Date: Fri, 8 Apr 2011 15:34:04 -0700
- To: fantasai <fantasai.lists@inkedblade.net>
- Cc: "public-i18n-core@w3.org" <public-i18n-core@w3.org>, "www-style@w3.org" <www-style@w3.org>, Unicode <unicode@unicode.org>
- Message-ID: <BANLkTim3+DbGf=Fp4jzP8BEKYn+rkvJs7A@mail.gmail.com>
> without a non-lossy normalization scheme, which Unicode currently lacks

Depending on what is meant, this is either trivially true, trivially false, or materially false. Any sense of "non-lossy" can only be measured against what is expected to be maintained.

A normalization scheme that preserves all code points and their order is completely lossless under any measure; it is the identity operation. And Unicode has that ;-)

Any other normalization scheme loses some information; that is the purpose, after all, of normalization. The question is what that information is. NFC, for example, maintains all Unicode canonical equivalences, since that is what it is measured against. That is, two strings are canonically equivalent iff their NFC forms are the same.

> (NFC having been hijacked by the anti-compatibility-character crusades)

That's a myth; there was no hijacking going on. What there was, was a mistake early on in categorizing the CJK compatibility characters as canonical equivalents. That was recognized later on, but by then stability considerations prevented it from being fixed. Excluding those, there are relatively few characters (currently) that are not allowed in NFC.

However, the CJK compatibility characters were themselves a rather broken approach, and a much better one has developed in the meantime: the IVD (http://www.unicode.org/ivd/). And those sequences are maintained by NFC.

Mark

*— The best is the enemy of the good —*

On Thu, Apr 7, 2011 at 17:11, fantasai <fantasai.lists@inkedblade.net> wrote:

> There was a very very very long thread on Unicode normalization in CSS
> back in January/February of 2009. IIRC the conclusion was that the
> problem is much bigger than CSS, and I18n had some work yet to do to
> figure it all out.
>
> Is that a correct recollection?
>
> Daniel Glazman has been collecting outstanding issues filed against
> CSS Namespaces since we now have the implementations to move to PR,
> and this was one of them. But I couldn't find any conclusions to the
> discussion.
>
> I think realistically we have two options here:
> 1. Nothing is normalized in CSS.
> 2. CSS-internal user-defined identifiers are normalized to NFC, i.e.
>    - counter names
>    - namespace prefixes
>    - etc.
> We already make a distinction between user-defined and CSS-defined
> names in that user-defined names are case-sensitive.
> http://www.w3.org/blog/CSS/2007/12/12/case_sensitivity
>
> Within #2 we could
> - Normalize at "parse" time, i.e. before exposing such identifiers
>   to the CSSOM.
>   - In this case we need to decide whether unquoted font names are
>     also affected. Probably yes.
> - Normalize at "match" time, i.e. store and expose the identifiers
>   unnormalized, but define that they represent the same thing.
>
> The third option would be to normalize the whole CSS file, but from
> the discussions about interactions with XML, HTML, the DOM, etc. this
> did not seem feasible, at least not without a non-lossy normalization
> scheme, which Unicode currently lacks (NFC having been hijacked by the
> anti-compatibility-character crusades).
>
> So I guess the question is, what's the right way forward here?
>
> ~fantasai
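[Editor's note, not part of the original thread: a minimal Python sketch, using only the standard-library unicodedata module, illustrating two claims from Mark's reply — that canonical equivalence coincides with NFC equality, and that IVD sequences pass through NFC untouched. The code points chosen are illustrative examples, not taken from the thread.]

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    # Two strings are canonically equivalent iff their NFC forms are the same.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# U+00E9 (precomposed e-acute) vs U+0065 U+0301 (e + combining acute accent):
# distinct code point sequences, but canonically equivalent.
assert canonically_equivalent("\u00e9", "e\u0301")

# An Ideographic Variation Sequence -- a base CJK ideograph followed by a
# variation selector (here U+845B + VS17, U+E0100) -- is left unchanged by
# NFC: variation selectors are inert under canonical normalization.
ivs = "\u845b\U000e0100"
assert unicodedata.normalize("NFC", ivs) == ivs

# fantasai's "match time" option for option #2 would amount to storing
# user-defined CSS identifiers (counter names, namespace prefixes, etc.)
# as authored and comparing them with a predicate like the one above.
```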
Received on Friday, 8 April 2011 22:34:32 UTC