Re: Unicode Normalization thread should slow down; summary needed from Robert J Burns on 2009-02-11 (public-i18n-core@w3.org from January to March 2009)

From: Robert J Burns <rob@robburns.com>
Date: Wed, 11 Feb 2009 13:50:35 -0600
To: W3C Style List <www-style@w3.org>, public-i18n-core@w3.org
Message-Id: <D080E6FF-3F13-4467-BFFA-A8709AA37762@robburns.com>

Hi Bjoern,

Bjoern Hoehrmann <derhoermi@gmx.net>  wrote:
> As far as CSS Selectors go, they largely state that strings are to be
> compared using some collation defined by the language of the document.
> Used on an XML document, if the XML specification says NFC(Björn) and
> NFD(Björn) are different IDs, then there is no basis for Selectors to
> match otherwise.

I'd like to offer some clarification. This discussion is mostly
focussed on string matching as a binary operation: two strings either
match or they do not. The discussion is therefore not really about
collation of strings (some of which may be canonically equivalent),
but only about whether they match exactly under canonical equivalence
(so language-specific collation is not involved). More importantly,
the XML specification does not say that
canonically equivalent identifiers should be treated as opaque streams  
of bytes. Rather they must be treated as Unicode strings, with all the  
implications of the Unicode text processing model that entails. So  
while there is some ambiguity in these specifications, no one has yet  
offered a convincing argument that suggests why Unicode in XML may  
(let alone should or must) drop canonical equivalence.

Instead I would say that any W3C recommendation that normatively  
references Unicode brings with it a presumption that canonically  
equivalent strings are an exact match. This means that some form of  
normalization is needed at the consumer side to ensure such strings  
match.
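
As a minimal sketch of what such consumer-side matching could look
like (Python's standard unicodedata module is used purely for
illustration, and the helper name canonically_equal is hypothetical,
not taken from any specification):

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# "Björn" with precomposed U+00F6 versus "o" followed by combining U+0308.
composed = "Bj\u00f6rn"
decomposed = "Bjo\u0308rn"

print(composed == decomposed)                   # False: code points differ
print(canonically_equal(composed, decomposed))  # True: canonically equivalent
```

Normalizing both sides to NFD would give the same answer; the only
requirement is that both strings go through the same form before the
comparison.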

Anne's repeated suggestion that authors might be relying on two  
different representations of the same canonically equivalent string in  
existing implementations is completely untenable. Leaving the issue of
singletons aside, there is neither a visual nor a logical basis for
using two identifiers that differ only in the order of their combining
marks or in whether precomposed or decomposed characters are used in
the character sequence. We're not talking about case differences such
as Björn and
björn where there's at least some visual distinction, but instead  
Björn and Björn. Now why would an author ever intend the latter as
distinct identifiers? (Note that my email application normalizes on
paste, but my text editors, where most identifier coding takes place,
do not, even though both applications use the Cocoa Text System as
their base text handling engine. So there's a lot of variability here
that cannot simply be wished away.)
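
The combining-mark ordering case can be illustrated the same way. This
is only a sketch with an arbitrary example sequence of my choosing
(a base letter with a dot below and a dot above), again using Python's
unicodedata module:

```python
import unicodedata

# Same base letter and the same two marks, typed in opposite orders.
# U+0323 COMBINING DOT BELOW has canonical combining class 220,
# U+0307 COMBINING DOT ABOVE has canonical combining class 230.
below_first = "q\u0323\u0307"
above_first = "q\u0307\u0323"

# A raw code point comparison treats them as different identifiers.
raw_match = below_first == above_first

# Normalization applies canonical ordering, which sorts the marks by
# combining class, so both sequences end up identical.
normalized_match = (
    unicodedata.normalize("NFD", below_first)
    == unicodedata.normalize("NFD", above_first)
)

print(raw_match, normalized_match)
```

The two inputs render identically, yet only the normalized comparison
treats them as the same identifier.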

> As such this discussion is largely misplaced on www-style.

Except that CSS, since it normatively references Unicode, also needs
to normalize its character stream (just like XML, HTML4, HTML5, etc.).

Take care,
Rob

Received on Wednesday, 11 February 2009 19:51:23 UTC