- From: Robert J Burns <rob@robburns.com>
- Date: Wed, 11 Feb 2009 13:50:35 -0600
- To: W3C Style List <www-style@w3.org>, public-i18n-core@w3.org
Hi Bjoern,

Bjoern Hoehrmann <derhoermi@gmx.net> wrote:
> As far as CSS Selectors go, they largely state that strings are to be
> compared using some collation defined by the language of the document.
> Used on an XML document, if the XML specification says NFC(Björn) and
> NFD(Björn) are different IDs, then there is no basis for Selectors to
> match otherwise.

I'd like to offer some clarification. This discussion is mostly focused on string matching as a binary operation: two strings either match or they do not. The discussion is therefore not really about collation of strings (some of which may be canonically equivalent), but only about whether they are an exact canonical match, so language-specific collation is not involved.

More importantly, the XML specification does not say that canonically equivalent identifiers should be treated as opaque streams of bytes. Rather, they must be treated as Unicode strings, with all the implications of the Unicode text processing model that this entails. So while there is some ambiguity in these specifications, no one has yet offered a convincing argument for why Unicode in XML may (let alone should or must) drop canonical equivalence. Instead, I would say that any W3C recommendation that normatively references Unicode brings with it a presumption that canonically equivalent strings are an exact match. This means that some form of normalization is needed on the consumer side to ensure that such strings match.

Anne's repeated suggestion that authors might be relying on two different representations of the same canonically equivalent string in existing implementations is completely untenable. Leaving the issue of singletons aside, there is neither a visual nor a logical basis for using two identifiers that differ only in the order of their combining marks, or in whether precomposed or decomposed characters are used in the character sequence.
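To make the distinction concrete, here is a minimal sketch in Python of what consumer-side normalization before matching could look like. The helper name canonically_equal and the choice of NFC as the comparison form are my own assumptions for illustration; neither the XML nor the CSS specification prescribes this function.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two identifiers after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


nfc_name = "Bj\u00f6rn"    # precomposed o with diaeresis (U+00F6)
nfd_name = "Bjo\u0308rn"   # plain o followed by combining diaeresis (U+0308)

# Code point by code point, the two spellings differ.
print(nfc_name == nfd_name)                    # False

# After normalization, they are an exact match.
print(canonically_equal(nfc_name, nfd_name))   # True

# A case difference is not a canonical equivalence, so it stays distinct.
print(canonically_equal("Bj\u00f6rn", "bj\u00f6rn"))  # False
```

Normalizing both sides at comparison time, rather than assuming the producer already normalized, is what lets the match succeed no matter which editor wrote the identifier.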
We're not talking about case differences such as Björn and björn, where there is at least some visual distinction, but about Björn and Björn. Why would an author ever intend the latter pair as distinct identifiers? (Note that my email application normalizes on paste, but my text editors, where most identifier coding takes place, do not, even though both applications use the Cocoa Text System as their base text handling engine. So there is a lot of variability here that cannot simply be wished away.)

> As such this discussion is largely misplaced on www-style.

Except that CSS, since it normatively references Unicode, also needs to normalize its character stream (just like XML, HTML4, HTML5, etc.).

Take care,
Rob
Received on Wednesday, 11 February 2009 19:51:19 UTC