- From: Sampo Syreeni <decoy@iki.fi>
- Date: Fri, 15 Jul 2011 08:45:09 +0300 (EEST)
- To: "Phillips, Addison" <addison@lab126.com>
- cc: "Roy T. Fielding" <fielding@gbiv.com>, URI <uri@w3.org>
On 2011-07-14, Phillips, Addison wrote:

> NFKC destroys some real semantic differences (whereas NFC is generally
> considered fairly benign).

Unicode characters are not supposed to carry any semantics beyond what is encoded in them by the standard(s). Thus, canonical equivalence means that any two characters related by it are exactly the same. If they are handled differently from each other in any way, anywhere, the implementation is by definition not Unicode/ISO 10646 conformant. The different compliance levels muddy this basic idea somewhat, true, but this is how it was meant to be.

Compatibility equivalence, on the other hand, is basically an interim measure: a concession to existing character codings which do carry meaning, and to round-tripping between Unicode and existing, stupider encodings. It is not something you should espouse when working primarily in Unicode, but something you should do away with in favour of explicit tagging. In fact, most of the time you should just drop the difference altogether, without any further tagging, and treat compatibility-equivalent characters as the same. But if you really, really can't, you should still compatibility-decompose and move the semantics onto a higher-level protocol, like HTML or whatnot.

As such, in the end, what Unicode is supposed to look like in its pure form is what follows from putting everything into NFKD. Without exception, and raising an ill-formed-character-encoding error every time you see something that is not in compliance. If you need anything beyond that, you are supposed to relegate it to some higher-level protocol, while flat out refusing to input or output anything that isn't formally and verifiably in NFKD (i.e. in True Unicode).

> It could even introduce some visual oddities, such as the character
> U+00BC (vulgar fraction one quarter) becoming the sequence "1/4"
> (albeit the / is not %2F it is U+2044 FRACTION SLASH)

That is by design: that sort of thing isn't part of the character model, but about how characters might be used as part of some higher-level protocol/syntax, such as MathML or whatnot. Fractions and the like do not belong in Unicode, and the only reason they have been allowed in is as an interim blemish, hopefully soon to go away for good. If NFKD leads to "visual oddities", it's because your software for some reason doesn't implement the proper higher-level protocol correctly, and/or misunderstands what Unicode is about.
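For concreteness, here is what those two kinds of decomposition actually do, in a quick sketch of my own using Python's stock unicodedata module (an illustration, not anything from the thread):

    import unicodedata

    # Canonical equivalence: precomposed U+00E9 and "e" plus U+0301
    # COMBINING ACUTE ACCENT decompose to the very same sequence.
    assert unicodedata.normalize("NFKD", "\u00e9") == "e\u0301"

    # Compatibility decomposition: U+00BC VULGAR FRACTION ONE QUARTER
    # becomes "1" + U+2044 FRACTION SLASH + "4" -- the "visual oddity"
    # quoted above, left for a higher-level protocol to render.
    assert unicodedata.normalize("NFKD", "\u00bc") == "1\u20444"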
> [...] The main reason to have normalization for templates would appear
> to me to be the normalization of character sequences in a variable
> name. [...]

To me it seems there is a definite disconnect between how the Unicode/ISO folks think about the character model and how it is utilized in practice. If the original intent behind the character model were the real aim, we wouldn't be having these sorts of discussions in the first place. We'd only wonder about how to deal with NFKD, with its unorthodox, open-ended suffix form. It could then be tackled purely by technical means, without these kinds of policy debates, even if it led to some rather nasty string-parsing code in the process.

> It might be better to just handle sequences that don't match as not
> matching (e.g. the user is responsible for normalization) or perhaps
> referencing UAX#31 http://www.unicode.org/reports/tr31/ on what makes
> a valid identifier. Note that normalization does not eliminate the
> potential for problems such as combining marks to start a sequence.

Such things are ungrammatical wrt Unicode, so I'd say just fail gracefully on them. After that, either a) fail for any NFKD violation in either comparand, and then for any bitwise or lengthwise mismatch, or (more usually) b) always normalize to strict, formal NFKD and fail upon the first unmatched bit and/or string-length mismatch. That's how it's supposed to work, it's easier/cheaper to implement than most alternatives, and as a matter of fact it already shields you from half of the homograph attacks which things like stringprep try to defend against. Not to mention all of the Unicode-specific attacks...
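To make options a) and b) concrete, a minimal Python sketch (the names are mine, and unicodedata.is_normalized needs Python 3.8 or later):

    import unicodedata

    def match_strict(a: str, b: str) -> bool:
        # Option a): refuse any comparand that is not already in
        # formal NFKD, including the leading-combining-mark case
        # UAX#31 warns about, then compare bit for bit.
        for s in (a, b):
            if not unicodedata.is_normalized("NFKD", s):
                raise ValueError("comparand not in NFKD")
            if s and unicodedata.combining(s[0]):
                raise ValueError("leading combining mark")
        return a == b

    def match_lenient(a: str, b: str) -> bool:
        # Option b): normalize both sides to NFKD first, then do the
        # same plain codepoint-for-codepoint comparison.
        return (unicodedata.normalize("NFKD", a)
                == unicodedata.normalize("NFKD", b))

Either way, the comparison itself stays a dumb bitwise match, which is rather the point.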
--
Sampo Syreeni, aka decoy - decoy@iki.fi, http://decoy.iki.fi/front
+358-50-5756111, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2

Received on Friday, 15 July 2011 05:45:53 UTC