
RE: uri templates: NFKC or NFC

From: Sampo Syreeni <decoy@iki.fi>
Date: Fri, 15 Jul 2011 08:45:09 +0300 (EEST)
To: "Phillips, Addison" <addison@lab126.com>
cc: "Roy T. Fielding" <fielding@gbiv.com>, URI <uri@w3.org>
Message-ID: <Pine.LNX.4.64.1107150747410.11191@lakka.kapsi.fi>
On 2011-07-14, Phillips, Addison wrote:

> NFKC destroys some real semantic differences (whereas NFC is generally 
> considered fairly benign).

Unicode characters are not supposed to carry any semantics beyond what 
is encoded in them by the standard(s). Thus, canonical equivalence means 
that any two characters related by it are exactly the same. If they're 
handled in any way differently from each other, anywhere, the 
implementation is by definition not Unicode/ISO 10646 conformant. The 
different conformance levels kind of muck up this basic idea, true, but 
this is how it was meant to be.
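
A minimal sketch of that point, using Python's standard `unicodedata` 
module: the precomposed é (U+00E9) and the sequence e + combining acute 
differ code point by code point, yet every normal form treats them as the 
same character.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT (U+0301)

# The two spellings are distinct as raw code-point sequences...
assert precomposed != decomposed

# ...but canonically equivalent: NFC and NFD map them onto each other.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

Any implementation that distinguishes the two after normalization is, by 
the standard's own definition, treating equal characters as unequal.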

As for compatibility equivalence, it's basically an interim measure: a 
concession to existing character encodings which do carry meaning, and 
to round-tripping between Unicode and those older, stupider encodings. 
It's not something you should espouse when working primarily in Unicode, 
but something you should do away with in favor of explicit tagging. In 
fact, most of the time you should just drop the difference altogether 
without any further tagging and treat compatibility-equivalent 
characters as the same. But if you really, really can't, you should 
still compatibility-decompose and move the semantics onto a higher-level 
protocol, like HTML or whatnot.
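
To illustrate the difference (again via `unicodedata`): the fi ligature 
U+FB01 is a compatibility character, so canonical normalization leaves it 
alone while compatibility normalization folds it away.

```python
import unicodedata

ligature = "\ufb01"   # LATIN SMALL LIGATURE FI, a compatibility character

# Canonical (NFD) normalization preserves the ligature...
assert unicodedata.normalize("NFD", ligature) == ligature

# ...while compatibility (NFKD) normalization folds it to plain "fi",
# discarding the purely presentational distinction.
assert unicodedata.normalize("NFKD", ligature) == "fi"
```

If the ligated rendering actually matters, that's markup's job, not the 
character stream's.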

As such, in the end, what Unicode is supposed to be like in its pure 
form is what follows from putting everything into NFKD. Without 
exception, and raising an exception for ill-formed character encodings 
every time you see something that is not in compliance. If you need 
anything beyond that, you're supposed to relegate it to some higher-level 
protocol, while flat out refusing to input or output anything that isn't 
formally and verifiably in NFKD (i.e. in True Unicode).
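
A sketch of that strict-gatekeeping stance, assuming Python 3.8+ (for 
`unicodedata.is_normalized`); the helper name `require_nfkd` is mine, 
not anything from a spec:

```python
import unicodedata

def require_nfkd(s: str) -> str:
    """Accept a string only if it is already in NFKD; raise otherwise.

    This is the 'refuse non-normalized input outright' policy, rather
    than silently normalizing on the caller's behalf.
    """
    if not unicodedata.is_normalized("NFKD", s):
        raise ValueError(f"input is not in NFKD: {s!r}")
    return s

require_nfkd("fi")       # plain ASCII passes unchanged
# require_nfkd("\ufb01") would raise: the fi ligature is not in NFKD
```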

> It could even introduce some visual oddities, such as the character 
> U+00BC (vulgar fraction one quarter) becoming the sequence "1/4" 
> (albeit the / is not %2F it is U+2044 FRACTION SLASH)

That is then by design: that sort of thing isn't part of the character 
model, but of how characters might be used as part of some higher-level 
protocol/syntax, such as MathML or whatnot. Fractions and the like do 
not belong in Unicode, and the only reason they have been allowed into 
it is as an interim blemish, hopefully soon to go away for good.
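
For the record, the decomposition Addison describes is exactly what 
`unicodedata` reports for U+00BC:

```python
import unicodedata

quarter = "\u00bc"   # VULGAR FRACTION ONE QUARTER

# NFKD expands it to the three-character sequence 1 + FRACTION SLASH + 4.
# Note the slash is U+2044 FRACTION SLASH, not ASCII "/" (so it would not
# percent-encode as %2F in a URI).
expanded = unicodedata.normalize("NFKD", quarter)
assert expanded == "1\u20444"
assert len(expanded) == 3
```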

If NFKD leads to "visual oddities", it's because your software for some 
reason doesn't implement the proper higher level protocol correctly, 
and/or misunderstands what Unicode is about.

> [...] The main reason to have normalization for templates would appear 
> to me to be the normalization of character sequences in a variable 
> name. [...]

To me it seems there is a definite disconnect between how the 
Unicode/ISO folks think about the character model and how it is being 
used in practice. If the original intent behind the character model were 
the real aim, we wouldn't have these sorts of discussions in the first 
place. We'd only wonder about how to deal with NFKD, with its 
unorthodox, open-ended suffix form. That could then be tackled by purely 
technical means, without these kinds of policy debates, even if it led 
to some rather nasty string-parsing code in the process.
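
What I mean by the "open-ended suffix form": under normalization, any 
number of combining marks may trail a base character, and the Canonical 
Ordering Algorithm sorts that suffix by combining class. A small demo:

```python
import unicodedata

# Two combining dots attached to "q", typed in "visual" order:
# dot above (combining class 230) before dot below (combining class 220).
typed = "q\u0307\u0323"

# Normalization reorders the suffix by canonical combining class,
# keeping marks with equal classes in their original (stable) order.
assert unicodedata.normalize("NFD", typed) == "q\u0323\u0307"

# The classes driving that reordering:
assert unicodedata.combining("\u0323") == 220   # COMBINING DOT BELOW
assert unicodedata.combining("\u0307") == 230   # COMBINING DOT ABOVE
```

Parsing such text means scanning an unbounded run of marks after each 
base character, which is where the nasty string-parsing code comes from.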

> It might be better to just handle sequences that don't match as not 
> matching (e.g. the user is responsible for normalization) or perhaps 
> referencing UAX#31 http://www.unicode.org/reports/tr31/ on what makes 
> a valid identifier. Note that normalization does not eliminate the 
> potential for problems such as combining marks to start a sequence.

Such things are ungrammatical with respect to Unicode, so I'd say just 
fail gracefully on them. Beyond that, either a) fail on any NFKD 
violation in either comparand, and after that on any bitwise or 
lengthwise mismatch, or (more usually) b) always normalize to strict, 
formal NFKD and fail upon the first unmatched bit and/or string-length 
mismatch. That's how it's supposed to work, it's easier/cheaper to 
implement than most alternatives, and as a matter of fact it already 
shields you from half of the homograph attacks which things like 
stringprep try to defend against. Not to mention all of the 
Unicode-specific attacks...
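
Strategy b) above can be sketched in a few lines; `nfkd_equal` is a 
hypothetical helper name, not anything from the template spec:

```python
import unicodedata

def nfkd_equal(a: str, b: str) -> bool:
    """Normalize both comparands to NFKD, reject ungrammatical input
    (a sequence starting with a combining mark), then fall back to a
    plain code-point-for-code-point comparison.
    """
    na, nb = (unicodedata.normalize("NFKD", s) for s in (a, b))
    for s in (na, nb):
        if s and unicodedata.combining(s[0]) != 0:
            raise ValueError("sequence starts with a combining mark")
    # After normalization, equality is an exact bitwise/lengthwise match.
    return na == nb

assert nfkd_equal("\u00e9", "e\u0301")   # é in two spellings: equal
assert nfkd_equal("\uff41", "a")         # fullwidth a folds to plain a
assert not nfkd_equal("a", "\u0430")     # Latin a vs Cyrillic а: still distinct
```

Note the last line: normalization catches compatibility-variant homographs 
like the fullwidth forms, but cross-script lookalikes stay distinct, which 
is why I say it shields you from only half of such attacks.
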
-- 
Sampo Syreeni, aka decoy - decoy@iki.fi, http://decoy.iki.fi/front
+358-50-5756111, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2
Received on Friday, 15 July 2011 05:45:53 GMT
