- From: Phillips, Addison <addison@lab126.com>
- Date: Wed, 6 Jul 2011 07:46:20 -0700
- To: Chris Weber <chris@lookout.net>, "public-iri@w3.org" <public-iri@w3.org>
> For parsing and interpreting, the IRI MUST NOT be normalized, but for
> equivalence testing NFC MUST be used, but only in the exception cases like a
> conversion from non-UCS to UCS... However, it does go on to suggest that
> when IRIs are created they should be normalized with NFC or NFKC. Do you
> agree with that?

There may be application-specific cases in which KC is useful, but I don't think the IRI document should recommend NFKC for general-purpose use in an IRI. Too few developers understand the distinction.

I *do* agree with the recommendation that IRI "authors" create IRIs in NFC. Doing so eliminates problems that users might otherwise encounter by ensuring a consistent representation. It's important that, if IRI processing does not include a normalization step, users be aware of how to avoid the problems that diverse-but-canonically-equivalent Unicode code point sequences create.

> > > Also:
> > > > 4. Replace each entity reference with its corresponding character.
> > >
> > > This can't be done until /after/ the fields of an IRI are parsed out.
> > > Example: in a path, you don't want an escaped / or # or ? to be
> > > transformed until after you've parsed out the path.
> >
> > Ah, of course, that would be a big problem :)

See my previous comment, though. You have to unescape the string if you're going to normalize it, since the escaped characters could affect the string's normalization state. Again, I would suggest (if you don't remove the normalization) that the unescaping be limited to iunreserved, which avoids this problem.

Addison
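(A minimal sketch of the escaping/normalization interaction described above, using Python's standard library; the example string is hypothetical and not from the thread.)

```python
import unicodedata
from urllib.parse import unquote

# 'e' followed by the percent-escaped UTF-8 bytes of U+0301 COMBINING ACUTE ACCENT.
escaped = "e%CC%81"

# Normalizing before unescaping does nothing: the string is pure ASCII,
# so NFC leaves it unchanged and the combining mark stays hidden.
assert unicodedata.normalize("NFC", escaped) == escaped

# After unescaping, the combining sequence is visible to the normalizer,
# and NFC composes 'e' + U+0301 into the single code point U+00E9 ('é').
unescaped = unquote(escaped)            # 'e\u0301'
normalized = unicodedata.normalize("NFC", unescaped)
assert normalized == "\u00e9"
```

So normalizing the still-escaped string and normalizing the unescaped string give different results, which is why unescaping has to happen first if normalization is performed at all.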
Received on Wednesday, 6 July 2011 14:46:45 UTC