RE: reviewing draft-weber-iri-guidelines-00 from Phillips, Addison on 2011-07-06 (public-iri@w3.org from July 2011)

From: Phillips, Addison <addison@lab126.com>
Date: Wed, 6 Jul 2011 07:46:20 -0700
To: Chris Weber <chris@lookout.net>, "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <131F80DEA635F044946897AFDA9AC3476A94126E82@EX-SEA31-D.ant.amazon.com>

>   For parsing and interpreting the IRI MUST NOT be normalized, but for
> equivalence testing NFC MUST be used but only in the exception cases like a
> conversion from non-UCS to UCS...  However, it does go on to suggest that
> when IRIs are created they should be normalized with NFC or NFKC.  Do you
> agree with that?

There may be application-specific cases in which NFKC is useful, but I don't think the IRI document should recommend NFKC for general-purpose use in an IRI. Too few developers understand the distinction between NFC and NFKC.
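
To illustrate the distinction, here is a minimal sketch using Python's standard `unicodedata` module. The sample string is made up for illustration and does not come from the draft. NFC leaves compatibility characters alone, while NFKC folds them into their plain equivalents, which changes what the author actually typed.

```python
import unicodedata

# Hypothetical path segment: U+FB01 (LATIN SMALL LIGATURE FI) + "le" + U+2460 (CIRCLED DIGIT ONE)
segment = "\uFB01le\u2460"

nfc = unicodedata.normalize("NFC", segment)
nfkc = unicodedata.normalize("NFKC", segment)

# NFC keeps the compatibility characters; NFKC replaces them.
print(nfc == segment)   # NFC left the string unchanged
print(nfkc)             # the ligature and circled digit are folded to ASCII
```

Whether that folding is desirable depends on the application, which is why it seems a poor general-purpose default.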

I *do* agree with the recommendation that IRI "authors" create IRIs in NFC. Doing so eliminates problems that users might otherwise encounter by ensuring a consistent representation. It's important that, if IRI processing does not include a normalization step, users be aware of how to avoid problems that diverse-but-canonically-equivalent Unicode code point sequences create.
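
As a rough sketch of the problem, assuming a consumer that compares IRIs code point by code point with no normalization step, two canonically equivalent spellings of the same path fail to match until both are put into NFC. The paths below are invented examples.

```python
import unicodedata

# Two canonically equivalent spellings of the same (hypothetical) path.
precomposed = "/caf\u00E9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "/cafe\u0301"    # "e" followed by U+0301 COMBINING ACUTE ACCENT

# A plain code point comparison treats them as different IRIs.
same_without_nfc = precomposed == decomposed

# If the author had created the IRI in NFC, the two would compare equal.
same_with_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

print(same_without_nfc, same_with_nfc)
```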

> 
> > Also:
> >
> >     4.  Replace each entity references with its corresponding character.
> >
> >
> > This can't be done until /after/ the fields of an IRI are parsed out.
> > Example: in a path, you don't want an escaped / or # or ? to be
> > transformed until after you've parsed out the path.
> >
> 
> Ah of course, that would be a big problem :)
> 
See my previous comment, though. You have to unescape the string if you're going to normalize it, since the escaped characters could affect the string's normalization state. Again, I would suggest (if you don't remove the normalization) that the unescaping be limited to iunreserved, which avoids this problem.
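
One way this suggestion might look in practice is sketched below. It is an assumption-laden illustration, not text from the draft: the function names are hypothetical, the `ucschar` ranges are copied from the RFC 3987 grammar, and escape runs that are not valid UTF-8 are simply left alone. Escapes that decode to `iunreserved` characters are unescaped before NFC is applied; everything else (such as an escaped `/`, `#` or `?`) stays percent-encoded, so the parsed structure of the IRI cannot change.

```python
import re
import unicodedata

# ucschar ranges per the RFC 3987 grammar (planes 1-13 end at xFFFD, plus E1000-EFFFD).
_UCSCHAR = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 14)]
    + [(0xE1000, 0xEFFFD)]
)

_ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def is_iunreserved(ch):
    """iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar"""
    if ch.isascii():
        return ch.isalnum() or ch in "-._~"
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _UCSCHAR)


def _decode_run(match):
    run = match.group(0)
    raw = bytes.fromhex(run.replace("%", ""))
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return run  # assumption: leave non-UTF-8 escape runs untouched
    out = []
    for ch in text:
        if is_iunreserved(ch):
            out.append(ch)
        else:
            # Re-emit the escape (upper-case hex) so delimiters stay escaped.
            out.append("".join("%%%02X" % b for b in ch.encode("utf-8")))
    return "".join(out)


def unescape_iunreserved(iri):
    """Unescape only those percent-escapes that decode to iunreserved characters."""
    return _ESCAPE_RUN.sub(_decode_run, iri)


def normalize_for_comparison(iri):
    """Hypothetical helper: limited unescaping first, then NFC."""
    return unicodedata.normalize("NFC", unescape_iunreserved(iri))


# The escaped "/" and "?" survive; the escaped e-acute is unescaped.
print(unescape_iunreserved("/a%2Fb/caf%C3%A9%3Fx"))
# An escaped combining accent (U+0301) only composes once it is unescaped.
print(normalize_for_comparison("/cafe%CC%81") == "/caf\u00E9")
```

The second call shows why unescaping has to precede normalization: the escaped combining mark is invisible to NFC until it is decoded.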

Addison

Received on Wednesday, 6 July 2011 14:46:45 UTC