- From: William A. Rowe, Jr. <wrowe@rowe-clan.net>
- Date: Tue, 16 Sep 2008 19:28:07 -0500
- To: "Phillips, Addison" <addison@amazon.com>
- CC: John Cowan <cowan@ccil.org>, "Roy T. Fielding" <fielding@gbiv.com>, Mark Nottingham <mnot@mnot.net>, URI <uri@w3.org>, Joe Gregorio <joe@bitworking.org>, David Orchard <orchard@pacificspirit.com>, Marc Hadley <Marc.Hadley@Sun.COM>
Phillips, Addison wrote:
>> This way some of Roy's observations with respect
>> to a defined normalization form are honored.
> 
> I note: the current draft (-03) has a bit that concerns me in Section 4.4, where it says:
> 
>    Before substitution the template processor MUST convert every
>    variable value into a sequence of characters in ( unreserved / pct-
>    encoded ).  The template processor does that using the following
>    algorithm: The template processor normalizes the string using NFKC,
>    converts it to UTF-8 [RFC3629], and then every octet of the UTF-8
>    string that falls outside of ( unreserved ) MUST be percent-encoded,
>    as per [RFC3986], section 2.1.  For variables that are lists, the
>    above algorithm is applied to each value in the list.
> 
> It should not say "NFKC" there, I think. Form KC removes a large number of textual distinctions that are inappropriate to remove in a URI context. Form KC should be used rarely and carefully.
You are correct; NFC is appropriate here...
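To make the lost distinctions concrete, here is a small Python sketch (my own illustration, not text from the draft) comparing the two forms with the standard unicodedata module. The helper name forms() and the sample characters are arbitrary choices.

```python
import unicodedata

def forms(s):
    """Return the (NFC, NFKC) normalizations of s as a tuple."""
    return unicodedata.normalize("NFC", s), unicodedata.normalize("NFKC", s)

# U+FB01 LATIN SMALL LIGATURE FI: NFC keeps it, NFKC folds it to "fi".
ligature = forms("\ufb01")

# U+00B2 SUPERSCRIPT TWO: NFC keeps it, NFKC folds it to a plain "2".
superscript = forms("\u00b2")

# "e" + U+0301 COMBINING ACUTE ACCENT: both forms compose it to U+00E9.
composed = forms("e\u0301")
```

In the first two cases NFKC changes what the string says; in the third the two forms agree, which is the only kind of change we want in a URI.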
> It is useful to note that IRI (RFC 3987) does NOT require either Form KC or Form C in mapping from IRI->URI and does not by design. Although it is desirable to produce a consistent normalized form (typically Form C is recommended for this), there do exist cases in which normalization is not appropriate (it produces inappropriate reordering of combining marks, etc.). I think the reference in URI Templates for converting strings to URI would be best if it started from or were identical in result to the rules in IRI (see section 3.1). 
   3.1 [Step 1]...
            a. If the IRI is written on paper, read aloud, or otherwise
                represented as a sequence of characters independent of
                any character encoding, represent the IRI as a sequence
                of characters from the UCS normalized according to
                Normalization Form C (NFC, [UTR15]).
Although in the other cases the server performs no normalization, this
clause effectively states that NFC is the only normalization form that
needs to be honored.
Which means NFKC is truly out of place in the -03 draft.
Similarly, if the template processor relies on user input, that input
should be normalized to NFC per RFC 3987, while if it's a machine value,
it should have been produced in NFC form in the first place.
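As a rough sketch of the Section 4.4 algorithm with NFC substituted for NFKC (assuming Python; encode_value and UNRESERVED are hypothetical names of mine, and the form parameter exists only so the two behaviours can be compared):

```python
import unicodedata

# RFC 3986 "unreserved" set: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)

def encode_value(value, form="NFC"):
    """Normalize value, convert to UTF-8, and percent-encode every
    octet that falls outside the unreserved set."""
    normalized = unicodedata.normalize(form, value)
    return "".join(
        chr(octet) if octet in UNRESERVED else "%{:02X}".format(octet)
        for octet in normalized.encode("utf-8")
    )
```

With this, encode_value("\u00b2") keeps the superscript two as "%C2%B2", whereas passing form="NFKC" silently turns it into "2". For a list variable, the same function would be applied to each value in the list.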
Received on Wednesday, 17 September 2008 00:28:51 UTC