RE: URI Templates: done or dead? from Phillips, Addison on 2008-09-16 (uri@w3.org from September 2008)

From: Phillips, Addison <addison@amazon.com>
Date: Tue, 16 Sep 2008 16:38:17 -0700
To: "William A. Rowe, Jr." <wrowe@rowe-clan.net>, John Cowan <cowan@ccil.org>
CC: "Roy T. Fielding" <fielding@gbiv.com>, Mark Nottingham <mnot@mnot.net>, URI <uri@w3.org>, Joe Gregorio <joe@bitworking.org>, David Orchard <orchard@pacificspirit.com>, Marc Hadley <Marc.Hadley@Sun.COM>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA014BE2CA73@EX-SEA5-D.ant.amazon.com>

> This way some of Roy's observations with respect to a defined
> normalization form are honored.

I note: the current draft (-03) has a bit that concerns me in Section 4.4, where it says:

   Before substitution the template processor MUST convert every
   variable value into a sequence of characters in ( unreserved / pct-
   encoded ).  The template processor does that using the following
   algorithm: The template processor normalizes the string using NFKC,
   converts it to UTF-8 [RFC3629], and then every octet of the UTF-8
   string that falls outside of ( unreserved ) MUST be percent-encoded,
   as per [RFC3986], section 2.1.  For variables that are lists, the
   above algorithm is applied to each value in the list.

It should not say "NFKC" there, I think. Form KC removes a large number of textual distinctions that are inappropriate to remove in a URI context. Form KC should be used rarely and carefully.
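
To make the concern concrete, here is a minimal Python sketch of the quoted algorithm (normalize, convert to UTF-8, percent-encode everything outside unreserved). The helper name encode_value and its form parameter are hypothetical, introduced only for illustration; they are not part of the draft.

```python
import unicodedata

# RFC 3986 "unreserved" characters, as ASCII octets.
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)

def encode_value(value, form="NFKC"):
    """Hypothetical helper: optionally normalize, UTF-8 encode, percent-encode."""
    if form is not None:
        value = unicodedata.normalize(form, value)
    return "".join(
        chr(octet) if octet in UNRESERVED else "%{:02X}".format(octet)
        for octet in value.encode("utf-8")
    )

# U+FB01 LATIN SMALL LIGATURE FI and U+00B2 SUPERSCRIPT TWO
print(encode_value("\ufb01le"))             # file
print(encode_value("\ufb01le", form=None))  # %EF%AC%81le
print(encode_value("x\u00b2"))              # x2
print(encode_value("x\u00b2", form="NFC"))  # x%C2%B2
```

With Form KC the ligature and the superscript are folded into plain ASCII before encoding, so two different variable values can yield the same URI; without normalization, or with Form C, they stay distinct.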

It is useful to note that IRI (RFC 3987) does NOT require either Form KC or Form C when mapping from IRI to URI, and that omission is by design. Although it is desirable to produce a consistent normalized form (Form C is typically recommended for this), there do exist cases in which normalization is not appropriate (it produces inappropriate reordering of combining marks, etc.). I think the text in URI Templates for converting strings to URI form would be best if it started from, or were identical in result to, the rules in IRI (see Section 3.1).
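
As a rough illustration of the difference, the sketch below percent-encodes a value with no normalization step and, for comparison, with Form C applied first. The function name to_uri_chars is hypothetical, and the code only approximates the character-to-octet step; it is not a full implementation of the RFC 3987 mapping.

```python
import unicodedata
from urllib.parse import quote

def to_uri_chars(value, form=None):
    """Hypothetical helper: UTF-8 percent-encoding, normalization optional."""
    if form is not None:
        value = unicodedata.normalize(form, value)
    # safe="" leaves only letters, digits and "-._~" unescaped.
    return quote(value, safe="")

# "q" followed by COMBINING ACUTE (U+0301) then COMBINING DOT BELOW (U+0323).
marks = "q\u0301\u0323"
print(to_uri_chars(marks))              # q%CC%81%CC%A3  (order preserved)
print(to_uri_chars(marks, form="NFC"))  # q%CC%A3%CC%81  (marks reordered)

# Decomposed "e" + U+0301 versus what Form C turns it into.
print(to_uri_chars("e\u0301"))              # e%CC%81
print(to_uri_chars("e\u0301", form="NFC"))  # %C3%A9
```

Whether the reordered or composed output is acceptable depends on what the server on the other end expects, which is the reason to leave the choice to the application rather than mandate it in the template processor.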

I should note that IRI deals with encoding a complete IRI to a complete URI. URI Templates insert text into URIs and this introduces the additional complexity of include-normalization. See http://www.w3.org/TR/charmod-norm/#sec-IncludeNormalized. 
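
A small string-level sketch of the include-normalization problem follows. The names is_nfc and starts_with_combining are hypothetical, and testing for a non-zero canonical combining class is only a rough approximation of the "composing character" notion used in the charmod-norm document.

```python
import unicodedata

def is_nfc(text):
    """True if the text is unchanged by Form C normalization."""
    return unicodedata.normalize("NFC", text) == text

def starts_with_combining(value):
    """Rough check: does the value begin with a combining mark?"""
    return bool(value) and unicodedata.combining(value[0]) != 0

prefix = "cafe"      # literal text from the template
value = "\u0301s"    # variable value starting with COMBINING ACUTE ACCENT

print(is_nfc(prefix), is_nfc(value))   # True True
print(is_nfc(prefix + value))          # False
print(starts_with_combining(value))    # True
```

Both pieces are normalized on their own, yet the concatenation is not, because the leading combining mark attaches to the last character of the template text. Percent-encoding the value hides this at the URI level, but it reappears as soon as the octets are decoded back into characters.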

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.

> -----Original Message-----
> From: William A. Rowe, Jr. [mailto:wrowe@rowe-clan.net]
> Sent: Tuesday, September 16, 2008 4:08 PM
> To: John Cowan
> Cc: Phillips, Addison; Roy T. Fielding; Mark Nottingham; URI;
> Joe Gregorio; David Orchard; Marc Hadley
> Subject: Re: URI Templates: done or dead?
> 
> John Cowan wrote:
> > Phillips, Addison scripsit:
> >
> >> We have pretty good knowledge of what makes a good Unicode
> >> identifier. If we're going to assign variable names in a new pattern
> >> language, why are we limiting it to alphanum? The software we are
> >> linking to (the part generating the variables that get substituted in)
> >> may not--indeed probably does not--have that same limitation.
> >
> > Given that URIs are ASCII-only, I think it is sufficient to have
> > identifiers be ASCII-only too.
> 
> Actually, I thought they were opaque bytestreams wrapped in ASCII, e.g.
> %80 or %FF in a URI should be valid in the resource path, no?
> 
> I'm wondering why templates don't consider implementation in terms of
> RFC 3987, or at least ensure IRI compatibility, for protocols or use
> cases which desire it.  This way some of Roy's observations with respect
> to a defined normalization form are honored.
> 
> I'm unconcerned with the variable names being i18n, the application
> author determines these.  It's their values that ultimately concern
> me :)
>

Received on Tuesday, 16 September 2008 23:39:07 UTC