Re: URI and IRI Templating (What did I get myself into?) from Benjamin Carlyle on 2007-01-01 (uri@w3.org from January 2007)

From: Benjamin Carlyle <benjamincarlyle@optusnet.com.au>
Date: Tue, 02 Jan 2007 08:45:36 +1000
To: uri@w3.org
Message-Id: <1167691537.6338.78.camel@localhost.localdomain>

Sorry about coming into this late, but...

Joe Gregorio wrote:
> 1. Escape all 'reserved' characters except @, :, and /
>    across every component, realizing
>    we may not end up with a valid URI.
> 2. Escape all 'reserved' characters except @, and :,
>    realizing that our 'path' example
>    will then break since '/' will get escaped.
> 3. Escape all 'reserved' characters except @, :, and /,
>    but only allow template variables in path, query and
>    fragment components.

4. Require/allow the context to perform any necessary escaping, e.g. by
requiring that appropriate javascript functions have been called on the
parameter values

5. Require/allow the template to specify any necessary escaping

This specification is at an interesting point in the uri construction
chain. Normally a url is either captured whole or built up from
parts. Whichever part of the uri a parameter is inserted into defines
the escaping that needs to occur.

http://example.com/query?a={b} where b="d&e=f" should escape "&" and "="
if b is going to be used as the value of a. If b is just a regular part
of the query component, however, escaping these characters may be
inappropriate. For example, http://example.com/query?{b} might be used
to substitute a whole query component. http://example.com{b} might be
used to substitute a path and query component.
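
To make the difference concrete, here is a minimal sketch in Python (the
language choice and the helper name fill are assumptions for illustration;
nothing in the thread prescribes them). The same value b is escaped
differently depending on whether it is a parameter value or a whole query
component, and a hypothetical path-plus-query value is left almost untouched.

```python
from urllib.parse import quote

def fill(prefix, value, keep=""):
    """Append value to prefix, percent-encoding everything except
    unreserved characters and the characters listed in keep."""
    return prefix + quote(value, safe=keep)

b = "d&e=f"

# b is the value of parameter a: "&" and "=" must be escaped.
as_value = fill("http://example.com/query?a=", b)

# b is the whole query component: "&" and "=" keep their meaning.
as_query = fill("http://example.com/query?", b, keep="&=")

# A hypothetical value covering path and query: delimiters are kept.
as_path_and_query = fill("http://example.com", "/query?a=1", keep="/?&=")

print(as_value)           # http://example.com/query?a=d%26e%3Df
print(as_query)           # http://example.com/query?d&e=f
print(as_path_and_query)  # http://example.com/query?a=1
```

The keep sets are chosen by hand here, which is exactly the knowledge a
client does not have when it is handed only the template.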

The problem of course is that the client does not know the intent of the
template producer. It is probably not a good idea for the client to
guess, which leaves explicit direction as part of the template or a
general rule that covers 80% of useful cases. As mentioned by Joe
earlier in the thread, the server could specify the character encoding
style using a limited vocabulary. It might otherwise be possible to list
either an explicit set of characters to encode or an explicit set of
characters that were safe. Something like
http://example.com/query?a={b:&=#} might specify that "&" and "=" need
escaping in addition to the normal escaping for characters in a
query+fragment component. http://example.com/query?a={b:query:&=} might
make the "this is a substitution for the query component" intent clearer
while still specifying the additional characters.
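
A rough sketch of how such an annotation could be processed, in Python.
Everything here is an assumption for illustration: the {name:component:chars}
grammar is only the idea floated above, the per-component safe sets are a
loose reading of RFC 3986, and an unannotated variable is treated as a query
substitution.

```python
import re
from urllib.parse import quote

# Characters left unescaped per component (assumed, loosely after RFC 3986).
SUB_DELIMS = "!$&'()*+,;="
COMPONENT_SAFE = {
    "path": SUB_DELIMS + ":@/",
    "query": SUB_DELIMS + ":@/?",
    "fragment": SUB_DELIMS + ":@/?",
}

# {name}, {name:chars}, {name:component} or {name:component:chars}
VARIABLE_RE = re.compile(
    r"\{(\w+)(?::(path|query|fragment))?(?::([^{}]*))?\}"
)

def expand(template, values):
    """Fill in a template, escaping each value for its component plus
    any extra characters the template asks to have escaped."""
    def substitute(match):
        name = match.group(1)
        component = match.group(2) or "query"  # assumed default
        extra = match.group(3) or ""
        # Extra characters are removed from the safe set, so they get escaped.
        safe = "".join(c for c in COMPONENT_SAFE[component] if c not in extra)
        return quote(values[name], safe=safe)
    return VARIABLE_RE.sub(substitute, template)

values = {"b": "d&e=f"}
print(expand("http://example.com/query?a={b:&=#}", values))
# http://example.com/query?a=d%26e%3Df
print(expand("http://example.com/query?a={b:query:&=}", values))
# http://example.com/query?a=d%26e%3Df
print(expand("http://example.com/query?{b:query}", values))
# http://example.com/query?d&e=f
```

In this sketch "#" is already outside the query safe set, so listing it in
the template changes nothing. A set of characters that happened to spell a
component name would be ambiguous, which is one reason a real syntax would
need more thought.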

What you would essentially be looking for is a language or a vocabulary
to indicate what part of the url is being substituted by this variable
in the template. This would be straightforward for components such as
"scheme", "path", or "query" and may be able to be implied by context.
However, uris may have forms of domain-specific construction that cannot
easily be expressed in a singular vocabulary. This would require a
mechanism for specifying additional constraints once blanket rules and
vocabulary run out.

Starting out with a blanket rule and seeing whether problems emerge in
practice is probably the best idea. If problems do emerge, however, it
may be worth keeping a language for identifying the part of the uri
being substituted in the back of your mind. I suspect that a blanket
rule won't cover all of my use cases.

Mark Nottingham wrote:
> Your proposal puts the encoding information into the variable name.  
> That's one option, but I'm reluctant to encourage putting this kind  
> of thing in there, as it encourages URI Templates to become URI  
> Schemas, and they'll quickly become unreadable. Encoding is by no  
> means the last thing we'll want to associate with a particular variable.

Are you talking about a separate document associated with a template
that fills out additional information that might be of use?

Jerome Louvel wrote:
> 5. Don't escape any character, leaving this task to the
>     application converting the template to valid URIs.
> My preference goes for #5.

Leaving all escaping to the client is a tempting alternative. It was
at the top of my list on my first pass at this. A client context such as
a javascript environment is likely to already have appropriate
capabilities to perform any encoding. However the client won't know
which part of the url it is filling out. It can't make sound judgements
without additional information. If the client knew enough to judge
soundly, it wouldn't need a uri template in the first place.

> As an aside, it turns out that the regular expression given in
> Appendix B of RFC 3986 is completely capable of
> parsing up URI Templates, but only if the characters
> allowed in template variable names are restricted, and
> only if template variables are not allowed to span
> components.

I'm in two minds about this. It's a potentially useful feature to be
able to do generic uri parsing on a template. However, I don't think it
is important enough to make sure we keep the feature. It could be used
to identify the characters to be escaped if we found a template
parameter within a particular component. However this could be done by
running the regex on a version of the url which had the identifiers
stripped out. In general, I think that parsing will happen on actual
urls rather than url templates. As such, it will probably be more useful
to allow a fair range of expressiveness in the identifier to allow
little javascript invocations and the like to be included.
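
The stripping approach can be sketched as follows in Python. The regular
expression is the one from Appendix B of RFC 3986; the function name
variable_components and the choice to leave an empty "{}" marker where each
identifier was are assumptions made for this illustration.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)
# Group numbers of each component in URI_RE, in order of appearance.
COMPONENT_GROUPS = {
    "scheme": 2, "authority": 4, "path": 5, "query": 7, "fragment": 9,
}
VARIABLE_RE = re.compile(r"\{([^{}]*)\}")

def variable_components(template):
    """Return (identifier, component) pairs, in template order."""
    names = iter(VARIABLE_RE.findall(template))
    # Strip the identifiers so that their contents cannot confuse the regex.
    stripped = VARIABLE_RE.sub("{}", template)
    match = URI_RE.match(stripped)
    pairs = []
    for component, group in COMPONENT_GROUPS.items():
        text = match.group(group) or ""
        # Each "{}" marker left in this component is one variable.
        for _ in range(text.count("{}")):
            pairs.append((next(names), component))
    return pairs

print(variable_components("http://example.com/query?a={b}"))
# [('b', 'query')]
print(variable_components("http://example.com/{a/b?c}/x?q={d#e}"))
# [('a/b?c', 'path'), ('d#e', 'query')]
print(variable_components("http://example.com{b}"))
# [('b', 'authority')]
```

The last call shows the limit of the approach: a variable meant to span path
and query is attributed to the authority, because the regex can only see
where the marker sits, not what the template producer intended.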

On the other hand, just recognising a URL in context may be sufficient
reason to require url-like content in the identifier. For example, if
the identifier contained a space character it may be difficult to pick
where the template ended. This suggests to me that at least some
restrictions should be applied.
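
A small Python sketch of the recognition problem. The pattern below is a
deliberately naive assumption (a url in running text ends at the first
whitespace character), not a rule taken from any specification.

```python
import re

# Naive assumption: a url in running text runs up to the next whitespace.
URL_IN_TEXT = re.compile(r"https?://\S+")

def find_url(text):
    """Return the first url-like run in text, or None if there is none."""
    match = URL_IN_TEXT.search(text)
    return match.group(0) if match else None

print(find_url("See http://example.com/{user}/posts for details"))
# http://example.com/{user}/posts
print(find_url("See http://example.com/{user name}/posts for details"))
# http://example.com/{user
```

With a space inside the identifier, the template is cut off in the middle of
the variable.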

Benjamin
Received on Tuesday, 2 January 2007 14:42:13 UTC