URI Templates, percent-encoding, bnfs and working code from Joe Gregorio on 2007-10-13 (uri@w3.org from October 2007)

From: Joe Gregorio <joe@bitworking.org>
Date: Fri, 12 Oct 2007 22:22:51 -0400
To: uri@w3.org
Message-ID: <3f1451f50710121922p6fbb5393qc2d66c7c555e2ea0@mail.gmail.com>

Warning: LONG, but by the end I get to real life code, so stick with me.

The last draft for URI Templates was published back in July,
and to be honest I haven't been very happy with it. We've struggled with
percent-encoding of reserved characters for a very long time,
and the latest draft doesn't really resolve the problem.

   During substitution, the string value of a template variable MUST
   have any characters that do not match the reserved or unreserved
   rules (i.e., those characters not legal in URIs without percent
   encoding) percent-encoded, as per [RFC3986], section 2.1.  Specific
   applications of URI Templates MAY specify additional constraints and
   encoding rules in addition to this.

This is unsatisfactory for a lot of reasons, mostly related to how functional
the spec actually is. There is a large set of cases I think URI Templates can
be used for, and I don't think the very simple templating mechanism defined
covers nearly enough of them. I also think that the story around
percent-encoding is hopelessly mired. Here are some examples
that I hope show that the current {var} system is inadequate.

Here is a URI Template for my own blog:

    URI Template
        http://bitworking.org/news/{entry}
    Template Variable(s)
        entry := 'RESTLog_Specification'
    URI
        http://bitworking.org/news/RESTLog_Specification

The problem is that a year ago I changed the URI structure for new
posts going forward, while old posts keep the same structure. Here
is an example from a newer entry:

    URI Template
        http://bitworking.org/news/{entry}
    Template Variable(s)
        entry := '240/Newsqueak'
    URI
        http://bitworking.org/news/240/Newsqueak

On the other hand, if I wanted to search for this post on
Technorati I would want:

    URI Template
        http://technorati.com/search/{term}
    Template Variable(s)
        term := '240/Newsqueak'
    URI
        http://technorati.com/search/240%2FNewsqueak

Right off the bat we have a problem with percent-encoding
and reserved characters. Here are some more examples
with '&' and '=' characters:

    URI Template
        http://www.google.com/search?q={term}
    Template Variable(s)
        term := ben&jerrys
    URI
        http://www.google.com/search?q=ben&jerrys

Failing to percent-encode will get you the wrong results.

    URI Template
        http://www.google.com/search?q={term}
    Template Variable(s)
        term := 2+2=5
    URI
        http://www.google.com/search?q=2%2B2%3D5

    URI Template
        http://www.google.com/base/feeds/snippets/?{author}
    Template Variable(s)
        author := author=joe.gregorio@gmail.com
    URI
        http://www.google.com/base/feeds/snippets/?author%3djoe.gregorio%40gmail.com


From the above you can see how not percent-encoding or over
percent-encoding can change the meaning of a URI. I'm sure we could
all construct pathological cases where any reserved character
should and should not be percent-encoded.

Not only is the percent-encoding not a solved problem, there
are moderately complex cases that we can't approach at all.

For example, in this fairly common search:

    http://www.google.com/search?q={term}&num={n}

the query parameter num is optional. How do we show that?

In addition, what about this combination:

    http://www.google.com/search?q={term}
    term := Îñţérñåţîöñåļîžåţîöñ

What character encoding do we use?

And here is an even more complex example from
GData. Paths of some URIs are actually boolean
logic filters on the categories of elements
returned in the feed:

    URI Template
        /-/{category}/
    Category Rules
        OR      fred|barney
        AND     fred/barney
        NOT     -fred
    Example
        /-/A%7C-B/-C   means   (A OR (NOT B)) AND (NOT C)

So I hope I've given enough examples to show that the
cause is completely hopeless at least for the simple
{var} expansion.

Roy suggested a bash-inspired substitution rule set:

{=default:variable}
   If variable is defined and non-empty, then substitute the
   value of variable. Otherwise, substitute with the default value:

   E.g., {=red:favoritecolor} = "value" or "red"

{?prefix:variable}
   If variable is defined and non-empty, then substitute the
   string of non-colon characters between the '?' and ':', if
   any, followed by the value of variable. Otherwise,
   substitute with the empty string.
   E.g.,
   {?/:variable} = "/value" or ""
   {?;name=:variable} = ";name=value" or ""
   {?#:variable} = "#value" or ""
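
To make the two rules above concrete, here is a minimal Python sketch.
The function name expand_bash_style and the regular expression are my
own illustration, not part of Roy's suggestion, and the sketch does no
percent-encoding at all; it only shows the default and prefix logic.

```python
import re

# {=default:variable} or {?prefix:variable}; the text between the operator
# and the ':' may not itself contain ':' (hypothetical reading of the rules).
RULE = re.compile(r"\{([=?])([^:}]*):([^}]+)\}")

def expand_bash_style(template, values):
    """Expand the two bash-inspired forms using the dict `values`."""
    def substitute(match):
        op, text, name = match.groups()
        value = values.get(name)
        if op == "=":
            # Default form: use the value if defined and non-empty.
            return value if value else text
        # Prefix form: emit prefix + value, or nothing at all.
        return text + value if value else ""
    return RULE.sub(substitute, template)

print(expand_bash_style("{=red:favoritecolor}", {}))
print(expand_bash_style("{?;name=:variable}", {"variable": "value"}))
```

With these inputs the first call falls back to the default and the
second emits the prefix followed by the value.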

Now notice how this expansion adds reserved characters
into the URI. So I took Roy's expansions and tried an experiment:

   What if the *only* way to get a reserved character into
   the final URI was through templating expansions?

This points out the crux of the problem and the solution.
Einstein said "Make everything as simple as possible, but not
simpler", and the current templating spec is a case of trying to
make something too simple. The percent-encoding is a complexity,
and like a lump in the rug, you push it down in one place and it
will pop up in another. The simple {var} syntax is too simple to
handle what we are trying to do. So let's expand the power of these
expansions and see if that makes the problems go away.

So what does this experiment look like? Let's be as draconian as possible:

   1. Convert Unicode template values to UTF-8.
   2. Percent-encode all characters outside unreserved.
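
As a rough sketch of those two steps, assuming Python 3.7 or later
(where urllib.parse.quote leaves exactly the unreserved characters
alone once the safe set is emptied), the whole rule fits in one call.
The helper name encode_value is hypothetical.

```python
from urllib.parse import quote

def encode_value(value):
    """Encode a template value: UTF-8 first, then percent-encode
    every byte that is not an unreserved character."""
    # quote() encodes str input as UTF-8 by default. safe="" removes "/"
    # from the set of characters it would otherwise pass through.
    return quote(value, safe="")

print(encode_value("240/Newsqueak"))
print(encode_value("ben&jerrys"))
print(encode_value("2+2=5"))
```

Every reserved character in a value is escaped, so the only way one can
reach the final URI is from the template text itself.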

We can keep our simple {var} expansion, but let's add in
a default value:

    {var=default}
         Simple substitution

    Example:

    URI Template
       http://example.org/{fruit=orange}/
    Template Var
       fruit = "apple"
    URI
       http://example.org/apple/

    URI Template
        http://example.org/{fruit=orange}/
    Template Var
        fruit is undefined
    URI
        http://example.org/orange/
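
A minimal sketch of the {var=default} form in Python follows. The name
expand_simple and the regular expression are assumptions of mine, and I
have chosen to treat an empty value like an undefined one, which the
text above does not specify. Defaults are emitted as-is, on the
assumption that they are already percent-encoded.

```python
import re
from urllib.parse import quote

# {name} or {name=default}. Names come from unreserved; defaults may also
# contain "%" because they are assumed to be percent-encoded already.
SIMPLE = re.compile(r"\{([\w.~-]+)(?:=([\w.~%-]+))?\}", re.ASCII)

def expand_simple(template, values):
    """Expand every {var} and {var=default} in `template`."""
    def substitute(match):
        name, default = match.group(1), match.group(2)
        value = values.get(name)
        if value:
            # Defined and non-empty: encode everything outside unreserved.
            return quote(value, safe="")
        # Undefined (or empty, by assumption): fall back to the default.
        return default or ""
    return SIMPLE.sub(substitute, template)

print(expand_simple("http://example.org/{fruit=orange}/", {"fruit": "apple"}))
print(expand_simple("http://example.org/{fruit=orange}/", {}))
```

Note that the literal text of the template, including its trailing
slash, is copied through untouched; only the braces are replaced.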

The rest of the expansions I'll define follow this form:


    {<op><arg>|<variable(s)>}


    <op>          - A single character not in unreserved.
    <arg>         - Zero or more legal URI characters, excluding '|'.
    <variable(s)> - May be more than one, may have defaults.
                    Default values must already be percent-encoded.
                    Variable names are from unreserved.



    {<prefix|var[=default]}
         Prefix var with prefix, emit empty string if
         var is empty or undefined.

    URI Template
        bar{</|var}/
    Template Var
        var := foo
    URI
        bar/foo/


    {>postfix|var[=default]}
         Append postfix to var, emit empty string if
         var is empty or undefined.

    URI Template
        bar/{>#home|var}
    Template Var
        var := foo
    URI
        bar/foo#home


    {,sep|var1=def1, var2=def2, ...}
           Substitute the concatenation of variable name,
           "=", variable value. Join more than one var by the value
           of 'sep'.

    URI Template
        {,&|name,location,age}
    Template Var
        name := joe
        location := NYC
    URI
         name=joe&location=NYC


     {&sep|var}
          Treat var as a list and join the values in the list
          with the given separator. Emit empty string if var
          is empty or undefined.

     URI Template
          {&/|segments}
     Template Var
          segments := ["a", "b", "c"]
     URI
          a/b/c



    {?opt|var}
          Inserts opt if var is a string or non-zero length list.

    URI Template
        {?/|segments}
    Template Var
        segments := ["a", "b", "c"]
    URI
        /

     {!opt|var}
          Inserts opt if var is undefined or a zero length list.

     URI Template
          {!/|segments}
     Template Var
          segments := ["a", "b", "c"]
     URI
           ""

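The six operators defined above can be sketched as plain Python
functions, one per operator. All function names here are hypothetical,
defaults are left out to keep the sketch short, and is_present follows
the wording above literally (any string counts as present, a list only
if it is non-empty).

```python
from urllib.parse import quote

def enc(value):
    # Percent-encode everything outside unreserved (UTF-8 first).
    return quote(value, safe="")

def is_present(value):
    # "a string or non-zero length list"
    return isinstance(value, str) or (isinstance(value, list) and len(value) > 0)

def prefix(arg, value):            # {<arg|var}
    return arg + enc(value) if value else ""

def postfix(arg, value):           # {>arg|var}
    return enc(value) + arg if value else ""

def pairs(sep, names, values):     # {,sep|var1,var2,...}
    # Undefined or empty variables are simply skipped.
    return sep.join(f"{n}={enc(values[n])}" for n in names if values.get(n))

def join(sep, items):              # {&sep|var}
    return sep.join(enc(item) for item in items or [])

def if_present(opt, value):        # {?opt|var}
    return opt if is_present(value) else ""

def if_absent(opt, value):         # {!opt|var}
    return "" if is_present(value) else opt

print("bar" + prefix("/", "foo") + "/")
print(pairs("&", ["name", "location", "age"], {"name": "joe", "location": "NYC"}))
print(if_present("/-/", ["A|-B", "-C"]) + join("/", ["A|-B", "-C"]))
```

The separators and prefixes are passed through unencoded while the
values are always encoded, which is the whole point of the experiment.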

Does it work? Let's try all our previous examples:

My blog template works out fine:

      http://bitworking.org/news/{entry}
      entry := '240/Newsqueak'
      http://bitworking.org/news/240/Newsqueak

All the searches do too:

     http://www.google.com/search?q={term}
     term := ben&jerrys
     http://www.google.com/search?q=ben%26jerrys

     http://www.google.com/search?q={term}
     term := 2+2=5
     http://www.google.com/search?q=2%2B2%3D5

     http://www.google.com/base/feeds/snippets/?{,&|author}
     author := joe.gregorio@gmail.com
     http://www.google.com/base/feeds/snippets/?author=joe.gregorio%40gmail.com

In the example from Google search, all variable names in the {,}
expansion are optional, i.e. none of those variables need be defined.

     http://www.google.com/search?{,&|q,num}

Internationalization is also covered:

     http://www.google.com/search?q={term}
     term := Îñţérñåţîöñåļîžåţîöñ
      http://www.google.com/search?q=%C3%8E%C3%B1%C5%A3%C3%A9r%C3%B1%C3%A5%C5%A3%C3%AE%C3%B6%C3%B1%C3%A5%C4%BC%C3%AE%C5%BE%C3%A5%C5%A3%C3%AE%C3%B6%C3%B1

Note that to handle the complex category query from GData we
need to use two expansions: '?' and '&'.

     URI Template
           {?/-/|categories}{&/|categories}
     Template Vars
           categories := ["A|-B", "-C"]
     URI
           /-/A%7C-B/-C

To see multiple expansions working together, let's look at a URI
from the Google Notebook GData service
<http://code.google.com/apis/notebook/reference.html>:

   http://www.google.com/notebook/feeds/
   {userID}{</notebooks/|notebookID}
   {?/-/|categories}{&/|categories}?
   {,&|updated-min,updated-max,alt,
       start-index,max-results,entryID,orderby}

So this system is pretty capable without crossing over into
the realm of Turing Complete. On the other hand, it is not
without fault:

   1. Doesn't handle repeated query parameters.
   2. Doesn't specify if variables are mandatory or optional.
   3. Doesn't handle encodings besides UTF-8.
   4. Template language is complex, cryptic.
   5. No handling of input validation, enums, ranges, etc.
   6. Possible to define a self-inconsistent URI Template:
         1. {&|fred}{<#|fred}
   7. Prefixes and suffixes are redundant, as
        they could be handled by using the '?' expansion.
   8. Comma expansions could have two strings, one to separate
        name-value pairs (as now), the other to separate names from
        values (now hard-coded to "=").
   9. Sensible defaults need to be invented to deal with parameter values
        that are lists when not expected to be (or are not lists when
        expected to be) (see #6).
  10. No specification for how to handle IRIs beyond "Turn an IRI Template
       into a URI Template and then proceed."
  11. Need way to say "Insert this if some/none of these variables exist"
        to strip trailing "?" from URIs with no parameters.


To those of you thinking to yourself, "I'd understand that much
better as a BNF", here you go:

  token arg        '[^\|]*';
  token varname    '[\w\-\.~]+' ;
  token vardefault '[\w\-\.\~\%]+' ;

  START    -> Template;
  Template ->
       IdentityOperatorTemplate
     | OperatorTemplate
     ;
  IdentityOperatorTemplate/ -> Var
  OperatorTemplate/o         ->
       '>'  arg '\|' Var
     | '<'  arg '\|' Var
     | ','  arg '\|' Vars
     | '\&' arg '\|' VarNoDefault
     | '\?' arg '\|' VarNoDefault
     | '!'  arg '\|'  VarNoDefault
     ;
  Vars                     -> Var ( ',' Var ) * ;
  Var                      -> varname ( '=' vardefault ) ? ;
  VarNoDefault         -> varname
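
For readers without tpg installed, the grammar above can be
approximated with Python's re module. This is only a sketch: the name
parse_expansion and the tuple it returns are my own invention and are
not the API of the implementation linked below. It parses the text
between '{' and '}' and enforces the same restrictions on defaults and
on the number of variables.

```python
import re

VARNAME = r"[\w.~-]+"        # token varname
VARDEFAULT = r"[\w.~%-]+"    # token vardefault
VAR = rf"{VARNAME}(?:={VARDEFAULT})?"

# Optional "<op><arg>|" head, then one or more comma-separated variables.
EXPANSION = re.compile(
    rf"(?:(?P<op>[<>,&?!])(?P<arg>[^|]*)\|)?(?P<vars>{VAR}(?:,{VAR})*)",
    re.ASCII,
)

def parse_expansion(body):
    """Return (op, arg, [(name, default_or_None), ...]) for one expansion."""
    match = EXPANSION.fullmatch(body)
    if match is None:
        raise ValueError(f"not a valid expansion: {body!r}")
    op = match.group("op") or ""      # "" is the identity operator
    arg = match.group("arg") or ""
    variables = []
    for item in match.group("vars").split(","):
        name, _, default = item.partition("=")
        variables.append((name, default or None))
    # Vars (several variables) is only allowed for the ',' operator.
    if op != "," and len(variables) != 1:
        raise ValueError("only the ',' operator accepts several variables")
    # '&', '?' and '!' take VarNoDefault.
    if op in ("&", "?", "!") and variables[0][1] is not None:
        raise ValueError(f"operator {op!r} does not accept a default")
    return op, arg, variables

print(parse_expansion("fruit=orange"))
print(parse_expansion(",&|name,location,age"))
```

Splitting parsing from substitution like this keeps the grammar checks
in one place; the substitution step can then dispatch on the operator.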

And finally to those of you thinking to yourself, "that would be so
much better as working code", I present:

  http://code.google.com/p/uri-templates/

It is a Python implementation with unit tests, and it requires 'tpg',
the Toy Parser Generator.

>>> import template_parser
>>> t = template_parser.URITemplate("http://www.google.com/notebook/feeds/{userID}{</notebooks/|notebookID}{?/-/|categories}{&/|categories}?{,&|updated-min,updated-max,alt,start-index,max-results,entryID,orderby}")
>>> t.sub({})
'http://www.google.com/notebook/feeds/?'
>>> t.sub({"userID": "joe.gregorio"})
'http://www.google.com/notebook/feeds/joe.gregorio?'
>>> t.sub({"userID": "joe.gregorio", "notebookID": "foo"})
'http://www.google.com/notebook/feeds/joe.gregorio/notebooks/foo?'
>>> t.sub({"userID": "joe.gregorio", "notebookID": "foo", "start-index": "20"})
'http://www.google.com/notebook/feeds/joe.gregorio/notebooks/foo?start-index=20'
>>>


To be clear, I don't believe this is a final or complete solution, but I think
it's a good start, and it at least shows that expansions are a viable
solution to the percent-encoding issue.

    Thanks,
    -joe

-- 
Joe Gregorio        http://bitworking.org
Received on Saturday, 13 October 2007 02:23:01 UTC