URI and IRI Templating (What did I get myself into?) from Joe Gregorio on 2006-12-22 (uri@w3.org from December 2006)

From: Joe Gregorio <joe@bitworking.org>
Date: Fri, 22 Dec 2006 13:49:29 -0500
To: uri@w3.org
Message-ID: <3f1451f50612221049p24858d1ak280260ee6a9730df@mail.gmail.com>
We have several open issues:

1. Deciding which characters to escape.
2. Reserving some character in template variable names
    for future use, à la ':' for XML namespaces.

While this is a long post, I will only cover the issues involved in #1.

My over-arching goal for URI Templates, and I believe this is
necessary to make them a success, is to make them
simple by being opinionated, as Sam described it.

   http://lists.w3.org/Archives/Public/uri/2006Oct/0043

== Grounding ==

First let's dispel the notion that you can come up with
the perfect URI-Template to URI translation mechanism
that will always produce a valid URI regardless of the
scheme. That last part, "regardless of the scheme", is the
crux of the problem. While RFC 3986 defines what a
URI looks like, schemes may impose further restrictions. For
example, while

   tel:bitworking.org

matches the ABNF in RFC 3986, it is not a valid tel: URI,
and it never will be.
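
To make that concrete, here is a minimal sketch. It assumes Python 3's urllib.parse for the generic split, and TEL_GLOBAL and looks_like_tel are hypothetical names of mine. The pattern is only a rough stand-in for the global-number form of a tel: URI, not a full validator.

```python
import re
from urllib.parse import urlsplit

# Rough approximation of a tel: global number: "+" followed by digits and
# the visual separators - . ( ), with at least one digit. Illustration only.
TEL_GLOBAL = re.compile(r"^\+[0-9().-]*[0-9][0-9().-]*$")

def looks_like_tel(uri):
    """Return True if uri has the tel scheme and a plausible global number."""
    parts = urlsplit(uri)
    return parts.scheme == "tel" and bool(TEL_GLOBAL.match(parts.path))

# The generic syntax happily splits this into a scheme and a path...
print(urlsplit("tel:bitworking.org").scheme)   # tel
# ...but a scheme-aware check rejects it.
print(looks_like_tel("tel:bitworking.org"))    # False
print(looks_like_tel("tel:+1-816-555-1212"))   # True
```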

We have two choices:

1. Define a mechanism that is only guaranteed to meet the URI
    syntax (i.e. RFC 3986), and thus potentially generate
    URIs that are invalid in some schemes.
2. Restrict ourselves to URIs of a particular scheme such
   as http: or mailto:.

Just for reference here is a set of example URIs from RFC 3986:

ftp://ftp.is.co.za/rfc/rfc1808.txt
http://www.ietf.org/rfc/rfc2396.txt
ldap://[2001:db8::7]/c=GB?objectClass?one
mailto:John.Doe@example.com
news:comp.infosystems.www.servers.unix
tel:+1-816-555-1212
telnet://192.0.2.16:80/
urn:oasis:names:specification:docbook:dtd:xml:4.1.2

== Serendipity ==

As an aside, it turns out that the regular expression given in
Appendix B of RFC 3986 is completely capable of
parsing up URI Templates, but only if the characters
allowed in template variable names are restricted, and
only if template variables are not allowed to span
components.

Here is a Python implementation that uses that regular expression:

import re

URI = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def parse_uri(uri):
    """Parses a URI using the regex given in Appendix B of RFC 3986.

        (scheme, authority, path, query, fragment) = parse_uri(uri)
    """
    groups = URI.match(uri).groups()
    return (groups[1], groups[3], groups[4], groups[6], groups[8])

And if we run that over the example URIs with
templated parts added in:

print(parse_uri("http://{server}/rfc/rfc2396.txt"))
print(parse_uri("ftp://ftp.is.co.za/{dir}/rfc1808.txt"))
print(parse_uri("ldap://[2001:db8::7]/c={country}?objectClass?one"))
print(parse_uri("mailto:{addr}"))
print(parse_uri("news:comp.infosystems.www.servers.{server}"))
print(parse_uri("tel:+{number}"))
print(parse_uri("telnet://192.0.2.16:{port}/"))
print(parse_uri("urn:oasis:names:specification:docbook:dtd:{version}"))

We get:

('http', '{server}', '/rfc/rfc2396.txt', None, None)
('ftp', 'ftp.is.co.za', '/{dir}/rfc1808.txt', None, None)
('ldap', '[2001:db8::7]', '/c={country}', 'objectClass?one', None)
('mailto', None, '{addr}', None, None)
('news', None, 'comp.infosystems.www.servers.{server}', None, None)
('tel', None, '+{number}', None, None)
('telnet', '192.0.2.16:{port}', '/', None, None)
('urn', None, 'oasis:names:specification:docbook:dtd:{version}', None, None)

This is important because it makes it easy to parse up
a URI Template *if* we want to impose
different escaping requirements on different components.
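
As a rough sketch of what that could look like, the following reports which component each template variable lands in. It assumes Python 3 and assumes variable names never contain '{', '}', or any of the delimiters the regular expression splits on. VARIABLE, COMPONENTS and variables_by_component are hypothetical names for illustration.

```python
import re

# The Appendix B regular expression from RFC 3986.
URI = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
# Assumed variable syntax: anything between a pair of braces.
VARIABLE = re.compile(r"\{([^{}]*)\}")
COMPONENTS = ("scheme", "authority", "path", "query", "fragment")

def variables_by_component(template):
    """Map each component that contains variables to the names found in it."""
    groups = URI.match(template).groups()
    parts = (groups[1], groups[3], groups[4], groups[6], groups[8])
    found = {}
    for name, part in zip(COMPONENTS, parts):
        names = VARIABLE.findall(part) if part else []
        if names:
            found[name] = names
    return found

print(variables_by_component("telnet://192.0.2.16:{port}/"))
# {'authority': ['port']}
print(variables_by_component("http://example.org/{a}?q={b}#{c}"))
# {'path': ['a'], 'query': ['b'], 'fragment': ['c']}
```

With that mapping in hand, a processor could pick a different set of characters to escape for each component.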

== What to %-encode ==

Certain characters are going to have to be %-encoded
to ensure that filling in a URI-Template with the values
doesn't destroy the structure of the URI. For both
URIs and IRIs the 'reserved' set of characters are the
ones that are going to cause trouble and need to be
escaped.

   reserved       = gen-delims / sub-delims
   gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

Each part of an IRI has its own set of acceptable chars
(simplified here from the ABNF in RFC 3987):

     scheme         = ALPHA / DIGIT / "+" / "-" / "."
     iauthority     = ipchar
     ipath          = ipchar  / "/"
     iquery         = ipchar / iprivate / "/" / "?"
     ifragment      = ipchar / "/" / "?"

where:

     ipchar = iunreserved / pct-encoded / sub-delims / ":" / "@"

The rules are the same for URIs, except drop
all the 'i's off the beginning of the names, and drop iprivate.

So let's begin with a simple approach: how about
escaping all the characters in 'reserved'?

If we do, then you *can't* do this:

   http://example.org?{fred}
   fred="q=2"

expands to:

   http://example.org?q=2

That might seem too restrictive, so let's make
that example concrete.

   http://www.google.com/search?q={term}

   term="Ben&Jerrys"

If reserved characters are escaped then
the URI Template expands to:

   http://www.google.com/search?q=Ben%26Jerrys

That search gives you the results you
would expect. If reserved characters are NOT escaped then you
get a very different search result:

   http://www.google.com/search?q=Ben&Jerrys

And that does *not* give the expected results.
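
Here is a small sketch of why the two differ. It assumes Python 3's urllib.parse, and it assumes the server splits the query on '&' the way a typical form-style parser does. The variable names are mine.

```python
from urllib.parse import parse_qs, quote, urlsplit

term = "Ben&Jerrys"
# safe="" makes quote() escape every reserved character, including '&'.
escaped = "http://www.google.com/search?q=" + quote(term, safe="")
unescaped = "http://www.google.com/search?q=" + term

print(escaped)
# http://www.google.com/search?q=Ben%26Jerrys
print(parse_qs(urlsplit(escaped).query))
# {'q': ['Ben&Jerrys']}
print(parse_qs(urlsplit(unescaped).query, keep_blank_values=True))
# {'q': ['Ben'], 'Jerrys': ['']}
```

In the unescaped case the '&' is read as a parameter separator, so the search term is cut short at "Ben".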

So let's always escape? Not so fast. If
we always escape reserved characters we get

   mailto:{address}
   address="joe@bitworking.org"

expanding to

   mailto:joe%40bitworking.org

which is *not* what you want to happen.

Like I said, we can't come up with something guaranteed to
generate only valid URIs unless we restrict ourselves to a particular
scheme, which isn't as useful as defining templates for all URIs.
So what if we pick a subset of 'reserved' that does not
get %-encoded? Can we pick a subset that produces
the least surprising results? Here is my suggestion, to escape
all the characters in 'reserved' except the following three:

  '@' / ':' / '/'

The above subset seems to generate the 'least surprising' results:

Our Ben&Jerrys query to Google still works.
The mailto: example works.
HTTP paths also work as expected:

   http://bitworking.org/{path}
   path="projects/httplib2/"

   http://bitworking.org/projects/httplib2/
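
A minimal sketch of that suggestion, assuming Python 3 and a bare {name} variable syntax. VARIABLE and expand are hypothetical names. Note that urllib.parse.quote escapes everything outside the unreserved set and the characters listed in safe, which is slightly broader than just 'reserved' (a space or a '%' is escaped too).

```python
import re
from urllib.parse import quote

# Assumed variable syntax: anything between a pair of braces.
VARIABLE = re.compile(r"\{([^{}]*)\}")

def expand(template, values, safe="@:/"):
    """Fill in {name} variables, %-encoding each value except chars in safe."""
    return VARIABLE.sub(lambda m: quote(values[m.group(1)], safe=safe), template)

print(expand("http://www.google.com/search?q={term}", {"term": "Ben&Jerrys"}))
# http://www.google.com/search?q=Ben%26Jerrys
print(expand("mailto:{address}", {"address": "joe@bitworking.org"}))
# mailto:joe@bitworking.org
print(expand("http://bitworking.org/{path}", {"path": "projects/httplib2/"}))
# http://bitworking.org/projects/httplib2/
```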

Like I said, it's not perfect:

   http://{sub}.example.org/index.html
   sub="a/b"

   http://a/b.example.org/index.html

Which is clearly not the URI we wanted: the authority is
now just 'a' and the rest has spilled into the path. So do we
give special escaping rules for authority?
That at least keeps the value inside the authority,
but for the HTTP scheme the string
a%2Fb.example.org isn't a valid domain name.
And don't even get me started on how this could go
bad if you allowed template variables in the
scheme:

   {scheme}://bitworking.org
   scheme="gopher"

   gopher://bitworking.org

On the other hand, I could see useful
applications:

   http{ssl}://bitworking.org
   ssl="s"

   https://bitworking.org

So we have a few possibilities:

1. Escape all 'reserved' characters except @, :, and /
    across every component, realizing
    we may not end up with a valid URI.
2. Escape all 'reserved' characters except @ and :,
    realizing that our 'path' example
    will then break since '/' will get escaped.
3. Escape all 'reserved' characters except @, :, and /,
    but only allow template variables in path, query and
    fragment components.
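
To compare the first two possibilities side by side, here is a rough sketch. It assumes Python 3 and a bare {name} variable syntax. expand, OPTION_1, OPTION_2 and results are hypothetical names, and the third possibility (restricting where variables may appear) is not modelled.

```python
import re
from urllib.parse import quote, urlsplit

VARIABLE = re.compile(r"\{([^{}]*)\}")

def expand(template, values, safe):
    """Fill in {name} variables, %-encoding each value except chars in safe."""
    return VARIABLE.sub(lambda m: quote(values[m.group(1)], safe=safe), template)

OPTION_1 = "@:/"   # leave @, : and / alone
OPTION_2 = "@:"    # leave only @ and : alone

results = {}
for label, safe in (("option 1", OPTION_1), ("option 2", OPTION_2)):
    host = expand("http://{sub}.example.org/index.html", {"sub": "a/b"}, safe)
    path = expand("http://bitworking.org/{path}", {"path": "projects/httplib2/"}, safe)
    # Record the authority the first URI ends up with, and the second URI.
    results[label] = (urlsplit(host).netloc, path)
    print(label, results[label])
# option 1 ('a', 'http://bitworking.org/projects/httplib2/')
# option 2 ('a%2Fb.example.org', 'http://bitworking.org/projects%2Fhttplib2%2F')
```

Option 1 keeps the path readable but lets the value break out of the authority. Option 2 keeps the value in the authority but mangles the path.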

== IRIs ==

Just as another aside, I am no longer afraid of IRIs.

== The Algorithm ==

Let's start with IRIs since those are actually simpler, and let's
also assume that we choose #1 of the options above:

1. Escape all 'reserved' characters except @, :, and /
    across every component, realizing
    we may not end up with a valid URI.


Algorithm:

   1. Start with an IRI Template:

       http://example.org/{blah}

   2. Percent-encode every char in the values
       of the template variables that isn't in

             ( iprivate | iunreserved | '@' | ':' | '/' )

   3. Substitute variables with their values, which produces an IRI.

Note that we could use the same algorithm for URI Templates
as long as we add a fourth step:

   4. Convert the IRI to a URI following Section 3.1 of RFC 3987.
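
The four steps can be sketched roughly as follows, assuming Python 3 and a bare {name} variable syntax. All the names here (UCSCHAR, IPRIVATE, keep, pct, expand_iri, iri_to_uri) are hypothetical. The code point ranges are my transcription of ucschar and iprivate from RFC 3987. Step 4 is simplified: it only %-encodes non-ASCII characters as UTF-8 and skips the normalization and host name handling that Section 3.1 also describes.

```python
import re
import string

VARIABLE = re.compile(r"\{([^{}]*)\}")

# ucschar and iprivate code point ranges, transcribed from RFC 3987.
UCSCHAR = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
UCSCHAR += [(p << 16, (p << 16) + 0xFFFD) for p in range(1, 14)]
UCSCHAR += [(0xE1000, 0xEFFFD)]
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

# ASCII part of iunreserved, plus the three characters left unescaped.
ASCII_KEEP = set(string.ascii_letters + string.digits + "-._~" + "@:/")

def keep(ch):
    """True if ch may be copied into the IRI without %-encoding."""
    cp = ord(ch)
    return ch in ASCII_KEEP or any(lo <= cp <= hi for lo, hi in UCSCHAR + IPRIVATE)

def pct(ch):
    """Percent-encode one character as its UTF-8 bytes."""
    return "".join("%%%02X" % b for b in ch.encode("utf-8"))

def expand_iri(template, values):
    """Steps 1-3: escape each value, then substitute it into the template."""
    def fill(m):
        return "".join(c if keep(c) else pct(c) for c in values[m.group(1)])
    return VARIABLE.sub(fill, template)

def iri_to_uri(iri):
    """Step 4, simplified: %-encode every non-ASCII character as UTF-8."""
    return "".join(c if ord(c) < 128 else pct(c) for c in iri)

iri = expand_iri("http://example.org/{blah}", {"blah": "caf\u00e9 & bar"})
print(iri)              # http://example.org/café%20%26%20bar
print(iri_to_uri(iri))  # http://example.org/caf%C3%A9%20%26%20bar
```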

Hopefully this has been helpful in highlighting
some of the subtle issues in character handling
that need to be more strictly specified in the
next revision of the spec.

   Thanks,
   -joe

-- 
Joe Gregorio        http://bitworking.org
Received on Friday, 22 December 2006 18:49:46 UTC