- From: Joe Gregorio <joe@bitworking.org>
- Date: Fri, 22 Dec 2006 13:49:29 -0500
- To: uri@w3.org
We have several open issues:

1. Deciding which characters to escape.
2. Reserving some character in template variable names for future use, à la ':' for XML namespaces.

While this is a long post, I will only cover the issues involved in #1. My overarching goal for URI Templates, and I believe this is necessary to make them a success, is to make URI Templates simple by being opinionated, as Sam described it.

    http://lists.w3.org/Archives/Public/uri/2006Oct/0043

== Grounding ==

First let's dispel the notion that you can come up with the perfect URI Template to URI translation mechanism that will always produce a valid URI regardless of the scheme. That last part, "regardless of the scheme", is the crux of the problem. While RFC 3986 defines what a URI looks like, schemes may impose further restrictions. For example, while tel:bitworking.org matches the ABNF in RFC 3986, it is not a valid tel: URI, and it never will be. We have two choices:

1. Define a mechanism that is only guaranteed to meet the URI syntax (i.e. RFC 3986), and thus potentially generate URIs that are invalid in some schemes.
2. Restrict ourselves to URIs of a particular scheme such as http: or mailto:.

Just for reference, here is a set of example URIs from RFC 3986:

    ftp://ftp.is.co.za/rfc/rfc1808.txt
    http://www.ietf.org/rfc/rfc2396.txt
    ldap://[2001:db8::7]/c=GB?objectClass?one
    mailto:John.Doe@example.com
    news:comp.infosystems.www.servers.unix
    tel:+1-816-555-1212
    telnet://192.0.2.16:80/
    urn:oasis:names:specification:docbook:dtd:xml:4.1.2

== Serendipity ==

As an aside, it turns out that the regular expression given in Appendix B of RFC 3986 is completely capable of parsing URI Templates, but only if the characters allowed in template variable names are restricted, and only if template variables are not allowed to span components.
Here is a Python implementation that uses that regular expression:

    import re

    URI = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

    def parse_uri(uri):
        """Parses a URI using the regex given in Appendix B of RFC 3986.

        (scheme, authority, path, query, fragment) = parse_uri(uri)
        """
        groups = URI.match(uri).groups()
        return (groups[1], groups[3], groups[4], groups[6], groups[8])

And if we run that over the example URIs with templated parts added in:

    print(parse_uri("http://{server}/rfc/rfc2396.txt"))
    print(parse_uri("ftp://ftp.is.co.za/{dir}/rfc1808.txt"))
    print(parse_uri("ldap://[2001:db8::7]/c={country}?objectClass?one"))
    print(parse_uri("mailto:{addr}"))
    print(parse_uri("news:comp.infosystems.www.servers.{server}"))
    print(parse_uri("tel:+{number}"))
    print(parse_uri("telnet://192.0.2.16:{port}/"))
    print(parse_uri("urn:oasis:names:specification:docbook:dtd:{version}"))

We get:

    ('http', '{server}', '/rfc/rfc2396.txt', None, None)
    ('ftp', 'ftp.is.co.za', '/{dir}/rfc1808.txt', None, None)
    ('ldap', '[2001:db8::7]', '/c={country}', 'objectClass?one', None)
    ('mailto', None, '{addr}', None, None)
    ('news', None, 'comp.infosystems.www.servers.{server}', None, None)
    ('tel', None, '+{number}', None, None)
    ('telnet', '192.0.2.16:{port}', '/', None, None)
    ('urn', None, 'oasis:names:specification:docbook:dtd:{version}', None, None)

This is important because it makes it easy to parse a URI Template *if* we want to impose different escaping requirements on different components.

== What to %-encode ==

Certain characters are going to have to be %-encoded to ensure that filling in a URI Template with the values doesn't destroy the structure of the URI. For both URIs and IRIs the 'reserved' set of characters are the ones that are going to cause trouble and need to be escaped.

    reserved   = gen-delims / sub-delims
    gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

Each part of an IRI has its own acceptable chars:

    scheme     = ALPHA / DIGIT / "+" / "-" / "."
    iauthority = ipchar
    ipath      = ipchar / "/"
    iquery     = ipchar / iprivate / "/" / "?"
    ifragment  = ipchar / "/" / "?"

where:

    ipchar = iunreserved / pct-encoded / sub-delims / ":" / "@"

The rules are the same for URIs, except drop all the 'i's off the beginning of the names, and drop iprivate.

So let's begin with a simple approach: how about escaping all the characters in 'reserved'? If we do, then you *can't* do this:

    http://example.org?{fred}
    fred="q=2"

expands to:

    http://example.org?q=2

That might seem too restrictive, so let's make that example concrete.

    http://www.google.com/search?q={term}
    term="Ben&Jerrys"

If reserved characters are escaped then the URI Template expands to:

    http://www.google.com/search?q=Ben%26Jerrys

That search gives you the results you would expect. If reserved characters are NOT escaped then you get a very different search result:

    http://www.google.com/search?q=Ben&Jerrys

And that does *not* give the expected results. So let's always escape? Not so fast. If we always escape reserved characters we get

    mailto:{address}
    address="joe@bitworking.org"

expanding to

    mailto:joe%40bitworking.org

which is *not* what you want to happen. Like I said, we can't come up with something guaranteed to generate only valid URIs unless we restrict ourselves to a particular scheme, which isn't as useful as defining templates for all URIs.

So what if we pick a subset of 'reserved' that does not get %-encoded? Can we pick a subset that produces the least surprising results? Here is my suggestion: escape all the characters in 'reserved' except the following three:

    '@' / ':' / '/'

The above subset seems to generate the 'least surprising' results: our Ben&Jerrys query to Google still works. The mailto: example works.
HTTP paths also work as expected:

    http://bitworking.org/{path}
    path="projects/httplib2/"

    http://bitworking.org/projects/httplib2/

Like I said, it's not perfect:

    http://{sub}.example.org/index.html
    sub="a/b"

    http://a/b.example.org/index.html

Which is clearly an invalid URI. So do we give special escaping rules for authority? That at least makes the results match the URI syntax, but for the HTTP scheme the string a%2Fb.example.org isn't a valid domain name.

And don't even get me started on how this could go bad if you allowed template variables in the scheme:

    {scheme}://bitworking.org
    scheme="gopher"

    gopher://bitworking.org

On the other hand, I could see useful applications:

    http{ssl}://bitworking.org
    ssl="s"

    https://bitworking.org

So we have a few possibilities:

1. Escape all 'reserved' characters except @, :, and / across every component, realizing we may not end up with a valid URI.
2. Escape all 'reserved' characters except @ and :, realizing that our 'path' example will then break since '/' will get escaped.
3. Escape all 'reserved' characters except @, :, and /, but only allow template variables in path, query and fragment components.

== IRIs ==

Just as another aside, I am no longer afraid of IRIs.

== The Algorithm ==

Let's start with IRIs since those are actually simpler, and let's also assume that we choose #1 of the options above:

1. Escape all 'reserved' characters except @, :, and / across every component, realizing we may not end up with a valid URI.

Algorithm:

1. Start with an IRI Template: http://example.org/{blah}
2. Percent-encode every char in the values of the template variables that isn't in ( iprivate | iunreserved | '@' | ':' | '/' )
3. Substitute variables with their values, which produces an IRI.

Note that we could use the same algorithm for URI Templates as long as we add a fourth step:

4. Convert the IRI to a URI following Section 3.1 of RFC 3987.
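To make the steps above concrete, here is a minimal sketch in Python 3 of option #1 as applied to URI Templates. It is an illustration under stated assumptions, not a reference implementation: the names expand and TEMPLATE_VAR are hypothetical, variable names are assumed to be anything between '{' and '}', and urllib.parse.quote is used as a stand-in for steps 2 and 4 together. Because quote percent-encodes non-ASCII characters as UTF-8 octets, the result is a URI directly rather than an intermediate IRI.

```python
import re
from urllib.parse import quote

# Hypothetical pattern: a template variable is any run of characters
# other than braces, enclosed in '{' and '}'.
TEMPLATE_VAR = re.compile(r"\{([^{}]+)\}")

def expand(template, values):
    """Expand a URI Template, escaping everything in each value
    except unreserved characters and '@', ':' and '/'.

    Raises KeyError if the template names a variable missing from values.
    """
    def substitute(match):
        # quote() leaves ASCII letters, digits and "_.-~" alone; the
        # safe argument adds the three characters we chose not to escape.
        return quote(values[match.group(1)], safe="@:/")
    return TEMPLATE_VAR.sub(substitute, template)

print(expand("http://www.google.com/search?q={term}", {"term": "Ben&Jerrys"}))
print(expand("mailto:{address}", {"address": "joe@bitworking.org"}))
print(expand("http://bitworking.org/{path}", {"path": "projects/httplib2/"}))
print(expand("http://{sub}.example.org/index.html", {"sub": "a/b"}))
```

The first three calls reproduce the Google, mailto: and path examples from earlier in this post, and the last one reproduces the invalid authority case, which this approach knowingly does not prevent.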
Hopefully this has been helpful in highlighting some of the subtle issues in character handling that need to be more strictly specified in the next revision of the spec.

Thanks,
-joe

--
Joe Gregorio
http://bitworking.org
Received on Friday, 22 December 2006 18:49:46 UTC