Escaping characters in fn:build-uri

Hello,

In the course of discussion on issue #566, I’ve discovered that the specification for fn:build-uri mandates escaping path segments with fn:encode-for-uri. This has the expository advantage of being an easy cross-reference. Unfortunately, it’s a bit too aggressive.

Consider http://example.com/path%2fto/some=where

When that’s parsed by fn:parse-uri, we get the following segments:

  ("", "path/to", "some=where")

Decoding the characters in the original path when constructing the segments is convenient for users. If we then attempt to reconstruct the original URI with fn:build-uri, we get:

  http://example.com/path%2Fto/some%3Dwhere

That isn’t the same URI: the “=” was literal in the original and is now escaped. (This is just one example of a character that’s encoded where we’d prefer it wasn’t; the test suite includes many examples of “:” and other characters that fn:encode-for-uri would encode too aggressively.)
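
For anyone who wants to reproduce the round trip outside XPath, here is a minimal Python sketch. It assumes that urllib.parse.quote with safe="" is a fair stand-in for fn:encode-for-uri (both leave only letters, digits, and "-", "_", ".", "~" unescaped). The helper names are hypothetical, and this is not how any implementation of fn:parse-uri or fn:build-uri actually works.

```python
from urllib.parse import quote, unquote

def parse_segments(path):
    # Split on literal "/" first, then percent-decode each segment,
    # mirroring the segments shown above.
    return [unquote(seg) for seg in path.split("/")]

def build_path_aggressive(segments):
    # Assumption: quote(..., safe="") approximates fn:encode-for-uri.
    return "/".join(quote(seg, safe="") for seg in segments)

original = "/path%2fto/some=where"
segments = parse_segments(original)
rebuilt = build_path_aggressive(segments)

print(segments)
print(rebuilt)
```

The rebuilt path escapes the "=" that was literal in the original, which is the mismatch described above.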

It’s worth noting that, in general, decoding %-escaped characters in a URI is not reversible. If you started with http://example.com/path%2fto/some%3dwhere, you’d get the same segments as in the example above.
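
A small Python sketch of that point, using urllib.parse.unquote as an assumed stand-in for the decoding step (the helper name is hypothetical):

```python
from urllib.parse import unquote

def decode_segments(path):
    # Split on literal "/" and percent-decode each segment.
    return [unquote(seg) for seg in path.split("/")]

literal_equals = decode_segments("/path%2fto/some=where")
escaped_equals = decode_segments("/path%2fto/some%3dwhere")

# Both inputs collapse to the same segments, so the original
# spelling cannot be recovered from the segments alone.
same = literal_equals == escaped_equals
print(literal_equals, escaped_equals, same)
```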

It’s also worth noting that we’ve already agreed that the purpose of these functions is to make the most common cases easiest for users. There are already aspects of the URI that aren’t preserved: the order of query parameters, for example.

It’s always possible to do the decoding yourself if what these functions provide isn’t sufficient for your application. But we should be trying to make that as uncommon as we practically can.

What RFC 3986 says, what my implementation currently does, what fn:encode-for-uri does, and what might be best for the common case are a complete tangle at the moment.

If we can’t use fn:encode-for-uri, and I don’t think we can, then we should try to make the rules as simple as we can. In the overwhelming majority of cases, users are parsing hierarchical URIs. (I bet that http(s): and file: URIs comprise 90+% of the common cases. And 100% for many users.)

In a hierarchical URI the following characters are special:

% - introduces a percent escape
/ - divides parts of the path
? - delimits the start of query parameters
# - delimits the fragment identifier
+ - a common way to encode a single space
[ - introduces an IPv6 address
] - terminates an IPv6 address

My proposal is to revise fn:build-uri so that the only ASCII characters that are encoded are the control characters, the space character, and the seven characters listed above. Other Unicode characters are encoded per the IRI spec (by turning them into UTF-8 octets, etc.); I’m not proposing to change that.
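
To make the proposal concrete, here is a rough Python sketch of the rule applied to a single path segment. The function name is hypothetical, and the use of uppercase hex digits is my assumption (RFC 3986 recommends uppercase, but the rule above doesn’t specify it). It is an illustration of the idea, not spec text.

```python
# The seven special characters listed above.
SPECIAL = set("%/?#+[]")

def escape_segment(segment):
    out = []
    for ch in segment:
        cp = ord(ch)
        # Printable ASCII other than space and the seven specials
        # passes through untouched.
        is_plain_ascii = 0x20 < cp < 0x7F and ch not in SPECIAL
        if is_plain_ascii:
            out.append(ch)
        else:
            # Controls, space, the specials, and non-ASCII characters
            # become percent-escaped UTF-8 octets.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)

rebuilt = "/".join(escape_segment(s) for s in ["", "path/to", "some=where"])
print(rebuilt)
```

With this rule the example segments rebuild to /path%2Fto/some=where, which differs from the original only in the case of the hex digits.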

Thoughts?

                                        Be seeing you,
                                          norm

--
Norm Tovey-Walsh
Saxonica

Received on Tuesday, 19 March 2024 14:10:54 UTC