Thinking about fn:build-uri (PR #1388) from Norm Tovey-Walsh on 2024-09-02 (public-xslt-40@w3.org from September 2024)

From: Norm Tovey-Walsh <norm@saxonica.com>
Date: Mon, 02 Sep 2024 16:35:59 +0100
To: "public-xslt-40@w3.org" <public-xslt-40@w3.org>
Message-ID: <m2cylmjcf4.fsf@saxonica.com>

Hello,

I currently have an open PR that attempts to address deficiencies identified in the behavior of fn:build-uri. It’s complicated. Christian has pushed back gently, suggesting that it’s perhaps too complicated. I’m sympathetic.

But URIs *are* complicated. And the escaping rules are especially complicated because what users would like ideally is going to depend on many things, only some of which are in our control.

Let’s look at a simple use case. Suppose we have path-segments like this:

path-segments: { "", "a", "b/c" }

If we make a path out of that, we must not create “/a/b/c”. The “b/c” segment has been %-decoded as a convenience for the user (assuming it came originally from an existing URI by way of fn:parse-uri).

We have to escape “/” to “%2f”: “/a/b%2fc”. Okay.

We can’t use fn:encode-for-uri because it’s much too aggressive. There are many characters that users would expect to have unescaped in the URI (“,”, “@”, “$”, “(”, “)”, to name just a few). Applications will break if we arbitrarily escape those characters.
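To make the contrast concrete, here is a minimal Python sketch. It is not fn:build-uri or fn:encode-for-uri: urllib.parse.quote with an empty safe set stands in for an aggressive encoder, and the SEGMENT_SAFE list and build_path helper are hypothetical names chosen purely for illustration. Note that quote emits uppercase hex digits (%2F), which is equivalent to %2f.

```python
from urllib.parse import quote

# Assumption for illustration: characters we choose to leave alone in a segment.
# "/" is deliberately absent, so it is always %-encoded.
SEGMENT_SAFE = ",@$()"

def build_path(segments):
    """Join already-decoded path segments, escaping "/" inside each one."""
    return "/".join(quote(seg, safe=SEGMENT_SAFE) for seg in segments)

segment = "report,(v2)@home$"

# Aggressive: everything outside letters, digits and "-._~" is escaped.
aggressive = quote(segment, safe="")
# Selective: the characters users expect to survive are left unescaped.
selective = quote(segment, safe=SEGMENT_SAFE)

print(aggressive)                      # report%2C%28v2%29%40home%24
print(selective)                       # report,(v2)@home$
print(build_path(["", "a", "b/c"]))    # /a/b%2Fc
```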

If you parse a URI and then rebuild it, users are going to expect it to “be the same”. We can’t guarantee that because “/a/%62%2fc” will become { "", "a", "b/c" } and we can’t know that the “b” was percent encoded. The encoding is not idempotent or generally reversible.
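The loss of information can be sketched in a few lines of Python. The parse_segments and build_path helpers are hypothetical stand-ins, not the behavior of fn:parse-uri or fn:build-uri, and escaping everything outside the unreserved set is just the simplest choice for the demonstration.

```python
from urllib.parse import quote, unquote

def parse_segments(path):
    # Split on "/" first, then %-decode each segment (so %2f stays inside a segment).
    return [unquote(seg) for seg in path.split("/")]

def build_path(segments):
    # Re-encode each segment; "/" inside a segment becomes %2F.
    return "/".join(quote(seg, safe="") for seg in segments)

original = "/a/%62%2fc"
segments = parse_segments(original)
rebuilt = build_path(segments)

print(segments)  # ['', 'a', 'b/c']
print(rebuilt)   # /a/b%2Fc
```

The rebuilt string differs from the original, because nothing records that “b” was once %62, but parsing it again yields the same segments.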

We might adopt a generous definition of “be the same”: that the resulting URI will not be structurally different from the original. In order to make that guarantee, there are some characters we must escape:

“%” because a % introduces a %-encoded pair
“/” because a / is a path separator in URIs
“?” because a ? delimits the query
“#” because a # delimits the fragment identifier
“+” because some URI decoders will treat “+” as a shortcut for %20

plus a few others (space, “[”, “]”, …) because they’re not generally allowed in URIs. And if we were being really consistent, we’d want to escape the path-separator character as well, probably.

So far, so good.

Unfortunately, we have to apply escaping rules to query parameters and fragment identifiers as well.

In query parameters, we need to escape “=” and “&” because they are used to delimit values. And if we’re really going to be consistent, we should probably not escape “&” literally, but instead whatever the user selects as the query-separator-character. But we don’t need to escape “/”, because we’ve left the path part of the URI behind, or “?”, because we’re already in the query.

This matters because someone might have used a filename in a query parameter:

query-parameters: { "fn": "/home/ndw/config.xml" }

And it’s possible the application will misinterpret:

https:// … ?fn=%2fhome%2fndw%2fconfig.xml

if it was expecting

https:// … ?fn=/home/ndw/config.xml
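A rough Python sketch of query escaping along these lines follows. The build_query helper, its separator parameter, and the QUERY_SAFE list are all assumptions made for the example; they are not taken from the PR.

```python
from urllib.parse import quote

# Assumption: characters left unescaped in query keys and values in this sketch.
# "=" is absent (always escaped); the separator is removed below.
QUERY_SAFE = "/?,@$();&"

def build_query(params, separator="&"):
    """Serialize key/value pairs, escaping "=" and the chosen separator."""
    safe = QUERY_SAFE.replace(separator, "")
    return separator.join(
        quote(key, safe=safe) + "=" + quote(value, safe=safe)
        for key, value in params.items()
    )

print(build_query({"fn": "/home/ndw/config.xml"}))   # fn=/home/ndw/config.xml
print(build_query({"a": "x&y=z"}))                   # a=x%26y%3Dz
print(build_query({"a": "x;y&z"}, separator=";"))    # a=x%3By&z
```

With “;” as the separator, “&” is no longer a delimiter, so it is left alone and “;” is escaped instead.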

Finally, in the fragment identifier, we don’t need to escape “=” and “&” (the query separator), and we also don’t need to escape “/” and “?”.

This matters because

scheme:// … #test/this

might match a fragment in a document where this doesn’t:

scheme:// … #test%2fthis
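Putting the three contexts side by side, the same string ends up escaped three different ways. This is only a sketch: the SAFE table and the escape helper are hypothetical, and the exact character lists are assumptions rather than what the PR specifies.

```python
from urllib.parse import quote

# Assumption: per-component lists of characters left unescaped.
SAFE = {
    "path": ",@$()",          # "/" and "?" must be escaped
    "query": "/?,@$()",       # "/" and "?" are harmless, "=" and "&" are not
    "fragment": "/?=&,@$()",  # none of "/", "?", "=", "&" need escaping
}

def escape(text, component):
    """Escape text according to the URI component it will be placed in."""
    return quote(text, safe=SAFE[component])

sample = "test/this?x=1"
for component in ("path", "query", "fragment"):
    print(component, escape(sample, component))
# path test%2Fthis%3Fx%3D1
# query test/this?x%3D1
# fragment test/this?x=1
```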

A few observations:

1. My goodness this is fiddly. It’s fiddly to specify, fiddly to test, and probably fiddly to understand.

2. It will never be perfect. Whatever rules we adopt, it will be possible for someone to write, or need to use, a service that has different rules.

3. For most users, most of the time, none of these rules apply. Most URIs don’t have parameters or fragment identifiers. 99.99+% of the ones that do have simple key/value pairs consisting of keys with alphanumeric names and values that are numbers or strings that don’t need to distinguish between “/” and “%2f”. 99.999+% of fragment identifiers are just alphanumeric strings (they have to be NCNames in XML).

I’m inclined to say that we almost have this right and we should try to finish it up. But I’ve said that before and been wrong. I also think that Christian may be right. Righter than me.

As I look at this gordian knot of fiddly rules and special cases, I am tempted to reach for a sword.

Specifically this one: all control characters (including space) are %-encoded, and all URI reserved characters are %-encoded, always. And nothing else.

As a reminder, the URI reserved characters are:

reserved = gen-delims / sub-delims

gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
           / "*" / "+" / "," / ";" / "="

The downside of this approach is that it will not always produce the answer most likely to work correctly (as I understand the problem space). A “,” or “(” or “@” in a path segment will be encoded. A “/” or “#” in a query or fragment identifier will be encoded.
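For comparison, the simpler rule fits in a dozen lines of Python. This is a sketch of a literal reading of the rule, with assumptions: “control characters” is taken to mean code points up to and including space plus DEL, and everything else (including “%” and non-ASCII characters) is passed through untouched. The sword_escape name is made up for the example.

```python
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
RESERVED = set(GEN_DELIMS + SUB_DELIMS)

def is_control(ch):
    # Assumption: C0 controls, space, and DEL count as "control characters".
    return ord(ch) <= 0x20 or ord(ch) == 0x7F

def sword_escape(text):
    """%-encode control and reserved characters; leave everything else as-is."""
    out = []
    for ch in text:
        if ch in RESERVED or is_control(ch):
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)

print(sword_escape("b/c"))                   # b%2Fc
print(sword_escape("a,(b)@c"))               # a%2C%28b%29%40c
print(sword_escape("/home/ndw/config.xml"))  # %2Fhome%2Fndw%2Fconfig.xml
print(sword_escape("hello world"))           # hello%20world
```

The same function applies to path segments, query parameters, and fragment identifiers, which is the whole appeal, and also why the filename-in-a-query case comes out encoded.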

Given that I already said it will *never* be perfect, this just produces a perhaps slightly different set of circumstances in which the answer isn’t perfect. The user isn’t stuck; they can write code to do the construction themselves.

The upside of this approach is that it’s a smaller, simpler set of rules, and the rules are drawn directly from RFC 3986; we didn’t invent any of it.

Should I toss my PR aside and apply these rules instead?

Be seeing you,
norm

--
Norm Tovey-Walsh
Saxonica

Received on Monday, 2 September 2024 15:36:07 UTC