Re: Thinking about fn:build-uri (PR #1388)

While there is no doubt about the need for a URL API, so far I do not see
a match to existing URLs, which is expected from mature frameworks or
languages with a complementary framework.

Norm, if we start from business requirements, which would materialize as
acceptance criteria, your doubts would vanish.

An analysis of existing approaches in other frameworks, both inside and
outside the XML world, should be the starting point of a systematic approach.

What are URLs/URIs used for?
* *For comparison, to figure out "uniqueness".* By definition it is a
*unique resource locator/identifier*: RFC 1738 (URL)
<https://datatracker.ietf.org/doc/html/rfc1738>, RFC 3986 (URI)
<https://datatracker.ietf.org/doc/html/rfc3986>. The best explainer so far
of URL composition is by Mozilla
<https://developer.mozilla.org/en-US/docs/Learn/Common_questions/Web_mechanics/What_is_a_URL>.

* *For resolving*. The standards define how encoded and partial variations
map to a "final" form, which is the subject of uniqueness/distinctiveness.
* *For denomination of URL/URI parts*. Whether it is a *scheme*, *domain*,
*path*, *query*, or *hash*, each (!!!) needs the capability to

   - get it from the URL
   - split it into parts, not just as an ordered collection but also with
   semantic tagging

* *For working with URL parts*

   - CRUD operations on a part and on a set of parts (i.e. subset
   substitution)
   - during the operations above, apply en- and decoding

The business requirements above would lay a base for the AC of a *generic
API* (the whole URL and all URL parts, from domain to hash).
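
As a rough illustration of those requirements (not a proposal for the
actual function signatures), here is a minimal Python sketch using only the
standard urllib.parse module. The URL, the variable names, and the choice to
model the query as a plain dict (which drops repeated keys) are assumptions
made for the example.

```python
from urllib.parse import parse_qsl, quote, unquote, urlencode, urlsplit, urlunsplit

# Hypothetical URL used only for illustration.
url = "https://example.org/a/b%2Fc?x=1&y=two#frag"

# Get parts from the URL with semantic tagging (a named tuple).
parts = urlsplit(url)
scheme, domain, fragment = parts.scheme, parts.netloc, parts.fragment

# Split a part into sub-parts; decode only after splitting on "/".
decoded_segments = [unquote(s) for s in parts.path.split("/")]

# Tag query parameters by name (a dict loses repeated keys).
query = dict(parse_qsl(parts.query))

# Update a subset of parts, encoding again on the way out.
new_segments = decoded_segments[:-1] + ["d e"]
query["y"] = "a&b"
rebuilt = urlunsplit(parts._replace(
    path="/".join(quote(s, safe="") for s in new_segments),
    query=urlencode(query),
))
```

Here decoded_segments keeps "b/c" as a single segment, and the rebuilt URL
carries the substituted segment and parameter in encoded form.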

The industry has developed some of that. None are perfect or even complete.
URL in JavaScript <https://developer.mozilla.org/en-US/docs/Web/API/URL>, URL
in Java <https://docs.oracle.com/javase/8/docs/api/java/net/URL.html>, in
Python, etc. lack both a complete set of functionality and a coherent API
for treating URL parts.

I hope the XML community is worthy of thoughtful API design and
implementation.
-s




On Mon, Sep 2, 2024 at 8:36 AM Norm Tovey-Walsh <norm@saxonica.com> wrote:

> Hello,
>
> I currently have an open PR that attempts to address deficiencies
> identified in the behavior of fn:build-uri. It’s complicated. Christian has
> pushed back gently suggesting that it’s perhaps too complicated. I’m
> sympathetic.
>
> But URIs *are* complicated. And the escaping rules are especially
> complicated because what users would like ideally is going to depend on
> many things, only some of which are in our control.
>
> Let’s look at a simple use case. Suppose we have path-segments like this:
>
>   path-segments: { "", "a", "b/c" }
>
> If we make a path out of that, we must not create “/a/b/c”. The “b/c”
> segment has been %-decoded as a convenience for the user (assuming it came
> originally from an existing URI by way of fn:parse-uri).
>
> We have to escape “/” to “%2f”: “/a/b%2fc”. Okay.
>
> We can’t use fn:encode-for-uri because it’s much too aggressive. There are
> many characters that users would expect to have unescaped in the URI (“,”,
> “@”, “$”, “(“, “)”, to name just a few). Applications will break if we
> arbitrarily escape those characters.
>
> If you parse a URI and then rebuild it, users are going to expect it to
> “be the same”. We can’t guarantee that because “/a/%62%2fc” will become {
> "", "a", "b/c" } and we can’t know that the “b” was percent encoded. The
> encoding is not idempotent or generally reversible.
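
The round-trip problem quoted above can be reproduced with a minimal Python
sketch. The helpers parse_path and build_path are hypothetical names, not
the fn:parse-uri / fn:build-uri algorithms; they only decode and re-encode
path segments with urllib.parse.

```python
from urllib.parse import quote, unquote

def parse_path(path):
    """Split on "/" first, then percent-decode each segment."""
    return [unquote(seg) for seg in path.split("/")]

def build_path(segments):
    """Percent-encode each segment (including "/"), then join with "/"."""
    return "/".join(quote(seg, safe="") for seg in segments)

original = "/a/%62%2fc"
segments = parse_path(original)
rebuilt = build_path(segments)

# The string differs ("%62" is lost, hex digits change case),
# but the segment structure survives a second parse.
same_string = rebuilt == original
same_structure = parse_path(rebuilt) == segments
```

The rebuilt string is not the original, yet parsing it again yields the same
three segments, which is the "structurally the same" guarantee.
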
>
> We might adopt a generous definition of “be the same”: that the resulting
> URI will not be structurally different from the original. In order to make
> that guarantee, there are some characters we must escape:
>
> “%” because a % introduces a %-encoded pair
> “/” because a / is a path separator in URIs
> “?” because a ? delimits the query
> “#” because a # delimits the fragment identifier
> “+” because some URI decoders will treat “+” as a shortcut for %20
>
> plus a few others (space, “[”, “]”, …) because they’re not generally
> allowed in URIs. And if we were being really consistent, we’d want to
> escape the path-separator character as well, probably.
>
> So far, so good.
>
> Unfortunately, we have to apply escaping rules to query parameters and
> fragment identifiers as well.
>
> In query parameters, we need to escape “=” and “&” because they are used
> to delimit values. And if we’re really going to be consistent, we should
> probably not escape “&” literally, but instead whatever the user selects as
> the query-separator-character. But we don’t need to escape “/”, because
> we’ve left the path part of the URI behind or “?” because we’re already in
> the query.
>
> This matters because someone might have used a filename in a query
> parameter:
>
>   query-parameters: { "fn": "/home/ndw/config.xml" }
>
> And it’s possible the application will misinterpret:
>
>   https:// … ?fn=%2fhome%2fndw%2fconfig.xml
>
> if it was expecting
>
>   https:// … ?fn=/home/ndw/config.xml
>
> Finally, in the fragment identifier, we don’t need to escape “=” and “&”
> (the query separator), but we also don’t need to escape “/” and “?”.
>
> This matters because
>
>   scheme:// … #test/this
>
> might match a fragment in a document where this doesn’t:
>
>   scheme:// … #test%2fthis
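
The component-specific rules discussed in the quoted text above could be
sketched in Python as follows. The three "safe" sets are one reading of the
message, not text from the PR, and the function names are hypothetical.
Note that urllib.parse.quote also escapes non-ASCII and other disallowed
characters, and emits uppercase hex digits.

```python
from urllib.parse import quote

# Assumed sets of characters left unescaped in each component.
PATH_SAFE = "!$&'()*,;=:@"          # "/", "?", "#", "%", "+" are escaped
QUERY_SAFE = "!$'()*,;:@/?"         # "=", "&", "#", "%", "+" are escaped
FRAGMENT_SAFE = "!$&'()*,;=:@/?"    # "#", "%", "+" are escaped

def build_path(segments):
    return "/".join(quote(s, safe=PATH_SAFE) for s in segments)

def build_query(params, sep="&"):
    # Escape whatever separator the caller selected, not "&" literally.
    safe = QUERY_SAFE.replace(sep, "")
    return sep.join(
        f"{quote(k, safe=safe)}={quote(v, safe=safe)}"
        for k, v in params.items()
    )

def build_fragment(fragment):
    return quote(fragment, safe=FRAGMENT_SAFE)

path = build_path(["", "a", "b/c"])
query = build_query({"fn": "/home/ndw/config.xml"})
fragment = build_fragment("test/this")
```

With these sets, the path keeps "b/c" as one segment, while the filename in
the query and the "/" in the fragment are left unescaped.
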
>
> A few observations:
>
> 1. My goodness this is fiddly. It’s fiddly to specify, fiddly to test, and
> probably fiddly to understand.
>
> 2. It will never be perfect. Whatever rules we adopt, it will be possible
> for someone to write, or need to use, a service that has different rules.
>
> 3. For most users, most of the time, none of these rules apply. Most URIs
> don’t have parameters or fragment identifiers. 99.99+% of the ones that do
> have simple key/value pairs consisting of keys with alphanumeric names and
> values that are numbers or strings that don’t need to distinguish between
> “/” and “%2f”. 99.999+% of fragment identifiers are just alphanumeric
> strings (they have to be NCNames in XML).
>
> I’m inclined to say that we almost have this right and we should try to
> finish it up. But I’ve said that before and been wrong. I also think that
> Christian may be right. Righter than me.
>
> As I look at this gordian knot of fiddly rules and special cases, I am
> tempted to reach for a sword.
>
> Specifically this one: all control characters (including space) are
> %-encoded, and all URI reserved characters are %-encoded, always. And
> nothing else.
>
> As a reminder, the URI reserved characters are:
>
>       reserved    = gen-delims / sub-delims
>
>       gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
>
>       sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
>                   / "*" / "+" / "," / ";" / "="
>
> The downside of this approach is that it will not always produce the
> answer most likely to work correctly (as I understand the problem space). A
> “,” or “(“ or “@” in a path segment will be encoded. A “/” or “#” in a
> query or fragment identifier will be encoded.
>
> Given that I already said it will *never* be perfect, this just produces a
> perhaps slightly different set of circumstances in which the answer isn’t
> perfect. The user isn’t stuck, they can write code to do the construction
> themselves.
>
> The upside of this approach is that it’s a smaller, simpler set of rules
> and the rules are drawn directly from RFC 3986, we didn’t invent any of it.
>
> Should I toss my PR aside and apply these rules instead?
>
>                                         Be seeing you,
>                                           norm
>
> --
> Norm Tovey-Walsh
> Saxonica
>

Received on Monday, 2 September 2024 16:29:36 UTC