Re: Incompleteness of reference resolution algorithm in RFC 3986

> On Aug 24, 2022, at 7:31 AM, Shang Ye <yesh25@mail2.sysu.edu.cn> wrote:
> 
> Hi all,
> 
> It has been noted that according to Section 5 of RFC 3986, resolving the
> relative reference `.///bar` against the absolute URI `foo:bar` (or `.//bar`
> against `foo:/bar`) results in a URI `foo://bar`, in which the resolved path
> component starts with `//` (not allowed as per RFC 3986) and effectively
> becomes an authority component. This behavior has caused issues in several
> implementations of RFC 3986 [1].

Those all seem to be speculative issues from the same reporter.

https://datatracker.ietf.org/doc/html/rfc3986/#section-1.2.3

   A relative reference (Section 4.2) refers to a resource by describing
   the difference within a hierarchical name space between the reference
   context and the target URI.  The reference resolution algorithm,
   presented in Section 5, defines how such a reference is transformed
   to the target URI.  As relative references can only be used within
   the context of a hierarchical URI, designers of new URI schemes
   should use a syntax consistent with the generic syntax's hierarchical
   components unless there are compelling reasons to forbid relative
   referencing within that scheme.

A base URI of `foo:bar` does not use a hierarchical syntax and thus
cannot be used for relative references other than same-document fragments.

This does not prevent a parser from taking that base URI and any
reference string, turning the crank on the input, and generating a
syntactically valid URI string as a result. It only means that a
preconception about what such a result is supposed to contain
is not supported by the algorithms. In other words, a result that
looks like an authority component is just as valid as any other result.
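
To make "turning the crank" concrete, here is a minimal Python sketch of
the Section 5.2 steps (parse with the Appendix B regular expression, merge,
remove dot segments, recompose). The function names are my own choices for
illustration, and the sketch does no validation or normalization beyond
what those steps describe.

```python
import re

# Component-splitting regular expression from RFC 3986, Appendix B.
_URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def parse(ref):
    """Split a reference into (scheme, authority, path, query, fragment).

    Undefined components are None; the path is always a string.
    """
    m = _URI_RE.match(ref)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


def remove_dot_segments(path):
    """Section 5.2.4, using a list of segments as the output buffer."""
    out = []
    inp = path
    while inp:
        if inp.startswith("../"):
            inp = inp[3:]
        elif inp.startswith("./"):
            inp = inp[2:]
        elif inp.startswith("/./"):
            inp = inp[2:]              # "/./" becomes "/"
        elif inp == "/.":
            inp = "/"
        elif inp.startswith("/../"):
            inp = inp[3:]              # "/../" becomes "/"
            if out:
                out.pop()
        elif inp == "/..":
            inp = "/"
            if out:
                out.pop()
        elif inp in (".", ".."):
            inp = ""
        else:
            # Move the first segment (with its leading "/", if any).
            nxt = inp.find("/", 1)
            if nxt == -1:
                out.append(inp)
                inp = ""
            else:
                out.append(inp[:nxt])
                inp = inp[nxt:]
    return "".join(out)


def merge(base_authority, base_path, ref_path):
    """Section 5.2.3."""
    if base_authority is not None and base_path == "":
        return "/" + ref_path
    # Everything up to and including the last "/" of the base path
    # (nothing at all if the base path contains no "/").
    return base_path[: base_path.rfind("/") + 1] + ref_path


def resolve(base, ref):
    """Section 5.2.2 (strict mode), then recomposition per Section 5.3."""
    b_scheme, b_auth, b_path, b_query, _ = parse(base)
    r_scheme, r_auth, r_path, r_query, r_frag = parse(ref)
    if r_scheme is not None:
        scheme, auth, query = r_scheme, r_auth, r_query
        path = remove_dot_segments(r_path)
    else:
        scheme = b_scheme
        if r_auth is not None:
            auth, query = r_auth, r_query
            path = remove_dot_segments(r_path)
        else:
            auth = b_auth
            if r_path == "":
                path = b_path
                query = r_query if r_query is not None else b_query
            else:
                query = r_query
                if r_path.startswith("/"):
                    path = remove_dot_segments(r_path)
                else:
                    path = remove_dot_segments(merge(b_auth, b_path, r_path))
    result = scheme + ":" if scheme is not None else ""
    if auth is not None:
        result += "//" + auth
    result += path
    if query is not None:
        result += "?" + query
    if r_frag is not None:
        result += "#" + r_frag
    return result


print(resolve("foo:bar", ".///bar"))   # foo://bar
print(resolve("foo:/bar", ".//bar"))   # foo://bar
print(resolve("foo:bar/", "../bar"))   # foo:/bar
```

The first two calls are the cases quoted above; the third is the `../bar`
case raised later in the same message. In each one the output is a
syntactically valid URI string, which is all the algorithm promises.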

> Prior to this report, the WHATWG URL Standard has been revised to fix a
> similar issue, by prepending `/.` to the path when necessary [2]. There was
> a recent attempt to fit the WHATWG solution into an RFC 3986 implementation,
> but without much success due to limited applicability [3].

It appears to be a limited patch to change one meaningless
result into a different meaningless result when the relative
resolution algorithm is being used with a base URI that
doesn't have a hierarchical syntax. That seems to be a
reasonable workaround to support their specific test harness,
but it's far outside the scope of the existing standard.
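
As I read the description of that patch, it amounts to a guard in the
recomposition step, roughly as sketched below. This is only an illustration
in Python: the `recompose` function and its `guard` flag are hypothetical
names of mine, not anything defined by RFC 3986 or taken from the WHATWG
text.

```python
def recompose(scheme, authority, path, guard=False):
    """Join scheme, authority, and path into a string.

    With guard=True, a path beginning with "//" that has no authority
    in front of it gets "/." prepended, so that the output string
    cannot be read back as having an authority component.
    """
    if guard and authority is None and path.startswith("//"):
        path = "/." + path
    result = scheme + ":"
    if authority is not None:
        result += "//" + authority
    return result + path


print(recompose("foo", None, "//bar"))              # foo://bar
print(recompose("foo", None, "//bar", guard=True))  # foo:/.//bar
```

Either output is a valid URI string; the guard only changes which one you
get.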

Relative resolution is not supposed to be "idempotent".
My guess is that they expect the resolved components to
round-trip into the same components when the output
reference is parsed again, which should be the case for
all valid uses of relative references.
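
The round-trip expectation can be seen by splitting the output string with
the Appendix B regular expression again. A small Python sketch, where
`split_components` is just a helper name I made up and the "before" tuple
is what the algorithm holds for `.///bar` against `foo:bar` prior to
recomposition:

```python
import re

# Component-splitting regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_components(uri):
    """Return (scheme, authority, path); authority is None if absent."""
    m = URI_RE.match(uri)
    return m.group(2), m.group(4), m.group(5)


# Components held by the resolver just before recomposition.
before = ("foo", None, "//bar")

# Recomposition: no authority, so the string is scheme ":" path.
output = before[0] + ":" + before[2]

after = split_components(output)
print(output)  # foo://bar
print(after)   # ('foo', 'bar', '')
print(before == after)  # False
```

The string is the same in both directions; only the component view of it
differs.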

It would also be fine to accept the output as defined
by the RFC, resulting in a URI that may or may not fit within
the syntax of that scheme. It is not the resolution parser's
job to enforce scheme-specific syntax. In this case, it seems
rather arbitrary to prefer making the resulting path absolute
over letting the new string contain what looks like an
authority component.

It would also be fine for a future RFC to change the algorithm
such that the base path is checked for an expected hierarchical
syntax before attempting to merge paths, and then enumerate all
of the potential ways that error can be handled, but I don't
think we could require one over the others. Any choice we make
here would result in most parsers being non-conformant,
just to support an invalid and irrelevant use case.
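
For illustration only, such a check might look like the Python sketch
below. The function name is hypothetical, and raising an exception is just
one of the possible ways to handle the error, not a proposal for which one
a future RFC should pick.

```python
def check_base_path_for_merge(base_path):
    """Reject a base path that is neither empty nor begins with "/".

    This would run before the merge step. Raising ValueError is an
    arbitrary choice made for this sketch.
    """
    if base_path == "" or base_path.startswith("/"):
        return
    raise ValueError(
        "base path %r is not hierarchical; refusing to merge" % base_path
    )


check_base_path_for_merge("/bar")   # accepted: absolute path
check_base_path_for_merge("")       # accepted: empty path
try:
    check_base_path_for_merge("bar")  # the path of `foo:bar`
except ValueError as exc:
    print(exc)
```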

> Another potential issue I found is that resolving `../bar` against `foo:bar/`
> gives `foo:/bar`, in which a root emerges out of nowhere. Not sure if this is
> a real problem, but IMHO it may be more correct for the `remove_dot_segments`
> algorithm to preserve the relativity of paths, i.e., not to output an absolute
> path when the input is relative.

If relativity of paths is desirable, then the base URI path
has to be hierarchical according to the RFC. If it isn't,
then any assumption about which result should be preferred
is equally wrong.

> I'm not much of an expert in URIs, but I wonder if it is worth an errata
> report or an update to the RFC. Any thoughts on this?
> 
> [1] = https://github.com/lo48576/iri-string/issues/8
>    ; https://github.com/sgodwincs/uriparse-rs/issues/20
>    ; https://github.com/python-hyper/rfc3986/issues/85
> [2] = https://github.com/whatwg/url/pull/505
> [3] = https://github.com/lo48576/iri-string/issues/29
> 
> Regards,
> Shang

Well, it isn't an erratum. This was an intentional result of the
standards process, specifically because a group of people did
not want relative processing to be defined for schemes that chose
not to use the hierarchical syntax reserved by "/".

Whether that's a good idea or not is a different issue.

The current RFC correctly defines the result of relative
resolution to be a string, not the set of components that
happen to be in the target before the string is output.
Hence, the RFC's output is as intended and there is no
expectation that the result can be re-parsed into the
same components.

However, we could make a choice (the next time around)
that all of the path processing is distinct from the other
components, in which case we would need to specifically
handle the case of a non-hierarchical base URI (or at least
one that doesn't have an absolute or empty [abempty]
path component) just to keep things from getting weird.
Such choices are likely to result in unexpected consequences.

Cheers,

....Roy

Received on Thursday, 25 August 2022 17:45:35 UTC