Re: Incompleteness of reference resolution algorithm in RFC 3986

Roy,

Thanks for your detailed explanation. Some comments inline:

> On Aug 25, 2022, Roy T. Fielding <fielding@gbiv.com> wrote:
> 
> > On Aug 24, 2022, at 7:31 AM, Shang Ye <yesh25@mail2.sysu.edu.cn> wrote:
> > 
> > Hi all,
> > 
> > It has been noted that according to Section 5 of RFC 3986, resolving the
> > relative reference `.///bar` against the absolute URI `foo:bar` (or `.//bar`
> > against `foo:/bar`) results in a URI `foo://bar`, in which the resolved path
> > component starts with `//` (not allowed as per RFC 3986) and effectively
> > becomes an authority component. This behavior has caused issues in several
> > implementations of RFC 3986 [1].
>
> Those all seem to be speculative issues from the same reporter.
>
> https://datatracker.ietf.org/doc/html/rfc3986/#section-1.2.3
>
>   A relative reference (Section 4.2) refers to a resource by describing
>   the difference within a hierarchical name space between the reference
>   context and the target URI.  The reference resolution algorithm,
>   presented in Section 5, defines how such a reference is transformed
>   to the target URI.  As relative references can only be used within
>   the context of a hierarchical URI, designers of new URI schemes
>   should use a syntax consistent with the generic syntax's hierarchical
>   components unless there are compelling reasons to forbid relative
>   referencing within that scheme.
>
> A base URI of `foo:bar` does not use a hierarchical syntax and thus
> cannot be used for relative references other than same-document fragments.

Sorry for causing some confusion here, but if I have read the quoted text
correctly, it is the URI scheme, rather than a specific URI, that uses a
hierarchical syntax. In this example, the `foo` scheme may require a syntax of

    foo-URI = "foo:" path-rootless

where

    path-rootless = segment-nz *( "/" segment )

and allow relative referencing within that scheme.

However, I notice that, unlike RFC 2396, RFC 3986 does not explicitly say
what a "hierarchical" URI is or whether its path can be relative.
Section 5.2.4 says

   This is done after the path is
   extracted from a reference, whether or not the path was relative, in
   order to remove any invalid or extraneous dot-segments prior to
   forming the target URI.

As I understand it, this indicates that the path of a hierarchical URI may be
relative. Could you please clarify this?

> This does not prevent a parser from taking that base URI and any
> reference string, turning the crank on the input, and generating a
> syntactically valid URI string as a result. It only means that a
> preconception about what such a result is supposed to contain
> is not supported by the algorithms. IOW, a result that looks like
> an authority component is just as valid as any other result.
> 
> > Prior to this report, the WHATWG URL Standard has been revised to fix a
> > similar issue, by prepending `/.` to the path when necessary [2]. There was
> > a recent attempt to fit the WHATWG solution into an RFC 3986 implementation,
> > but without much success due to limited applicability [3].
> 
> It appears to be a limited patch to change one meaningless
> result into a different meaningless result when the relative
> resolution algorithm is being used with a base URI that
> doesn't have a hierarchical syntax. That seems to be a
> reasonable workaround to support their specific test harness,
> but it's far outside the scope of the existing standard.
> 
> Relative resolution is not supposed to be "idempotent".
> My guess is that they expect the resolved components to
> round-trip into the same components when the output
> reference is parsed again, which should be the case for
> all valid uses of relative references.
> 
> It would also be fine to accept the output as defined
> by the RFC, resulting in a URI that may or may not fit within
> the syntax of that scheme. It is not the resolution parser's
> job to enforce scheme-specific syntax. In this case, it looks
> rather arbitrary that making the resulting path absolute
> ought to be preferred to letting the new string contain
> what looks like an authority component.
> 
> It would also be fine for a future RFC to change the algorithm
> such that the base path is checked for an expected hierarchical
> syntax before attempting to merge paths, and then enumerate all
> of the potential ways that error can be handled, but I don't
> think we could require one over the others. Any choice we make
> here would result in most parsers being non-conformant,
> just to support an invalid and irrelevant use case.
> 
> > Another potential issue I found is that resolving `../bar` against `foo:bar/`
> > gives `foo:/bar`, in which a root emerges out of nowhere. Not sure if this is
> > a real problem, but IMHO it may be more correct for the `remove_dot_segments`
> > algorithm to preserve the relativity of paths, i.e., not to output an absolute
> > path when the input is relative.
> 
> If relativity of paths is desirable, then the base URI path
> has to be hierarchical according to the RFC. If it isn't,
> then any assumption about which should be preferred is
> equally wrong.
> 
> > I'm not much of an expert in URIs, but I wonder if it is worth an errata
> > report or an update to the RFC. Any thoughts on this?
> > 
> > [1] = https://github.com/lo48576/iri-string/issues/8
> >    ; https://github.com/sgodwincs/uriparse-rs/issues/20
> >    ; https://github.com/python-hyper/rfc3986/issues/85
> > [2] = https://github.com/whatwg/url/pull/505
> > [3] = https://github.com/lo48576/iri-string/issues/29
> > 
> > Regards,
> > Shang
> 
> Well, it isn't an errata. This was an intentional result of the
> standards process, specifically because a group of people did
> not want relative processing to be defined for schemes that chose
> not to use the hierarchical syntax reserved by "/".
> 
> Whether that's a good idea or not is a different issue.
> 
> The current RFC correctly defines the result of relative
> resolution to be a string, not the set of components that
> happen to be in the target before the string is output.
> Hence, the RFC's output is as intended and there is no
> expectation that the result can be re-parsed into the
> same components.
> 
> However, we could make a choice (the next time around)
> that all of the path processing is distinct from the other
> components, in which case we would need to specifically
> handle the case of a non-hierarchical base URI (or at least
> one that doesn't have an absolute or empty [abempty]
> path component) just to keep things from getting weird.
> Such choices are likely to result in unexpected consequences.
> 
> Cheers,
> 
> ....Roy

I mostly agree with you. Yet I seem to have found a loophole in the relative
resolution algorithm, where the output string is not a valid URI:

Resolving `.//@@` or `.//::` against `a:/b` gives `a://@@` or `a://::`. The
syntax rules do not allow an authority component to contain more than one `@`,
or more than one `:` outside the userinfo subcomponent or a bracketed IP
literal.
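
To make these cases easy to reproduce, here is a minimal Python sketch of the
resolution steps as I read them in Sections 5.2.2 to 5.2.4 and 5.3, using the
regular expression from Appendix B for parsing. The function names (`split`,
`remove_dot_segments`, `resolve`) are my own and do not come from any library,
and the sketch does no validation of its input, so please treat it as an
illustration rather than a reference implementation.

```python
import re

# Component-splitting regular expression from Appendix B of RFC 3986.
_URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split(ref):
    """Return (scheme, authority, path, query, fragment); None means undefined."""
    m = _URI_RE.match(ref)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


def remove_dot_segments(path):
    """Steps 2A to 2E of Section 5.2.4, on plain string buffers."""
    inp, out = path, ""
    while inp:
        if inp.startswith("../"):            # 2A
            inp = inp[3:]
        elif inp.startswith("./"):           # 2A
            inp = inp[2:]
        elif inp.startswith("/./"):          # 2B
            inp = inp[2:]
        elif inp == "/.":                    # 2B
            inp = "/"
        elif inp.startswith("/../") or inp == "/..":   # 2C
            inp = "/" + inp[4:]
            # Drop the last output segment and its preceding "/", if any.
            out = out[:out.rfind("/")] if "/" in out else ""
        elif inp in (".", ".."):             # 2D
            inp = ""
        else:                                # 2E: move one segment to output
            end = inp.find("/", 1)
            if end == -1:
                end = len(inp)
            out, inp = out + inp[:end], inp[end:]
    return out


def resolve(base, ref):
    """Transform a reference against a base URI (Section 5.2.2), then recompose."""
    bs, ba, bp, bq, _ = split(base)
    rs, ra, rp, rq, rf = split(ref)
    if rs is not None:
        ts, ta, tp, tq = rs, ra, remove_dot_segments(rp), rq
    elif ra is not None:
        ts, ta, tp, tq = bs, ra, remove_dot_segments(rp), rq
    elif rp == "":
        ts, ta, tp, tq = bs, ba, bp, (rq if rq is not None else bq)
    else:
        if rp.startswith("/"):
            merged = rp
        elif ba is not None and bp == "":    # merge, Section 5.2.3
            merged = "/" + rp
        else:
            merged = bp[:bp.rfind("/") + 1] + rp
        ts, ta, tp, tq = bs, ba, remove_dot_segments(merged), rq
    # Recomposition, Section 5.3.
    result = "" if ts is None else ts + ":"
    if ta is not None:
        result += "//" + ta
    result += tp
    if tq is not None:
        result += "?" + tq
    if rf is not None:
        result += "#" + rf
    return result


print(resolve("foo:bar", ".///bar"))   # foo://bar
print(resolve("foo:bar/", "../bar"))   # foo:/bar
print(resolve("a:/b", ".//@@"))        # a://@@
print(split(resolve("a:/b", ".//@@"))) # ('a', '@@', '', None, None)
```

The last line re-parses the output: what was a path of `//@@` in the target
components comes back as an authority of `@@` with an empty path.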

I don't know if this means anything, but I hope it will help.

Regards,
Shang

Received on Saturday, 27 August 2022 15:37:42 UTC