- From: Shang Ye <yesh25@mail2.sysu.edu.cn>
- Date: Sat, 27 Aug 2022 23:36:50 +0800
- To: "Roy T. Fielding" <fielding@gbiv.com>
- Cc: "uri" <uri@w3.org>
Roy,

Thanks for your detailed explanation. Some comments in line:

> On Aug 25, 2022, Roy T. Fielding <fielding@gbiv.com> wrote:
>
>> On Aug 24, 2022, at 7:31 AM, Shang Ye <yesh25@mail2.sysu.edu.cn> wrote:
>>
>> Hi all,
>>
>> It has been noted that according to Section 5 of RFC 3986, resolving the
>> relative reference `.///bar` against the absolute URI `foo:bar` (or `.//bar`
>> against `foo:/bar`) results in a URI `foo://bar`, in which the resolved path
>> component starts with `//` (not allowed as per RFC 3986) and effectively
>> becomes an authority component. This behavior has caused issues in several
>> implementations of RFC 3986 [1].
>
> Those all seem to be speculative issues from the same reporter.
>
> https://datatracker.ietf.org/doc/html/rfc3986/#section-1.2.3
>
>    A relative reference (Section 4.2) refers to a resource by describing
>    the difference within a hierarchical name space between the reference
>    context and the target URI. The reference resolution algorithm,
>    presented in Section 5, defines how such a reference is transformed
>    to the target URI. As relative references can only be used within
>    the context of a hierarchical URI, designers of new URI schemes
>    should use a syntax consistent with the generic syntax's hierarchical
>    components unless there are compelling reasons to forbid relative
>    referencing within that scheme.
>
> A base URI of `foo:bar` does not use a hierarchical syntax and thus
> cannot be used for relative references other than same-document fragments.

Sorry that I caused some confusion here, but if I have read the quoted text
correctly, it is the URI scheme, rather than a specific URI, that uses a
hierarchical syntax. In this example, the `foo` scheme may require a syntax of

    foo-URI = "foo:" path-rootless

where

    path-rootless = segment-nz *( "/" segment )

and allow relative referencing within that scheme.
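
For readers who want to reproduce the `foo://bar` result discussed above, here is a minimal Python sketch of the strict resolution algorithm as I read Sections 5.2.2 to 5.3 of RFC 3986, using the Appendix B regular expression for parsing. The function names (`parse`, `merge`, `remove_dot_segments`, `resolve`) are my own and not taken from any library, and the sketch does no validation of its inputs.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def parse(ref):
    """Split a reference into (scheme, authority, path, query, fragment).

    Undefined components are returned as None; the path is always a string.
    """
    m = URI_RE.match(ref)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


def remove_dot_segments(path):
    """Section 5.2.4, implemented step by step with an output buffer."""
    out = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]          # "/./" becomes "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]          # "/../" becomes "/"
            if out:
                out.pop()
        elif path == "/..":
            path = "/"
            if out:
                out.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to the buffer.
            end = path.find("/", 1)
            if end == -1:
                end = len(path)
            out.append(path[:end])
            path = path[end:]
    return "".join(out)


def merge(base_authority, base_path, ref_path):
    """Section 5.2.3."""
    if base_authority is not None and base_path == "":
        return "/" + ref_path
    return base_path[: base_path.rfind("/") + 1] + ref_path


def resolve(base, ref):
    """Sections 5.2.2 (strict parser) and 5.3; returns the target as a string."""
    bs, ba, bp, bq, _ = parse(base)
    rs, ra, rp, rq, rf = parse(ref)
    if rs is not None:
        ts, ta, tp, tq = rs, ra, remove_dot_segments(rp), rq
    elif ra is not None:
        ts, ta, tp, tq = bs, ra, remove_dot_segments(rp), rq
    elif rp == "":
        ts, ta, tp = bs, ba, bp
        tq = rq if rq is not None else bq
    else:
        if rp.startswith("/"):
            tp = remove_dot_segments(rp)
        else:
            tp = remove_dot_segments(merge(ba, bp, rp))
        ts, ta, tq = bs, ba, rq
    result = ""
    if ts is not None:
        result += ts + ":"
    if ta is not None:
        result += "//" + ta
    result += tp
    if tq is not None:
        result += "?" + tq
    if rf is not None:
        result += "#" + rf
    return result


print(resolve("foo:bar", ".///bar"))   # foo://bar
print(resolve("foo:/bar", ".//bar"))   # foo://bar
print(parse("foo://bar"))              # ('foo', 'bar', '', None, None)
```

The last line shows the round-trip problem: the resolved path was `//bar`, but re-parsing the output string yields an authority of `bar` and an empty path.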

However, I notice that unlike in RFC 2396, the text in RFC 3986 does not
explicitly say what a "hierarchical" URI is or whether its path can be
relative. Section 5.2.4 says

   This is done after the path is extracted from a reference, whether or
   not the path was relative, in order to remove any invalid or extraneous
   dot-segments prior to forming the target URI.

From what I understand, this indicates that the path may be relative in a
hierarchical URI. Could you please clarify a bit on this?

> This does not prevent a parser from taking that base URI and any
> reference string, turning the crank on the input, and generating a
> syntactically valid URI string as a result. It only means that a
> preconception about what such a result is supposed to contain
> is not supported by the algorithms. IOW, a result that looks like
> an authority component is just as valid as any other result.
>
> > Prior to this report, the WHATWG URL Standard has been revised to fix a
> > similar issue, by prepending `/.` to the path when necessary [2]. There was
> > a recent attempt to fit the WHATWG solution into an RFC 3986 implementation,
> > but without much success due to limited applicability [3].
>
> It appears to be a limited patch to change one meaningless
> result into a different meaningless result when the relative
> resolution algorithm is being used with a base URI that
> doesn't have a hierarchical syntax. That seems to be a
> reasonable workaround to support their specific test harness,
> but it's far outside the scope of the existing standard.
>
> Relative resolution is not supposed to be "idempotent".
> My guess is that they expect the resolved components to
> round-trip into the same components when the output
> reference is parsed again, which should be the case for
> all valid uses of relative references.
>
> It would also be fine to accept the output as defined
> by the RFC, resulting in a URI that may or may not fit within
> the syntax of that scheme.
> It is not the resolution parser's
> job to enforce scheme-specific syntax. In this case, it looks
> rather arbitrary that making the resulting path absolute
> ought to be preferred to letting the new string contain
> what looks like an authority component.
>
> It would also be fine for a future RFC to change the algorithm
> such that the base path is checked for an expected hierarchical
> syntax before attempting to merge paths, and then enumerate all
> of the potential ways that error can be handled, but I don't
> think we could require one over the others. Any choice we make
> here would result in most parsers being non-conformant,
> just to support an invalid and irrelevant use case.
>
> > Another potential issue I found is that resolving `../bar` against `foo:bar/`
> > gives `foo:/bar`, in which a root emerges out of nowhere. Not sure if this is
> > a real problem, but IMHO it may be more correct for the `remove_dot_segments`
> > algorithm to preserve the relativity of paths, i.e., not to output an absolute
> > path when the input is relative.
>
> If relativity of paths is desirable, then the base URI path
> has to be hierarchical according to the RFC. If it isn't,
> then any assumption about which should be preferred is
> equally wrong.
>
> > I'm not much of an expert in URIs, but I wonder if it is worth an errata
> > report or an update to the RFC. Any thoughts on this?
> >
> > [1] = https://github.com/lo48576/iri-string/issues/8
> >     ; https://github.com/sgodwincs/uriparse-rs/issues/20
> >     ; https://github.com/python-hyper/rfc3986/issues/85
> > [2] = https://github.com/whatwg/url/pull/505
> > [3] = https://github.com/lo48576/iri-string/issues/29
> >
> > Regards,
> > Shang
>
> Well, it isn't an errata. This was an intentional result of the
> standards process, specifically because a group of people did
> not want relative processing to be defined for schemes that chose
> not to use the hierarchical syntax reserved by "/".
>
> Whether that's a good idea or not is a different issue.
>
> The current RFC correctly defines the result of relative
> resolution to be a string, not the set of components that
> happen to be in the target before the string is output.
> Hence, the RFC's output is as intended and there is no
> expectation that the result can be re-parsed into the
> same components.
>
> However, we could make a choice (the next time around)
> that all of the path processing is distinct from the other
> components, in which case we would need to specifically
> handle the case of a non-hierarchical base URI (or at least
> one that doesn't have an absolute or empty [abempty]
> path component) just to keep things from getting weird.
> Such choices are likely to result in unexpected consequences.
>
> Cheers,
>
> ....Roy

I mostly agree with you. Yet I seem to have found a loophole in the relative
resolution algorithm, where the output string would not be a valid URI:

Resolving `.//@@` or `.//::` against `a:/b` gives `a://@@` or `a://::`. It is
not allowed for an authority component to contain more than one `@`, or to
contain more than one `:` outside the userinfo subcomponent, as per the
syntax rules.

I don't know if this means anything, but I hope it will help.

Regards,
Shang
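
To illustrate why `@@` and `::` cannot be authority components, here is a rough Python sketch that turns the `authority` rule from the RFC 3986 ABNF into a regular expression. The name `is_valid_authority` is hypothetical, and the `IP_LITERAL` pattern is a deliberate simplification (it accepts any bracketed run of letters, digits, `:` and `.` rather than the full IPv6address / IPvFuture grammar), which does not affect the two strings in question. IPv4address needs no separate branch because it is syntactically a subset of reg-name.

```python
import re

# Character sets from RFC 3986, Section 2 (written for use inside [...]).
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# userinfo = *( unreserved / pct-encoded / sub-delims / ":" )
USERINFO = rf"(?:[{UNRESERVED}{SUB_DELIMS}:]|{PCT_ENCODED})*"
# reg-name = *( unreserved / pct-encoded / sub-delims )
REG_NAME = rf"(?:[{UNRESERVED}{SUB_DELIMS}]|{PCT_ENCODED})*"
# ASSUMPTION: simplified stand-in for IP-literal, not the full grammar.
IP_LITERAL = r"\[[0-9A-Za-z:.]+\]"
HOST = rf"(?:{IP_LITERAL}|{REG_NAME})"

# authority = [ userinfo "@" ] host [ ":" port ]
AUTHORITY_RE = re.compile(rf"(?:{USERINFO}@)?{HOST}(?::[0-9]*)?")


def is_valid_authority(text):
    """Return True if the whole string matches the (simplified) authority rule."""
    return AUTHORITY_RE.fullmatch(text) is not None


print(is_valid_authority("@@"))                # False
print(is_valid_authority("::"))                # False
print(is_valid_authority("user:pw@host:8080"))  # True
```

Since neither userinfo nor reg-name may contain `@`, and reg-name may not contain `:`, the strings `a://@@` and `a://::` fail the generic syntax once the `//` is read as introducing an authority.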
Received on Saturday, 27 August 2022 15:37:42 UTC