- From: Karl <notifications@github.com>
- Date: Wed, 30 Aug 2023 03:47:49 -0700
- To: whatwg/url <url@noreply.github.com>
- Cc: Subscribed <subscribed@noreply.github.com>
- Message-ID: <whatwg/url/issues/782/1698928110@github.com>
Overall, I'm in favour of adding this and would like us to discuss and resolve any issues so it can be implemented in a consistent way by JS and other URL libraries. Programmers expect to parse a URL string on the web or in their native applications and to receive the same result, and that is why developers are creating libraries which implement the WHATWG URL standard's parser. I think developers have the same expectation when constructing a URL from a set of components, and so it is worth producing a specification for how that operation should behave and collaborating to ensure the implementation is robust and accounts for the various edge-cases.

> Sprinkle some appropriate encode steps in this and you're golden. I skipped them for brevity and because figuring out precisely where and what kind of escaping is needed is more work than I want to do for a proposal.

I think we _do_ need to think about it. Anne mentioned it in the previous issue before it was closed, so it seems like it was the blocking question:

> in particular it would be hard to ensure that the individual parts cannot affect the other individual parts

I think I can answer this, and I don't think it's actually so hard. Each component's percent-encode set already includes the delimiters of later components (e.g. the query set includes `#`, the path set is the query set plus `?`, the userinfo set is the path set plus `/`, etc). That means if we have some arbitrary string and encode it using, say, the path encode-set, it will never contain a naked `?` or `#`. Therefore, if we place that escaped string in the path's position, it will not introduce additional delimiters and affect components after the path.

---

I wonder if this API should have an option which disables additional percent-escaping and fails instead. For instance, it might be important to me that:

```javascript
URL({ ..., pathname: x }).pathname == x
```

---

> If |hostname| is present:
> Append |hostname|, [=serialized=], to |output|.
> If |port| is present, append ":" followed by |port| to |output|.

The hostname would need to go through the host parser (which depends on the scheme and may fail), and the port would need to be validated to ensure it is a number. The other components are basically opaque so there's no validation to do.

---

> If |path| is present:
> If |path| is a DOMString, set |path| to be a list containing |path|.
> For each |segment| of |path|, append "/" followed by |segment| to |output|.

The path will need to be simplified. For instance, it might contain `.` or `..` components, Windows drive letters, etc. For a string, we would need to split it into components, but this would also be the first API which exposes the path as a sequence of segments, and we would need to perform some additional escaping to ensure those segments are preserved as given. For instance:

```javascript
// If we're only given a string, "AC/DC" looks like 2 path segments.
// We have no way to tell the difference.
URL({ ..., pathname: "/bands/AC/DC" }).pathname == "/bands/AC/DC"

// But if the user tells us "AC/DC" should be 1 segment, we'd have to escape it.
URL({ ..., pathname: ["bands", "AC/DC"] }).pathname == "/bands/AC%2FDC"
```

It's not a significant problem (we'd just need to add U+002F `/` and U+005C `\` to the path encode-set), but it's worth bearing in mind. We might also choose to escape `%`, since the user is almost certainly not giving us pre-escaped path segments.

This issue hasn't come up until now because the existing parser splits the string on `/` and `\`, so it never sees path segments containing those characters. But that also means we can just add it to the path encode-set without breaking anything (...🤞)

---

As I mentioned, this would be the first part of the JS URL API to expose the path as a collection of segments. In my survey of URL APIs, I found that surprisingly few libraries expose such a view.
Of those that do, `rust-url` is notable because it implements this standard's interpretation of URLs, and its `PathSegmentsMut` [skips](https://docs.rs/url/latest/url/struct.PathSegmentsMut.html#method.extend) `.` and `..` segments rather than simplifying them. That's what I [chose to do](https://karwa.github.io/swift-url/main/documentation/weburl/weburl/pathcomponents-swift.struct/append(contentsof:)) in my own library, as well.

So it's somewhat debatable what the following should return:

```javascript
URL({ ..., pathname: ["foo", "..", "bar"] }).pathname // "/foo/bar" or "/bar"?
```

Note that we cannot escape `.` or `..` components, so "/foo/%2E%2E/bar" is not an option.

--
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/url/issues/782#issuecomment-1698928110
You are receiving this because you are subscribed to this thread.

Message ID: <whatwg/url/issues/782/1698928110@github.com>
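For illustration only, the segment-escaping idea from the comment above could be sketched roughly as follows. This is not the proposal's algorithm: `encodeSegment` and `joinSegments` are hypothetical names, and the escaped set is a simplified stand-in (controls, space, non-ASCII bytes, `#`, `?`, `/`, `\`, and optionally `%`) rather than the URL Standard's exact path percent-encode set. It also leaves the `.` / `..` question untouched.

```javascript
// Hypothetical helpers; neither exists in the URL Standard or in any library.
// Simplified stand-in for "path encode-set plus U+002F and U+005C".
const SEGMENT_DELIMITERS = new Set(["#", "?", "/", "\\"]);

// Percent-encode one path segment so it cannot introduce a new delimiter.
function encodeSegment(segment, { escapePercent = false } = {}) {
  const bytes = new TextEncoder().encode(segment); // UTF-8 bytes
  let out = "";
  for (const byte of bytes) {
    const ch = String.fromCharCode(byte);
    const mustEscape =
      byte <= 0x20 || // C0 controls and space
      byte >= 0x7f || // DEL and every non-ASCII UTF-8 byte
      SEGMENT_DELIMITERS.has(ch) ||
      (escapePercent && ch === "%");
    out += mustEscape
      ? "%" + byte.toString(16).toUpperCase().padStart(2, "0")
      : ch;
  }
  return out;
}

// Build a path string from a list of segments, one "/" per segment.
function joinSegments(segments, options) {
  return segments.map((s) => "/" + encodeSegment(s, options)).join("");
}

console.log(joinSegments(["bands", "AC/DC"])); // "/bands/AC%2FDC"
console.log(joinSegments(["a?b#c"]));          // "/a%3Fb%23c"
```

Because `?` and `#` are always escaped, an encoded segment cannot spill into the query or fragment, which is the non-interference property the comment argues for. The `escapePercent` option corresponds to the suggestion that `%` might also be escaped when callers are assumed not to pass pre-escaped segments.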
Received on Wednesday, 30 August 2023 10:47:55 UTC