Re: [whatwg/url] Support relative URLs (#531)

@zamfofex thank you, that is a nice summary! 

I think that the most important part is not the API though, but the model of URLs underneath. 

The parser currently used in the standard simply cannot support relative URLs (not without major changes, at least). Having worked on my library, I can understand why: it was a really complicated and frustrating process to come up with something compliant that could! I'd forgive people for thinking that it cannot be done at all. 

I'll sketch part of my solution for the discussion here. 

* * * 

The _force_ operation is one key part of the solution. 
Consider the issue of repeated slashes: 

1. `http:foo/bar`
2. `http:/foo/bar`
3. `http://foo/bar`
4. `http:///foo/bar`

According to the standard, all of these 'parse' (i.e. parse-and-resolve) to the same URL when no base URL is supplied. However, when 'parsed against a base URL' they behave differently. So you cannot just use:

* special-url := [special-scheme `:`] [(`/`|`\`)* authority] [path-root] [relative-path] [`?` query] [`#` hash]

or something like that as a grammar, because then you'd fail to resolve correctly when a base URL is supplied. (I'm using square brackets for optional rules here.) So you need to start off with a classic rule that has exactly two slashes before the authority. 

My first parser phase is very simple and parses them as such:

1. (**scheme**`"http"`) (**dir**`"foo"`) (**file**`"bar"`)
2. (**scheme**`"http"`) (**path-root**`"/"`) (**dir**`"foo"`) (**file**`"bar"`)
3. (**scheme**`"http"`) (**auth-string**`"foo"`) (**path-root**`"/"`) (**file**`"bar"`)
4. (**scheme**`"http"`) (**auth-string**`""`) (**path-root**`"/"`) (**dir**`"foo"`) (**file**`"bar"`)
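
A minimal Python sketch of such a first phase might look like the following. It is an illustration only, not the library's actual code: the function name `first_phase`, the `(type, value)` tuple representation, and the decision to treat only `/` (not `\`) as a separator are assumptions made for the example.

```python
import re

# Assumed token names, taken from the list above:
# scheme, auth-string, path-root, dir, file (plus query and hash).
_SCHEME = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):")


def first_phase(text):
    """Split a URL string into (type, value) tokens without any resolution."""
    tokens = []
    match = _SCHEME.match(text)
    if match:
        tokens.append(("scheme", match.group(1).lower()))
        text = text[match.end():]

    # Split off the hash first, then the query.
    text, hash_sep, fragment = text.partition("#")
    text, query_sep, query = text.partition("?")

    # Classic rule: exactly two slashes introduce the authority string.
    if text.startswith("//"):
        auth, slash, path = text[2:].partition("/")
        tokens.append(("auth-string", auth))
        text = slash + path

    if text.startswith("/"):
        tokens.append(("path-root", "/"))
        text = text[1:]

    if text:
        *dirs, file = text.split("/")
        tokens.extend(("dir", d) for d in dirs)
        if file:
            tokens.append(("file", file))

    if query_sep:
        tokens.append(("query", query))
    if hash_sep:
        tokens.append(("hash", fragment))
    return tokens


for example in ["http:foo/bar", "http:/foo/bar", "http://foo/bar", "http:///foo/bar"]:
    print(example, first_phase(example))
```

Note that the four inputs yield four different token lists at this stage; nothing has been 'fixed up' yet.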

From there, 

* It detects drive letters via an operation on this structure, and it parses the authority from the auth-string. 
* Then the _goto_ operation is quite like the '[non-strict merge][1]' of RFC 3986. So this is nice: it is just a classic algorithm, and it is very simple. 
* Finally, _force_ solves the problem of the multiple slashes. If the (special) URL does not have an authority, or if its authority is empty, then it 'steals' an authority-string from the first non-empty dir-or-file, and it invokes the authority parser on that.  
I like this solution because it matches the standard while still respecting the RFC. The 'force' really is a force: it is only applied as an error-recovery strategy. 
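
To make the _goto_ and _force_ steps concrete, here is a rough Python sketch over a token list of `(type, value)` pairs. It is a simplified illustration under stated assumptions, not the library's code: the names `goto`, `force`, `RANK` and `SPECIAL_NON_FILE` are made up for the example, dot-segment removal, drive letters and the authority parser itself are left out, and `file` URLs are excluded from _force_ because they may legitimately have an empty host.

```python
# Assumed model: a URL is an ordered list of (type, value) tokens.
RANK = {"scheme": 0, "auth-string": 1, "path-root": 2,
        "dir": 3, "file": 4, "query": 5, "hash": 6}

# Special schemes other than "file" (assumption: file is handled separately).
SPECIAL_NON_FILE = {"http", "https", "ws", "wss", "ftp"}


def goto(base, rel):
    """Non-strict merge in the spirit of RFC 3986, section 5.2.2."""
    rel = list(rel)
    if rel and rel[0][0] == "scheme":
        base_scheme = next((v for t, v in base if t == "scheme"), None)
        if rel[0][1] != base_scheme:
            return rel          # different scheme: the base is ignored
        rel = rel[1:]           # same scheme: treat it as absent (non-strict)
    # An empty reference keeps everything from the base except the hash.
    first = RANK[rel[0][0]] if rel else RANK["hash"]
    # Keep base tokens that rank before the first token of rel;
    # the base dirs also survive when rel itself starts with a dir.
    keep = [(t, v) for t, v in base
            if RANK[t] < first or (t == "dir" and first == RANK["dir"])]
    return keep + rel


def force(tokens):
    """Steal an authority from the first non-empty dir-or-file, if needed."""
    scheme = next((v for t, v in tokens if t == "scheme"), None)
    if scheme not in SPECIAL_NON_FILE:
        return list(tokens)
    auth = next((v for t, v in tokens if t == "auth-string"), None)
    if auth:
        return list(tokens)     # a non-empty authority is already present
    index = next((i for i, (t, v) in enumerate(tokens)
                  if t in ("dir", "file") and v), None)
    if index is None:
        return list(tokens)     # nothing to steal; a real parser would fail here
    kind, stolen = tokens[index]
    result = [("scheme", scheme), ("auth-string", stolen)]
    if kind == "dir":
        result.append(("path-root", "/"))
    return result + tokens[index + 1:]


base = [("scheme", "http"), ("auth-string", "host"),
        ("path-root", "/"), ("dir", "a"), ("file", "b")]
inputs = {
    "http:foo/bar": [("scheme", "http"), ("dir", "foo"), ("file", "bar")],
    "http:/foo/bar": [("scheme", "http"), ("path-root", "/"),
                      ("dir", "foo"), ("file", "bar")],
    "http://foo/bar": [("scheme", "http"), ("auth-string", "foo"),
                       ("path-root", "/"), ("file", "bar")],
    "http:///foo/bar": [("scheme", "http"), ("auth-string", ""),
                        ("path-root", "/"), ("dir", "foo"), ("file", "bar")],
}
for text, tokens in inputs.items():
    print(text, "| no base:", force(tokens), "| with base:", force(goto(base, tokens)))
```

In this sketch, all four inputs are forced to the same token list when there is no base. Against the base `http://host/a/b`, the first two keep the base authority `host`, while the last two end up with authority `foo`, and only the last one needs _force_ to get there.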

[1]: https://tools.ietf.org/html/rfc3986#section-5.2.2


-- 
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/url/issues/531#issuecomment-702015715

Received on Thursday, 1 October 2020 09:37:09 UTC