- From: Alwin Blok <notifications@github.com>
- Date: Sat, 05 Jun 2021 13:16:41 -0700
- To: whatwg/url <url@noreply.github.com>
- Cc: Subscribed <subscribed@noreply.github.com>
- Message-ID: <whatwg/url/issues/479/855290347@github.com>
Following up on this:
> This also means that it will be possible to add a section to the WHATWG Standard that accurately describes the differences with the RFCs.
I have done more research, esp. around the character sets, making some tools to compute the differences. These are my findings. I will follow up with a post about other, minor grammar changes and reference resolution.
After all is said and done, the differences will be very small, which is great!
## Character Sets
### IRI vs WHATWG URL
The codepoints allowed in the components of _valid_ WHATWG URLs are _almost_ the same as in RFC3987 IRIs. There is only one difference:
- WHATWG URLs allow more non-ASCII Unicode code points in components.
Specifically, the WHATWG Standard allows the additional codepoints:
- The Private Use Areas: { u+E000-u+F8FF, u+F0000-u+FFFFD, u+100000-u+10FFFD }.
- Specials, minus the non-characters: { u+FFF0-u+FFFD }
- Tags and variation selectors, specifically, { u+E0000-u+E0FFF }.
The Private Use Areas _are_ allowed in the query part of an IRI (via the `iprivate` rule of RFC3987), though not in the other components.
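
As a rough sketch, the ranges listed above can be written down as a lookup table in Python. The names `WHATWG_EXTRA_RANGES` and `is_whatwg_only` are made up for illustration and do not come from either specification.

```python
# Code point ranges that valid WHATWG URLs allow in components in addition
# to what RFC3987 allows, as listed above. Names are illustrative only.
WHATWG_EXTRA_RANGES = [
    (0xE000, 0xF8FF),      # Private Use Area (BMP)
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B
    (0xFFF0, 0xFFFD),      # Specials, minus the non-characters
    (0xE0000, 0xE0FFF),    # Tags and variation selectors
]


def is_whatwg_only(ch: str) -> bool:
    """Return True if the single character `ch` falls in one of the extra ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in WHATWG_EXTRA_RANGES)
```

For example, `is_whatwg_only("\uFFFD")` is true, while `is_whatwg_only("\uFFFE")` is false because u+FFFE is a non-character.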
### IRI vs loose-WHATWG URL
Let me call any input that the 'basic URL parser' accepts as a single argument a 'loose-WHATWG URL'.
_Note_: The IRI grammar does not split the userinfo into a username and password, but RFC3986 (URI) suggests in section 3.2.1 that the _first_ `:` separates the username from the password, so I assume this in what follows. Note, though, that _valid_ WHATWG URLs do not allow username and password components at all.
To go from IRIs to loose WHATWG URLs, allow any non-ASCII Unicode code point in components, and a number of additional ASCII characters as well. Let's define iinvalid:
iinvalid := { u+0-u+1F, ` `, `"`, `<`, `>`, `[`, `]`, `^`, <code>`</code>, `{`, `|`, `}`, u+7F }
Then, for the components:
- **username**: add iinvalid and `@` (but remove `:`).
- **password**: add iinvalid and `@`.
- **opaque-host**: add a subset of iinvalid: { u+1-u+8, u+B-u+C, u+E-u+1F, `"`, <code>`</code>, `{`, `}`, u+7F }
- **path component**: add iinvalid.
- **query**: add iinvalid.
- **fragment**: add iinvalid and `#`.
- For non-special loose WHATWG URLs also add `\` to all the above except for opaque-host.
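
The list above can be sketched as plain sets in Python. This is only an illustration of the additions, not a parser. The names `IINVALID`, `OPAQUE_HOST_EXTRA` and `extra_chars` are hypothetical, and removing `:` from the username is a change to the base IRI set, so it is not modelled here.

```python
# iinvalid as defined above: C0 controls, space, some ASCII punctuation, DEL.
IINVALID = frozenset(map(chr, range(0x00, 0x20))) | frozenset(' "<>[]^`{|}\x7f')

# The subset of iinvalid that is added for opaque hosts.
OPAQUE_HOST_EXTRA = (
    frozenset(map(chr, range(0x01, 0x09)))    # u+1-u+8
    | frozenset(map(chr, range(0x0B, 0x0D)))  # u+B-u+C
    | frozenset(map(chr, range(0x0E, 0x20)))  # u+E-u+1F
    | frozenset('"`{}\x7f')
)


def extra_chars(component: str, special: bool = True) -> frozenset:
    """Characters to add to the IRI character set of `component`.

    `special` says whether the URL has a special scheme; for non-special
    URLs a backslash is added everywhere except in the opaque host.
    """
    if component == "opaque-host":
        return OPAQUE_HOST_EXTRA
    extra = {
        "username": IINVALID | {"@"},
        "password": IINVALID | {"@"},
        "path": IINVALID,
        "query": IINVALID,
        "fragment": IINVALID | {"#"},
    }[component]
    if not special:
        extra = extra | {"\\"}
    return extra
```

For instance, `extra_chars("path", special=False)` contains `\`, while `extra_chars("path")` does not.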
The grammar would have to be modified to allow invalid percent escape sequences: a single `%` followed by zero or one hex digits (but not two).
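
A minimal sketch of what an invalid percent escape sequence looks like, using a Python regular expression with a negative lookahead. The names `STRAY_PERCENT` and `stray_percent_positions` are made up for this example.

```python
import re
from typing import List

# A "%" that is NOT followed by two hex digits, i.e. an invalid escape.
STRAY_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def stray_percent_positions(s: str) -> List[int]:
    """Return the indices of every "%" in `s` that does not start a valid escape."""
    return [m.start() for m in STRAY_PERCENT.finditer(s)]
```

With this, `stray_percent_positions("a%20b")` is empty, whereas `"100%"` and `"%4"` each contain one stray `%`.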
Note that the WHATWG parser removes tabs and newlines { u+9, u+A, u+D } in a preprocessing pass, so you may choose to exclude those from the iinvalid set. Preprocessing also removes leading and trailing sequences of { u+0-u+20 } (aka c0-space), but it's not a good idea to try and express that in the grammar.
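
The preprocessing pass is easier to express as a separate step than in the grammar. A rough sketch in Python, assuming the order 'strip the ends first, then remove tabs and newlines'; `preprocess` and the two constants are illustrative names, not taken from the standard.

```python
# u+0-u+20: the C0 controls plus space, stripped from both ends of the input.
C0_CONTROL_OR_SPACE = "".join(map(chr, range(0x00, 0x21)))

# Translation table that deletes tab, LF and CR anywhere in the input.
TAB_OR_NEWLINE = str.maketrans("", "", "\t\n\r")


def preprocess(s: str) -> str:
    """Strip leading/trailing c0-space, then drop all tabs and newlines."""
    return s.strip(C0_CONTROL_OR_SPACE).translate(TAB_OR_NEWLINE)
```

Note that spaces inside the input survive this pass; only the leading and trailing ones are removed.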
https://github.com/whatwg/url/issues/479#issuecomment-855290347
Received on Saturday, 5 June 2021 20:17:21 UTC