- From: Alwin Blok <notifications@github.com>
- Date: Sat, 05 Jun 2021 13:16:41 -0700
- To: whatwg/url <url@noreply.github.com>
- Cc: Subscribed <subscribed@noreply.github.com>
- Message-ID: <whatwg/url/issues/479/855290347@github.com>
Following up on this:

> This also means that it will be possible to add a section to the WHATWG Standard that accurately describes the differences with the RFCs.

I have done more research, esp. around the character sets, making some tools to compute the differences. These are my findings. I will follow up with a post about other, minor grammar changes and reference resolution.

The differences will be very small, after all is said and done. Which is great!

## Character Sets

### IRI vs WHATWG URL

The code points allowed in the components of _valid_ WHATWG URLs are _almost_ the same as in RFC3987 IRIs. There is only one difference:

- WHATWG URLs allow more non-ASCII Unicode code points in components.

Specifically, the WHATWG Standard allows the additional code points:

- The Private Use Areas: { u+E000-u+F8FF, u+F0000-u+FFFFD, u+100000-u+10FFFD }.
- Specials, minus the non-characters: { u+FFF0-u+FFFD }.
- Tags and variation selectors, specifically, { u+E0000-u+E0FFF }.

Specials _are_ allowed in the query part of an IRI, not in the other components though.

### IRI vs loose-WHATWG URL

Let me call any input that the 'basic url parser' accepts as a single argument a 'loose-WHATWG URL'.

_Note_: The IRI grammar does not split the userinfo into a username and password, but RFC3986 (URI) suggests in 3.2.1. that the _first_ `:` separates the username from the password. So I assume this in what follows. Note though that _valid_ WHATWG URLs do not allow username and password components at all.

To go from IRIs to loose WHATWG URLs, allow any non-ASCII Unicode code point in components, and a number of additional ASCII characters as well. Let's define iinvalid:

iinvalid := { u+0-u+1F, ` `, `"`, `<`, `>`, `[`, `]`, `^`, <code>`</code>, `{`, `|`, `}`, u+7F }

Then, for the components:

- **username**: add iinvalid and `@` (but remove `:`).
- **password**: add iinvalid and `@`.
- **opaque-host**: add a subset of iinvalid: { u+1-u+8, u+B-u+C, u+E-u+1F, `"`, <code>`</code>, `{`, `}`, u+7F }.
- **path component**: add iinvalid.
- **query**: add iinvalid.
- **fragment**: add iinvalid and `#`.
- For non-special loose WHATWG URLs, also add `\` to all the above except for opaque-host.

The grammar would have to be modified to allow invalid percent escape sequences: a single `%` followed by zero or one hexdigits (but not two).

Note that the WHATWG parser removes tabs and newlines { u+9, u+A, u+D } in a preprocessing pass, so you may choose to exclude those from the iinvalid set. Preprocessing also removes leading and trailing sequences of { u+0-u+20 } (aka c0-space), but it's not a good idea to try and express that in the grammar.
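As a rough illustration, the additional ASCII characters per component and the invalid percent escapes described above could be sketched in Python as follows. This is only a sketch under assumptions: the names `IINVALID`, `EXTRA_ASCII`, `extra_ascii` and `invalid_percent_escapes` are made up for this example and do not come from the WHATWG Standard or the RFCs. It models only the _additions_; the removal of `:` from username is not modelled.

```python
import re

def cp_range(lo, hi):
    """Return the set of characters with code points lo..hi (inclusive)."""
    return {chr(c) for c in range(lo, hi + 1)}

# iinvalid := { u+0-u+1F, space, " < > [ ] ^ ` { | }, u+7F }
IINVALID = cp_range(0x00, 0x1F) | set(' "<>[]^`{|}') | {"\x7f"}

# The subset of iinvalid that is added for opaque hosts.
OPAQUE_HOST_EXTRA = (
    cp_range(0x01, 0x08) | cp_range(0x0B, 0x0C) | cp_range(0x0E, 0x1F)
    | set('"`{}') | {"\x7f"}
)

# Additional ASCII characters per component (hypothetical table layout).
EXTRA_ASCII = {
    "username": IINVALID | {"@"},
    "password": IINVALID | {"@"},
    "opaque-host": OPAQUE_HOST_EXTRA,
    "path": IINVALID,
    "query": IINVALID,
    "fragment": IINVALID | {"#"},
}

# Tabs and newlines are removed by the parser's preprocessing pass.
TAB_OR_NEWLINE = {"\t", "\n", "\r"}

def extra_ascii(component, special=True, keep_tab_newline=True):
    """Extra ASCII characters to allow in `component` of a loose URL."""
    extra = set(EXTRA_ASCII[component])
    if not special and component != "opaque-host":
        extra.add("\\")  # non-special URLs also allow a backslash
    if not keep_tab_newline:
        extra -= TAB_OR_NEWLINE
    return extra

# A '%' that is NOT followed by two hex digits.
_INVALID_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")

def invalid_percent_escapes(text):
    """Return the offsets of every invalid percent escape in `text`."""
    return [m.start() for m in _INVALID_PERCENT.finditer(text)]
```

For example, `extra_ascii("path", special=False)` contains `\`, while `extra_ascii("opaque-host", special=False)` does not, and `invalid_percent_escapes("a%%41")` reports only the first `%`.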
Received on Saturday, 5 June 2021 20:17:21 UTC