Re: [whatwg/url] Provide a succinct grammar for valid URL strings (#479)

Following up on this:

> This also means that it will be possible to add a section to the WHATWG Standard that accurately describes the differences with the RFCs.

I have done more research, especially around the character sets, and built some tools to compute the differences. These are my findings. I will follow up with a post about other, minor grammar changes and reference resolution.

After all is said and done, the differences will be very small, which is great!

## Character Sets

### IRI vs WHATWG URL

The code points allowed in the components of _valid_ WHATWG URLs are _almost_ the same as in RFC3987 IRIs. There is only one difference:

- WHATWG URLs allow more non-ASCII Unicode code points in components.

Specifically, the WHATWG Standard allows these additional code points:
- The Private Use Areas: { u+E000-u+F8FF, u+F0000-u+FFFFD, u+100000-u+10FFFD }.
- Specials, minus the noncharacters: { u+FFF0-u+FFFD }.
- Tags and variation selectors: { u+E0000-u+E0FFF }.

The Private Use Areas _are_ allowed in the query part of an IRI (as `iprivate` in the RFC3987 grammar), though not in the other components.
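
The comparison above can be reproduced with a short script. The following is a minimal Python sketch, not the tooling used for these findings. It assumes the `ucschar` ranges as given in the RFC3987 ABNF and the WHATWG definition of URL code points above u+7F (u+A0 to u+10FFFD, minus surrogates and noncharacters). All helper names are made up for illustration.

```python
LIMIT = 0x110000  # one past the last Unicode code point

# ucschar from RFC 3987, section 2.2 (assumed transcription of the ABNF).
UCSCHAR = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    + [(p << 16, (p << 16) | 0xFFFD) for p in range(0x1, 0xE)]  # planes 1..13
    + [(0xE1000, 0xEFFFD)]
)


def table(ranges):
    """Return a flag table holding 1 for every code point inside `ranges`."""
    flags = bytearray(LIMIT)
    for lo, hi in ranges:
        flags[lo:hi + 1] = b"\x01" * (hi - lo + 1)
    return flags


def whatwg_non_ascii():
    """Flag table for the non-ASCII URL code points (assumed WHATWG definition)."""
    flags = table([(0xA0, 0x10FFFD)])
    flags[0xD800:0xE000] = bytes(0x800)  # surrogates
    flags[0xFDD0:0xFDF0] = bytes(0x20)   # noncharacters u+FDD0..u+FDEF
    for plane in range(17):              # u+xxFFFE and u+xxFFFF in every plane
        base = plane << 16
        flags[base + 0xFFFE] = 0
        flags[base + 0xFFFF] = 0
    return flags


def runs(flags):
    """Collapse a flag table into a list of inclusive (first, last) ranges."""
    out, start = [], None
    for cp, flag in enumerate(flags):
        if flag and start is None:
            start = cp
        elif not flag and start is not None:
            out.append((start, cp - 1))
            start = None
    if start is not None:
        out.append((start, len(flags) - 1))
    return out


def whatwg_minus_ucschar():
    """Ranges allowed as WHATWG URL code points but not matched by ucschar."""
    whatwg, iri = whatwg_non_ascii(), table(UCSCHAR)
    return runs(bytearray(1 if w and not i else 0 for w, i in zip(whatwg, iri)))


if __name__ == "__main__":
    for lo, hi in whatwg_minus_ucschar():
        print(f"u+{lo:04X}-u+{hi:04X}")
```

Under these assumptions the script reports five ranges, which correspond to the three groups listed above.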

### IRI vs loose-WHATWG URL

Let me call any input that the 'basic URL parser' accepts as a single argument a 'loose-WHATWG URL'.

_Note_: The IRI grammar does not split the userinfo into a username and password, but RFC3986 (URI) suggests in section 3.2.1 that the _first_ `:` separates the username from the password, so I assume this in what follows. Note, though, that _valid_ WHATWG URLs do not allow username and password components at all.

To go from IRIs to loose WHATWG URLs, allow any non-ASCII Unicode code point in components, and a number of additional ASCII characters as well. Let's define iinvalid:

iinvalid := { u+0-u+1F, ` `, `"`, `<`, `>`, `[`, `]`, `^`, <code>&#x60;</code>, `{`, `|`, `}`, u+7F }

Then, for the components:

- **username**: add iinvalid and `@` (but remove `:`).
- **password**: add iinvalid and `@`.
- **opaque-host**: add a subset of iinvalid: { u+1-u+8, u+B-u+C, u+E-u+1F, `"`, <code>&#x60;</code>, `{`, `}`, u+7F }
- **path component**: add iinvalid.
- **query**: add iinvalid.
- **fragment**: add iinvalid and `#`.
- For non-special loose WHATWG URLs also add `\` to all the above except for opaque-host. 
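
As a rough illustration, the list above can be written down as data. This is a minimal Python sketch under the definitions in this post. The names `IINVALID`, `OPAQUE_HOST_EXTRA` and `extra_allowed` are hypothetical, and the sets only describe what is _added_ on top of the IRI character sets.

```python
# iinvalid := { u+0-u+1F, space, " < > [ ] ^ ` { | }, u+7F }
IINVALID = frozenset(
    [chr(c) for c in range(0x00, 0x20)] + list(' "<>[]^`{|}') + ["\x7f"]
)

# The subset of iinvalid that is added for opaque hosts.
OPAQUE_HOST_EXTRA = frozenset(
    [chr(c) for c in (*range(0x01, 0x09), 0x0B, 0x0C, *range(0x0E, 0x20))]
    + list('"`{}')
    + ["\x7f"]
)

# Characters added per component, relative to the IRI grammar.
# Note: for username, ':' is also *removed* from the IRI set; that removal
# is not modelled here, since this table only records additions.
_EXTRA = {
    "username": IINVALID | {"@"},
    "password": IINVALID | {"@"},
    "opaque-host": OPAQUE_HOST_EXTRA,
    "path": IINVALID,
    "query": IINVALID,
    "fragment": IINVALID | {"#"},
}


def extra_allowed(component, special=True):
    """Return the extra characters a loose WHATWG URL allows in `component`."""
    extra = _EXTRA[component]
    if not special and component != "opaque-host":
        extra = extra | {"\\"}  # non-special URLs also allow a backslash
    return extra
```

For example, `extra_allowed("query", special=False)` contains `\`, while `extra_allowed("opaque-host", special=False)` does not.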

The grammar would have to be modified to allow invalid percent escape sequences: a single `%` followed by zero or one hexdigits (but not two).
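
One way to express "a `%` that does not start a valid escape" is a negative lookahead. The following Python sketch is only an illustration of that rule; the pattern and the function name `invalid_escapes` are my own and do not come from either specification.

```python
import re

# A valid percent escape: '%' followed by exactly two hex digits.
VALID_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")

# An invalid one: '%' that is NOT followed by two hex digits,
# together with the single hex digit that may follow it.
INVALID_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})[0-9A-Fa-f]?")


def invalid_escapes(text):
    """Return every invalid percent escape sequence found in `text`."""
    return INVALID_ESCAPE.findall(text)


if __name__ == "__main__":
    for sample in ["%41", "%4", "100%", "%zz", "%4g%41"]:
        print(repr(sample), invalid_escapes(sample))
```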

Note that the WHATWG parser removes tabs and newlines { u+9, u+A, u+D } in a preprocessing pass, so you may choose to exclude those from the iinvalid set. Preprocessing also removes leading and trailing sequences of { u+0-u+20 } (C0 control or space), but it is not a good idea to try to express that in the grammar.
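
Handling this outside the grammar is straightforward. Here is a minimal Python sketch of the preprocessing pass as described above; the function name `preprocess` is hypothetical, and validation errors that the parser would report are ignored.

```python
# Code points u+0 through u+20: the C0 controls plus space.
C0_CONTROL_OR_SPACE = "".join(chr(c) for c in range(0x00, 0x21))

# Translation table that deletes tab, line feed and carriage return.
REMOVE_TAB_AND_NEWLINE = {0x09: None, 0x0A: None, 0x0D: None}


def preprocess(text):
    """Strip leading/trailing C0-or-space, then drop all tabs and newlines."""
    stripped = text.strip(C0_CONTROL_OR_SPACE)
    return stripped.translate(REMOVE_TAB_AND_NEWLINE)


if __name__ == "__main__":
    print(repr(preprocess("  \thttp://exa\nmple.org/ \x00")))
```

A grammar for loose WHATWG URLs can then be applied to the output of such a pass, with { u+9, u+A, u+D } left out of iinvalid.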



https://github.com/whatwg/url/issues/479#issuecomment-855290347

Received on Saturday, 5 June 2021 20:17:21 UTC