Re: [whatwg/url] Grammar specification for URLs (#24)

> @sjamaan
> 
> If an overview of what a URL looks like is desired, https://url.spec.whatwg.org/#url-writing should be more than enough. With the exception of the definition of a [valid domain](https://url.spec.whatwg.org/#valid-domain), it could possibly even be translated into a formal grammar for valid URLs!

That sounds interesting, I might give that a try!

> The main reason why this spec exists is the fact that it defines error handling rigorously. This is precisely what a formal grammar that defines what a valid URL is cannot do.

I see there are two error states: "return failure" and "validation error". Since "return failure" does not say which failure occurred, it can be handled simply by rejecting the input as not being part of the language. A "validation error" could be handled by a production that's annotated with "validation error" (like a semantic action in a parser generator). I also see "validation error, return failure", which seems a bit inconsistent or unnecessary to me.
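To make the distinction concrete, here is a minimal sketch in Python of a hand-written recognizer where one alternative carries a "validation error" annotation and everything outside the language is a plain "return failure". The toy grammar (`scheme ":" sep sep host`, with `sep` accepting `/` silently and `\` with an annotation) and the name `parse_toy_url` are assumptions made up for illustration; they are not taken from the URL Standard.

```python
import string
from typing import List, Optional


def parse_toy_url(text: str) -> Optional[List[str]]:
    """Recognize the hypothetical toy language  scheme ":" sep sep host.

    Returns the list of validation errors on success (possibly empty),
    or None, which stands in for "return failure".
    """
    errors: List[str] = []
    pos = 0

    # scheme = 1*ALPHA
    start = pos
    while pos < len(text) and text[pos] in string.ascii_letters:
        pos += 1
    if pos == start:
        return None

    # ":"
    if pos >= len(text) or text[pos] != ":":
        return None
    pos += 1

    # sep sep, where sep = "/" (valid) / "\" (accepted, annotated)
    for _ in range(2):
        if pos >= len(text):
            return None
        if text[pos] == "\\":
            # The "semantic action": record the error and keep parsing.
            errors.append(f"validation error: backslash at offset {pos}")
        elif text[pos] != "/":
            return None
        pos += 1

    # host = 1*(ALPHA / DIGIT / "."), and it must consume the rest of the input
    host_chars = string.ascii_letters + string.digits + "."
    start = pos
    while pos < len(text) and text[pos] in host_chars:
        pos += 1
    if pos == start or pos != len(text):
        return None
    return errors
```

In this sketch, `parse_toy_url("http://example.org")` yields an empty list, the same input written with backslashes yields two recorded validation errors, and `parse_toy_url("http//x")` yields `None`.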

The only problem I see so far is the encoding override stuff, but apparently that's considered legacy anyway, so maybe it can be discussed separately somehow.

> Web content is unfortunately overwhelmingly erroneous, and part of the mission of the WHATWG is to make the web interoperable, which includes handling errors in a uniform fashion. For URL parsers outside of a browser, failing on an invalid URL may be an option. For web browsers however, it is not for the most part.

See above. And even if adding error states to BNF is deemed impossible, this could be handled by a piece of prose. That doesn't reduce the usefulness of a formal grammar, which describes what syntax is acceptable. Every implementation can use it to verify that it correctly accepts valid URLs. If a URL is not accepted, one would want to check whether the spec intended it to be invalid. Eyeballing a BNF for this takes a couple of seconds, minutes at most. Mentally running the algorithm would also work, but it's more involved because you need to keep track of state and so on.

> With regards to security concerns, a solution would be for everyone to adopt this standard's error handling behavior :D Indeed, we are seeing more and more adoption of this standard in the standard library of a language or runtime, like with Rust's [url](https://docs.rs/url/1.7.1/url/) crate, and Node.js' `url` module.

I'll study that documentation; this sounds interesting and like a useful addition to uri-generic too.

> I hope this has answered your question about:
> 
> > it baffled me to read the statement that "there are several large parts of the spec that cannot be captured by any kind of grammar". This is literally equivalent to saying "we can't know if an URL will be valid without evaluating the algorithm".
> 
> Unlike the RFC, we intend to fully define the behavior when an invalid URL is encountered. This leads to a well-defined difference between "valid" and "parsable": the former is generally easy to tell, the latter unfortunately possibly not so much.

Parser generators are notoriously bad at error handling, but that's mostly because you'd need to extend the grammar with explicit failure states. I think this is actually a place where the WHATWG spec could add value, by providing an officially sanctioned BNF augmented with error states!

Writing down such a BNF would be a bit more involved, but might still be worthwhile.
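As a rough sketch of what a grammar augmented with error states might look like, the Python below stores the grammar as data, lets each alternative carry an optional annotation, and interprets it with ordered choice. Everything here is hypothetical: the rule names, the `GRAMMAR` table, the annotation strings, and the `parse` and `match` helpers are invented for illustration and do not reflect the actual URL Standard.

```python
from typing import Dict, List, Optional, Tuple

# ("lit", chars) matches one character out of chars; ("ref", name) matches a rule.
Symbol = Tuple[str, str]
# (symbols, annotation); an annotation of None means plain valid syntax.
Alternative = Tuple[List[Symbol], Optional[str]]

LETTERS = "abcdefghijklmnopqrstuvwxyz"

GRAMMAR: Dict[str, List[Alternative]] = {
    "url": [([("ref", "scheme"), ("lit", ":"), ("ref", "sep"),
              ("ref", "sep"), ("ref", "host")], None)],
    "scheme": [([("ref", "letters")], None)],
    "host": [([("ref", "letters")], None)],
    "letters": [
        ([("lit", LETTERS), ("ref", "letters")], None),
        ([("lit", LETTERS)], None),
    ],
    "sep": [
        ([("lit", "/")], None),
        ([("lit", "\\")], "validation error: backslash used as separator"),
        # An explicit failure state written into the grammar itself.
        ([("lit", " ")], "failure: space where separator expected"),
    ],
}


class ParseFailure(Exception):
    """Stands in for "return failure", carrying the reason."""


def match(rule: str, text: str, pos: int, errors: List[str]) -> Optional[int]:
    """Try each alternative in order; return the new offset or None."""
    for symbols, annotation in GRAMMAR[rule]:
        cur: Optional[int] = pos
        local: List[str] = []  # errors are kept only if this alternative matches
        for kind, value in symbols:
            if kind == "lit":
                cur = cur + 1 if cur < len(text) and text[cur] in value else None
            else:
                cur = match(value, text, cur, local)
            if cur is None:
                break
        if cur is None:
            continue
        if annotation and annotation.startswith("failure"):
            raise ParseFailure(annotation)
        if annotation:
            local.append(annotation)
        errors.extend(local)
        return cur
    return None


def parse(text: str) -> List[str]:
    """Return the validation errors, or raise ParseFailure."""
    errors: List[str] = []
    if match("url", text, 0, errors) != len(text):
        raise ParseFailure("failure: input not in the language")
    return errors
```

One caveat with this design: an explicit failure alternative aborts immediately, so it is only safe in positions where no enclosing rule would otherwise backtrack and try something else. That holds for this toy grammar, but a real grammar would need more care.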

Thank you for taking the time to write out a detailed response. I'll think this through a bit more and hope to come up with a grammar soonish. Would that be doable via pull request?

-- 
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/url/issues/24#issuecomment-420554844

Received on Wednesday, 12 September 2018 08:13:21 UTC