[whatwg/url] It is unlikely that the spec's two definitions of valid URL strings are equal (Issue #704)

(Background note: _valid_ URL strings are completely different from _parseable_ URL strings. See https://url.spec.whatwg.org/#writing and https://url.spec.whatwg.org/#urls , especially the table in the latter, for background.)

The spec currently gives two ways of determining whether an input string is considered "valid":

- A string is a valid URL string if it follows the pseudo-grammar given in https://url.spec.whatwg.org/#url-writing . (This is the actual `<dfn>` of "valid URL string".)
- A string is supposedly a valid URL string if running the [URL parser](https://url.spec.whatwg.org/#concept-url-parser) on it never results in a step that invokes the [validation error](https://url.spec.whatwg.org/#validation-error) concept.

The latter definition is not precisely specified anywhere; the only thing you can really infer from the spec is that running the parser on some strings produces, as a side effect, one or more validation errors. As far as I can find, there's no actual bridging text connecting that to a concept like "valid URL string".

But I think the intent is for these two to be equivalent.

The problem is, there's absolutely no guarantee this is the case. One instance of a mismatch was found in https://github.com/whatwg/url/issues/437, but I suspect there are many more. As I commented there, I think it would be a fun project to first write an implementation of the parser which tracks validation errors, and then write an implementation of the #url-writing section, and then compare what results those two pieces of code get on a bunch of different (URL string, base URL) inputs. (Sounds like a job for a fuzzer!)
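To make the comparison idea concrete, here is a minimal sketch of such a differential harness. Everything in it is an assumption for illustration: `find_mismatches`, `toy_grammar`, and `toy_parser` are hypothetical names, and the two toy checkers use deliberately arbitrary rules that do not reflect the spec. A real run would plug in an actual #url-writing implementation and an actual validation-error-tracking parser, and would feed in fuzzer-generated inputs.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# A checker takes (url_string, base_url_or_None) and returns True for "valid".
Checker = Callable[[str, Optional[str]], bool]
Case = Tuple[str, Optional[str]]
Mismatch = Tuple[str, Optional[str], bool, bool]


def find_mismatches(
    by_grammar: Checker, by_parser: Checker, cases: Iterable[Case]
) -> List[Mismatch]:
    """Return every input on which the two validity definitions disagree."""
    mismatches: List[Mismatch] = []
    for url, base in cases:
        grammar_says = by_grammar(url, base)
        parser_says = by_parser(url, base)
        if grammar_says != parser_says:
            mismatches.append((url, base, grammar_says, parser_says))
    return mismatches


# Placeholder checkers with arbitrary rules, chosen only so that they disagree.
# They are NOT the spec's definitions of validity.
def toy_grammar(url: str, base: Optional[str]) -> bool:
    return " " not in url


def toy_parser(url: str, base: Optional[str]) -> bool:
    return " " not in url.strip()


if __name__ == "__main__":
    cases = [("https://example.com/", None), ("a b", None), (" a", None)]
    for url, base, g, p in find_mismatches(toy_grammar, toy_parser, cases):
        print(f"{url!r} (base={base!r}): grammar={g}, parser={p}")
```

Each reported mismatch would be a candidate spec bug: either the grammar or the parser's validation errors would need adjusting, depending on which answer is the intended one.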

We could solve this in one of two ways:

- Delete one of the two definitions of valid URL string
- Make sure they align, and maybe try to maintain that via automated fuzzing somehow.

The latter sounds very silly and redundant to me at first glance. However:

- The parser-produced validation errors have the nice property that they can be gotten almost for free as a side effect of parsing. (The hardest part is implementing "is a URL code point", which a non-validating parser doesn't need at all.) They also could easily be given names that allow you to track down exactly what went wrong; see #406. Whereas the grammar-ish #url-writing section just gives you a boolean predicate. So that argues for keeping the parser-produced validation errors...

- On the other hand, #479 indicates people do seem to like the idea of a short grammar-ish algorithm for determining validity. It's not 100% clear whether they are distinguishing validity from parseability. In particular, #url-writing has existed forever, yet people have spent 77 comments complaining about the spec lacking a grammar; that suggests #url-writing is not that helpful, and that what they're really complaining about is how complicated the URL parser is to implement. But I'm not sure.
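As a rough illustration of the difference in shape between the two approaches, the sketch below covers only the parser's two input-cleanup steps (stripping leading/trailing C0 control or space, and removing ASCII tab or newline), each of which reports a validation error. The error names, `PreprocessResult`, and `preprocess` are hypothetical; the spec did not name its validation errors at the time of writing. The point is that the list of named errors falls out of work the parser does anyway, while a grammar-style check would only yield the boolean.

```python
from dataclasses import dataclass, field
from typing import List

# "C0 control or space" is U+0000 through U+0020 inclusive.
C0_OR_SPACE = "".join(chr(c) for c in range(0x21))
# "ASCII tab or newline" is U+0009, U+000A, U+000D.
TAB_OR_NEWLINE = "\t\n\r"


@dataclass
class PreprocessResult:
    cleaned: str
    errors: List[str] = field(default_factory=list)  # hypothetical error names

    @property
    def valid(self) -> bool:
        # Boolean view, limited to the two steps sketched here.
        return not self.errors


def preprocess(input_: str) -> PreprocessResult:
    """Clean the input as the parser would, recording errors as a side effect."""
    result = PreprocessResult(cleaned=input_)

    stripped = result.cleaned.strip(C0_OR_SPACE)
    if stripped != result.cleaned:
        result.errors.append("leading-or-trailing-c0-or-space")
        result.cleaned = stripped

    without = "".join(ch for ch in result.cleaned if ch not in TAB_OR_NEWLINE)
    if without != result.cleaned:
        result.errors.append("tab-or-newline")
        result.cleaned = without

    return result
```

A caller that only wants the predicate can read `.valid`; one that wants to know what went wrong can inspect `.errors`.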

So, I'm not sure where that really leaves us. But I wanted to log this issue so we had a canonical place to record this unfortunate fact about the current standard.

See also https://github.com/jsdom/whatwg-url/issues/156.


Message ID: <whatwg/url/issues/704@github.com>

Received on Friday, 2 September 2022 02:45:34 UTC