Re: [whatwg/url] Parser generates invalid URLs (#379)

alwinb left a comment (whatwg/url#379)

> I also think it's more complicated than individual code points, e.g., validity of % depends on subsequent code points.

That’s no reason not to clarify and clean up the description of how individual code points are normalised and validated. I think the confusion in #852 is a good indication that this could use a bit more attention.

Beyond the invalid percent-encode sequences you mention, almost all other potential invalidities are of a more structural nature, such as:

- An authority with an empty or absent host cannot have credentials and/or a port.
- The host of a non-special URL is an opaque host or an IPv6 address.
- An opaque host in a special URL must be further parsed as a domain or an IPv4 address.
- The authority of a file URL cannot have credentials or a port.
- A URL that has an opaque path cannot be used as a base URL, unless the input consists only of a fragment.
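
Sketched as a predicate over a simplified, hypothetical URL-record shape (the field names here are mine, loosely following the spec’s URL record; the host-shape invariants are left out since they concern how the host itself is parsed rather than the record’s fields):

```ts
// A hypothetical, simplified URL record; field names loosely follow
// the URL standard's "URL record".
interface UrlRecord {
  scheme: string;
  username: string;
  password: string;
  host: string | null; // serialised host, or null if absent
  port: number | null;
}

// The credentials/port invariants from the list above, as a predicate.
function satisfiesInvariants(url: UrlRecord): boolean {
  const hasCredentials = url.username !== "" || url.password !== "";
  // An empty or absent host admits neither credentials nor a port.
  if (url.host === null || url.host === "") {
    if (hasCredentials || url.port !== null) return false;
  }
  // The authority of a file URL admits neither credentials nor a port.
  if (url.scheme === "file" && (hasCredentials || url.port !== null)) {
    return false;
  }
  return true;
}
```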

Typically these can be expressed as invariants on the URL record, as opposed to requirements on the percent-encoded strings stored within it.
Moreover, these cases cause the parser either to reject the input or to parse it differently, so they don’t seem to contribute to the issue that _the parser can produce invalid URLs_.
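
That remaining issue is easy to demonstrate even without opaque paths; for instance (my own example input, with behaviour as I read the path state, where a stray `%` is only a non-fatal validation error):

```ts
// "%" is not in the path percent-encode set, and a "%" not followed
// by two hex digits is merely a validation error, so it passes
// through verbatim; the serialised output is an invalid URL.
console.log(new URL("https://example.com/a%zz").href);
// "https://example.com/a%zz"
```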

As a bit of contextual motivation: I am expressing here the idea of URLs as a data structure that contains percent-encoded strings at its leaves. This design is inherited from generic URIs and allows for application-specific further processing, which in turn needs a mechanism to distinguish between percent-encoded and verbatim reserved code points (for example, `=` in the query).
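
For instance, the `URLSearchParams` parser splits the query on `&` and `=` *before* percent-decoding, which is precisely this distinction at work:

```ts
// A verbatim "=" separates key from value; an encoded "%3D" does not.
const p1 = new URLSearchParams("a=b%3Dc");
console.log([...p1]); // [["a", "b=c"]]

const p2 = new URLSearchParams("a%3Db=c");
console.log([...p2]); // [["a=b", "c"]]
```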

The domain and the search params are indeed further specialisations/interpretations of those percent-encoded strings at the leaves, obtained by further parsing the opaque host and the query, respectively.
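
Here is how that difference surfaces through the `URL` API (my own example hosts; behaviour per the host parser, which percent-decodes and domain-to-ASCII’s the host of a special URL but leaves an opaque host verbatim):

```ts
// Special URL: the host is further parsed as a domain; "%41" is
// percent-decoded to "A" and then lowercased by domain-to-ASCII.
console.log(new URL("https://ex%41mple.com/").host); // "example.com"

// Non-special URL: the host stays an opaque host and keeps its
// percent-encoding as-is.
console.log(new URL("foo://ex%41mple.com/").host); // "ex%41mple.com"
```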

What I suggested above was to include a table for the constraints on the percent-encoded strings specifically: one that shows at a glance how individual code points in each component are, and ought to be, both encoded *and validated*. (I’ll post an image below.)
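
For the *encoded* half of such a table, the standard already defines per-component percent-encode sets; here is a rough sketch of them as predicates (the set contents are paraphrased from the spec and worth double-checking):

```ts
// The C0-control set: C0 controls and code points above U+007E.
const c0Control = (c: number): boolean => c <= 0x1f || c > 0x7e;

// Build a larger set by extending a base set with extra code points.
const extend = (base: (c: number) => boolean, extra: string) =>
  (c: number): boolean => base(c) || extra.includes(String.fromCodePoint(c));

// fragment: C0 set plus space, ", <, >, and backtick
const fragmentSet = extend(c0Control, ' "<>`');
// query: C0 set plus space, ", #, <, > (special URLs also encode ')
const querySet = extend(c0Control, ' "#<>');
// path: query set plus ?, backtick, {, }
const pathSet = extend(querySet, "?`{}");
// userinfo: path set plus /, :, ;, =, @, [, \, ], ^, |
const userinfoSet = extend(pathSet, "/:;=@[\\]^|");
```

The *validated* column is the half that such a table would add on top of these sets.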

So far, that’s editorial, and does not actually alleviate the discrepancy between valid, parsable, and parser-producible URLs.

> It's very easy to produce invalid HTML, e.g., `<b><div>blah</div></b>`. It's also possible at the syntactic level through APIs. HTML is really quite a bit worse at this.

One of the reasons I’ve done less on the URL front is that I’ve been in the belly of the whale with regard to the HTML standard. And you are right that the situation is somewhat similar.

But you seem to argue that it is undesirable there too, so this is **not** a good argument for leaving the issue unaddressed here. :D

As to which:

Can we have another look at this issue, leaving opaque paths out of consideration? The discrepancy between valid and parser-produced URLs is then much (much!) smaller and can potentially be bridged.
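
For contrast, this is the kind of invalid output that opaque paths produce (my own example; `<` and `>` are not valid URL code points, yet the opaque-path state only applies the C0-control percent-encode set):

```ts
// "<" and ">" are outside the C0-control percent-encode set, so the
// parser keeps them verbatim in the opaque path, and the serialised
// URL is invalid.
console.log(new URL("data:text/html,<b>hi</b>").href);
// "data:text/html,<b>hi</b>"
```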

Finally, the very least we could do is to update the table at the start of section “4. URLs” and add a column that says whether the output is valid or not.
