Re: [whatwg/url] Parser generates invalid URLs (#379) from Alwin Blok on 2024-12-02 (public-webapps-github@w3.org from December 2024)

From: Alwin Blok <notifications@github.com>
Date: Mon, 02 Dec 2024 15:23:11 -0800
To: whatwg/url <url@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/url/issues/379/2513182638@github.com>

Whilst I have no strong opinion about the exact encode sets (other than for `\`, which I'll try to expand on in #675 when I have some time),
I do want to point out that this issue is not merely about the encode sets themselves but about the fact that a parse-serialise round trip can produce **invalid** URL strings.

And as such, this issue is not resolved! (Sorry!)

Some suggestions follow.

I understand very well the reasons for the status quo, but it is nonetheless a messy situation:

We currently have to consider valid URL strings, invalid-but-tolerated URL strings, and invalid-and-rejected URL strings. Some of them are defined explicitly in the standard, others implicitly.

Furthermore, an invalid-but-tolerated URL string may end up either as a valid URL string after a parse-serialise round trip, or as an invalid-but-tolerated URL string, depending on the exact input.
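
To make the three outcomes concrete, here is a minimal sketch of a round-trip classifier. The names `RoundTripOutcome`, `classifyRoundTrip`, `parseAndSerialise` and `isValidUrlString` are all hypothetical and not taken from the standard. Both the parser and the validity check are passed in by the caller, since this sketch assumes nothing about how either is implemented.

```typescript
// Hypothetical labels for the possible results of a parse-serialise round trip.
type RoundTripOutcome = "rejected" | "valid" | "invalid-but-tolerated";

// parseAndSerialise: assumed to return the serialised URL string, or null on
// a parsing failure (for example a wrapper that catches the TypeError thrown
// by `new URL(input)` and otherwise returns its `href`).
// isValidUrlString: assumed, caller-supplied check for a valid URL string.
function classifyRoundTrip(
  input: string,
  parseAndSerialise: (s: string) => string | null,
  isValidUrlString: (s: string) => boolean,
): RoundTripOutcome {
  const output = parseAndSerialise(input);
  if (output === null) return "rejected";
  // The serialiser's output is checked, not the input: the point of this
  // issue is that the output itself may still be an invalid URL string.
  return isValidUrlString(output) ? "valid" : "invalid-but-tolerated";
}
```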

I agree with @domenic that the last case is bound to cause confusion. Moreover, it seems fairly easy to resolve.

I found it especially hard to get a full grasp of what exactly the URL standard entails, until I started to identify and name the different subclasses of URL strings for myself, in a similar way as above.

So my advice (which you don’t have to take, though I will follow it in my own work) would be to make some editorial changes, possibly including some naming changes or additional definitions, to make it all clearer.

To back that up with more motivation: which code points do and do not end up being encoded is typically something that drifts across different implementations and applications. And it’s one of the hardest things for implementers to get right, because there are so many different encode sets and special cases:

I summarised in my own document (not published yet, busy with other things) that we now have, **per component type** (i.e. path segment, query, fragment, among others), **four** different possible behaviours for very specific subsets of code points. A single code point may be:

- Valid and not encoded (pass through)
- Valid but encoded nonetheless
- Invalid but tolerated and not encoded
- Invalid and rejected, causing a parsing failure
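
A minimal sketch of how such a per-component classification could be modelled follows. All names (`Treatment`, `ComponentRules`, `treatmentOf`) are hypothetical, and the three predicates are assumptions that a caller would have to fill in per component type from the standard's code point definitions and encode sets. Besides the four behaviours listed above, the sketch needs one more label for an invalid code point that is tolerated and encoded, which is the case where an invalid-but-tolerated string round-trips to a valid one.

```typescript
// Hypothetical labels; the URL standard does not name these behaviours.
type Treatment =
  | "pass-through"        // valid and not encoded
  | "valid-but-encoded"   // valid but encoded nonetheless
  | "tolerated-unencoded" // invalid but tolerated and not encoded
  | "tolerated-encoded"   // invalid, tolerated, and encoded
  | "rejected";           // invalid and rejected (parsing failure)

// Assumed per-component rules, supplied by the caller.
interface ComponentRules {
  isValid(codePoint: number): boolean;    // allowed in a valid URL string here
  isEncoded(codePoint: number): boolean;  // percent-encoded by the parser here
  isRejected(codePoint: number): boolean; // makes the parser return failure
}

function treatmentOf(codePoint: number, rules: ComponentRules): Treatment {
  if (rules.isRejected(codePoint)) return "rejected";
  const encoded = rules.isEncoded(codePoint);
  if (rules.isValid(codePoint)) {
    return encoded ? "valid-but-encoded" : "pass-through";
  }
  return encoded ? "tolerated-encoded" : "tolerated-unencoded";
}
```

Only the `tolerated-unencoded` code points can survive a round trip as invalid output, so they are the ones this issue is about.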

And moreover, this comes *with slight differences* across special and non-special hierarchical URLs, and large differences with opaque-path URLs, for which the path is touched as little as possible, for good reasons.

That doesn’t include single and double dot segments, which are also actually decoded.
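
For illustration, the dot-segment matching can be sketched as below. This reflects my reading of the standard's single-dot and double-dot path segment definitions, where `%2e` is matched ASCII case-insensitively. The function names are hypothetical.

```typescript
// True for "." and for its percent-encoded form "%2e" in any letter case.
function isSingleDotSegment(segment: string): boolean {
  const s = segment.toLowerCase();
  return s === "." || s === "%2e";
}

// True for ".." and for any mix of literal and percent-encoded dots,
// e.g. ".%2E" or "%2e%2E".
function isDoubleDotSegment(segment: string): boolean {
  const s = segment.toLowerCase();
  return s === ".." || s === ".%2e" || s === "%2e." || s === "%2e%2e";
}
```

So a segment such as `%2E%2E` is not passed through like other percent-encoded bytes. It is treated as `..` and removed during path shortening.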

All this is highly convoluted; it is very hard to understand and to implement correctly. So anything that can help to clarify it or clean it up should be welcome, and would help adoption of this standard.

--
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/url/issues/379#issuecomment-2513182638

Received on Monday, 2 December 2024 23:23:15 UTC