Re: Invalid Characters in URLs

If we're targeting compatibility with web stuff, probably any update of the
URL spec would want to align heavily with the WHATWG one that browsers
follow.

From a skim of the WHATWG URL spec, the path state
<https://url.spec.whatwg.org/#path-state> treats such characters as a
validation error <https://url.spec.whatwg.org/#validation-error>, but
validation errors are non-fatal and the spec still defines what to do when
you proceed. In particular, those characters are accepted, but some (not
all!) of them get %-encoded by the parser, so they don't appear as-is in
the parsed URL.

This space is kind of a mess, though. Playing around, it seems no two of
Chrome, Firefox, and Safari agree yet on which characters are %-escaped in
`new URL("https://example.com/[]{}|^").pathname`. I believe this is an
area folks are working on aligning, maybe? (I haven't been following it,
so I'm not sure what the current status is.)
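
For anyone who wants to repeat the comparison, here is a rough sketch that
checks each character on its own instead of all six at once. It only relies
on the standard `URL` constructor; the helper name `pathEscapingReport` and
the `CANDIDATES` list are made up for illustration, and the output will
depend on which engine and version runs it.

```javascript
// The characters from the thread that RFC 3986 does not allow unescaped.
const CANDIDATES = ["[", "]", "{", "}", "|", "^"];

// Hypothetical helper: maps each character to the form it takes in the
// parsed pathname (either the raw character or a %-escape such as "%7B").
function pathEscapingReport(chars = CANDIDATES) {
  const report = {};
  for (const ch of chars) {
    const { pathname } = new URL("https://example.com/" + ch);
    // pathname is "/" followed by the character, raw or percent-encoded.
    report[ch] = pathname.slice(1);
  }
  return report;
}

// Paste into each browser's console (or run under Node) and compare.
console.log(pathEscapingReport());
```

Any entry that comes back unchanged was passed through raw by that parser;
any entry that comes back as a %-escape was encoded during parsing.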


On Thu, Sep 19, 2024 at 4:11 PM Ryan Hamilton <rch@google.com> wrote:

> Howdy Folks,
>
> We've been doing some work lately to tighten up our HTTP spec compliance,
> specifically around invalid characters in URLs
> <https://quiche.googlesource.com/quiche/+/4249f8025caed1e3d71d04d9cadf42251acb7cac/quiche/balsa/header_properties.h#54>.
> Perhaps not surprisingly, we've seen many requests which include one or more
> of the following characters, which are prohibited by RFC 3986: []{}|^
>
> Presumably other implementers see this as well? RFC 3986 came out in 2005
> and I suspect the web has evolved significantly since then. In much the
> same way the WG is addressing the issue of invalid characters in Cookies as
> part of rfc6265bis, is there any appetite in the WG for addressing the
> issue of invalid characters in URLs?
>
> Cheers,
>
> Ryan
>

Received on Thursday, 19 September 2024 20:31:28 UTC