Re: [whatwg/url] Strictness on Port doesn't conform to URL/URI RFCs (Issue #883) from The Moisrex on 2025-10-05 (public-webapps-github@w3.org from October 2025)

From: The Moisrex <notifications@github.com>
Date: Sun, 05 Oct 2025 08:32:01 -0700
To: whatwg/url <url@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/url/issues/883/3369134129@github.com>

the-moisrex left a comment (whatwg/url#883)

> I'd challenge you to beat my logic:

First: your initial example implied that hex-looking pieces of information were being used. You later expanded on that, and I then came up with the second solution, which fits the example usage better than the first.

Second: The fact that in your example you're parsing a URL with `.split`s proves my point.

The problem is you're unaware of the complexity of the WHATWG URL parser.

Look at how [a real WHATWG URL parser handles '@'](https://github.com/ada-url/ada/blob/f3ad8a179ec276cb3b4b9e908da6bdf2dcb196bb/src/parser.cpp#L230-L243). You first have to find out where the '@' is, **even if the URL doesn't contain one**.

In my parser, which is still under development, I assume the userinfo section doesn't exist and continue parsing; if I see '@', I roll back. I do this to put the slow path on URLs that have userinfo. It's a trade-off to make parsing normal URLs faster. Even Ada-URL, which is used in Node.js, has a TODO about this.
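
A minimal sketch of that optimistic strategy, for illustration only. It is not the code of the parser mentioned above or of Ada; the function names are made up, and it assumes no IPv6 literals and does no percent-encoding or validation:

```js
// Hypothetical sketch: split an authority string such as "user:pw@host:8080".
// Fast path: scan once, assuming there is no userinfo.
function splitAuthority(authority) {
  let colon = -1;
  for (let i = 0; i < authority.length; i++) {
    const ch = authority[i];
    if (ch === "@") {
      // The assumption was wrong: discard the work and take the slow path.
      return splitWithUserinfo(authority);
    }
    if (ch === ":") colon = i;
  }
  return hostAndPort(authority, colon, "", "");
}

// Slow path: the last '@' ends the userinfo; its first ':' ends the username.
function splitWithUserinfo(authority) {
  const at = authority.lastIndexOf("@");
  const userinfo = authority.slice(0, at);
  const rest = authority.slice(at + 1);
  const sep = userinfo.indexOf(":");
  const username = sep === -1 ? userinfo : userinfo.slice(0, sep);
  const password = sep === -1 ? "" : userinfo.slice(sep + 1);
  return hostAndPort(rest, rest.lastIndexOf(":"), username, password);
}

function hostAndPort(rest, colon, username, password) {
  const host = colon === -1 ? rest : rest.slice(0, colon);
  const port = colon === -1 ? "" : rest.slice(colon + 1);
  return { username, password, host, port };
}

console.log(splitAuthority("example.com:8080"));
console.log(splitAuthority("user:pw@example.com:8080"));
```

URLs without userinfo pay for a single scan; only URLs that contain '@' pay for the second pass.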

So, yeah, if you're using domain names only, you're parsing fewer things than when you're parsing userinfo, port and password as well. For small URLs, which are most URLs, you wouldn't notice the difference without benchmarking, though.

Yes, there are a lot of these random rules that have caused URL parsers to make two or three passes over pieces of URLs to figure things out.

Another random one is that newlines and tabs, for some reason, need to be removed from the input string.

For example, spaces in domains are not okay, but newlines are:

```js
> new URL("http://example.com")
URL {
  href: 'http://example.com/',
  origin: 'http://example.com',
  protocol: 'http:',
  username: '',
  password: '',
  host: 'example.com',
  hostname: 'example.com',
  port: '',
  pathname: '/',
  search: '',
  searchParams: URLSearchParams {},
  hash: ''
}
> new URL("http://exa mple.com")
Uncaught TypeError: Invalid URL
    at new URL (node:internal/url:828:25) {
  code: 'ERR_INVALID_URL',
  input: 'http://exa mple.com'
}
> new URL("http://exa\nmple.com")
URL {
  href: 'http://example.com/',
  origin: 'http://example.com',
  protocol: 'http:',
  username: '',
  password: '',
  host: 'example.com',
  hostname: 'example.com',
  port: '',
  pathname: '/',
  search: '',
  searchParams: URLSearchParams {},
  hash: ''
}
``` 
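
The behaviour above comes from a preprocessing step in the WHATWG URL parser: leading and trailing C0 control or space characters are trimmed, then every ASCII tab and newline is removed, all before any real parsing starts. A rough sketch of that step alone, where `preprocess` is a made-up name and not a function of any library:

```js
// Hypothetical helper mirroring the preprocessing step described above.
function preprocess(input) {
  // Trim leading/trailing C0 control or space (U+0000 through U+0020).
  const trimmed = input.replace(/^[\u0000-\u0020]+|[\u0000-\u0020]+$/g, "");
  // Remove every ASCII tab, line feed and carriage return, wherever it is.
  return trimmed.replace(/[\t\n\r]/g, "");
}

console.log(preprocess("http://exa\nmple.com"));   // http://example.com
console.log(preprocess("  http://exa mple.com\t")); // http://exa mple.com
```

The inner space survives this step, which is why the second URL in the session above is rejected later, at host parsing.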

All of these things have costs. You can't do `.split`s on raw URLs and expect things to work; WHATWG URL is far more complicated than that.
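
To make that concrete, here is a deliberately naive host extractor compared with `new URL`. `naiveHost` and the sample URLs are made up for this comparison:

```js
// Hypothetical naive approach: split the raw string on '/' and ':'.
function naiveHost(raw) {
  return raw.split("/")[2].split(":")[0];
}

const withUserinfo = "http://user:pw@example.com:8080/";
console.log(naiveHost(withUserinfo));         // user
console.log(new URL(withUserinfo).hostname);  // example.com

const withNewline = "http://exa\nmple.com/";
console.log(naiveHost(withNewline) === "example.com");         // false
console.log(new URL(withNewline).hostname === "example.com");  // true
```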

But with the second solution that I proposed, tagged sub-domains, you can safely use `.split` once you've extracted the hostname with a WHATWG-compatible URL parser. This code is wasteful, but it is clean and understandable, and it actually works no matter what you throw at it:

```js
const url = new URL("...").hostname.split('.');
const ua = url.find(a => a.startsWith("ua--"));
const us = url.find(a => a.startsWith("us--"));
// ...
```
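
A runnable variant of the same idea. The helper name `readTags`, the `example.com` hostnames and the tag values are placeholders invented for the example:

```js
// Hypothetical helper: collect "<prefix>--<value>" labels from the hostname.
function readTags(urlString, prefixes) {
  // Let the URL parser deal with userinfo, port, newlines and so on first.
  const labels = new URL(urlString).hostname.split(".");
  const tags = {};
  for (const prefix of prefixes) {
    const marker = prefix + "--";
    const label = labels.find((l) => l.startsWith(marker));
    if (label !== undefined) tags[prefix] = label.slice(marker.length);
  }
  return tags;
}

console.log(readTags("https://ua--foo.us--bar.example.com/", ["ua", "us"]));
console.log(readTags("https://user:pw@ua--foo.example.com:8443/a.b", ["ua", "us"]));
```

Because the split only ever sees the parsed hostname, userinfo, ports and dots in the path cannot leak into the labels.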

---

> We can't use - since it's url safe and people may register host--tw3.eth as a domain for malicious reasons so let's use :

AFAIK you can't register `us--` or `ua--` domains (though URL parsers accept them), whereas you can register `host--tw3`. You can use them as sub-domains, but you shouldn't.

Again, there are other solutions as well. Don't use a chain ID as a port number just because both of them are numbers; that's a random rule.




Received on Sunday, 5 October 2025 15:32:05 UTC