- From: The Moisrex <notifications@github.com>
- Date: Sun, 05 Oct 2025 08:32:01 -0700
- To: whatwg/url <url@noreply.github.com>
- Cc: Subscribed <subscribed@noreply.github.com>
- Message-ID: <whatwg/url/issues/883/3369134129@github.com>
the-moisrex left a comment (whatwg/url#883)

> I'd challenge you to beat my logic:

First: your initial example implied that hex-looking pieces of information were being used. You later expanded on that, and I then came up with the second solution, which fits the example usage better than the first.

Second: the fact that your example parses a URL with `.split`s proves my point. The problem is that you're unaware of the complexity of the WHATWG URL parser. Look at what [a real WHATWG URL parser does to parse '@'](https://github.com/ada-url/ada/blob/f3ad8a179ec276cb3b4b9e908da6bdf2dcb196bb/src/parser.cpp#L230-L243). You first have to find out where the '@' is, **even if the URL doesn't contain one**. In my parser, which is still under development, I assume the userinfo section doesn't exist and continue parsing; if I see an '@', I roll back. I do this to put the slow path on URLs that have userinfo. It's a trade-off that makes parsing ordinary URLs faster. Even Ada-URL, which is used in Node.js, has a TODO about this.

So yes, if you use domain names only, you're parsing less than when you also have to parse the username, password, and port, even though for small URLs, which is most URLs, you wouldn't notice the difference without benchmarking.

Yes, there are a lot of these random rules that have caused URL parsers to make two or three passes over pieces of a URL to figure things out. Another random one is that newlines and tabs, for some reason, need to be removed from the input string.
For example, spaces in domains are not okay, but newlines are:

```js
> new URL("http://example.com")
URL {
  href: 'http://example.com/',
  origin: 'http://example.com',
  protocol: 'http:',
  username: '',
  password: '',
  host: 'example.com',
  hostname: 'example.com',
  port: '',
  pathname: '/',
  search: '',
  searchParams: URLSearchParams {},
  hash: ''
}
> new URL("http://exa mple.com")
Uncaught TypeError: Invalid URL
    at new URL (node:internal/url:828:25) {
  code: 'ERR_INVALID_URL',
  input: 'http://exa mple.com'
}
> new URL("http://exa\nmple.com")
URL {
  href: 'http://example.com/',
  origin: 'http://example.com',
  protocol: 'http:',
  username: '',
  password: '',
  host: 'example.com',
  hostname: 'example.com',
  port: '',
  pathname: '/',
  search: '',
  searchParams: URLSearchParams {},
  hash: ''
}
```

All of these things have costs. You can't do `.split`s on raw URLs and expect things to work; WHATWG URL is far more complicated than that. But with the second solution I proposed, using tagged subdomains, you can safely use `.split` on your URLs once you've extracted the domain with a WHATWG-compatible URL parser. Even though this code is wasteful, it is clean and easy to understand, and it actually works no matter what you throw at it:

```js
// Parse with a WHATWG-compatible parser first, then split only the hostname.
const url = new URL("...").hostname.split('.');
const ua = url.find(a => a.startsWith("ua--"));
const us = url.find(a => a.startsWith("us--"));
// ...
```

---

> We can't use - since it's url safe and people may register host--tw3.eth as a domain for malicious reasons so let's use :

AFAIK you can't register `us--` or `ua--` domains (although URL parsers accept them), whereas you can register `host--tw3`. You can use them as subdomains, but you shouldn't. Again, there are other solutions as well. Don't use the chain ID as a port number just because both of them are numbers; that's a random rule.
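To make the tagged-subdomain idea above a bit more concrete, here is a minimal sketch that collects every `tag--value` label from a hostname after the WHATWG parser has done its work. The `ua--`/`us--` convention comes from the comment; the helper name `taggedLabels` and the sample URLs are hypothetical and chosen only for illustration.

```javascript
// Hypothetical helper (not part of any library): collect "tag--value" labels
// from a URL's hostname. The sample URLs below are invented for illustration.
function taggedLabels(input) {
  // Let the WHATWG parser deal with userinfo, ports, and tab/newline removal
  // before anything is split by hand.
  const labels = new URL(input).hostname.split(".");
  const tags = new Map();
  for (const label of labels) {
    const sep = label.indexOf("--");
    // Require a non-empty tag before the "--" separator.
    if (sep > 0) {
      tags.set(label.slice(0, sep), label.slice(sep + 2));
    }
  }
  return tags;
}

// Userinfo and port do not interfere, because only the hostname is split.
const tags = taggedLabels("https://user:pw@ua--abc.us--def.example.com:8443/p?q=1");
console.log(tags.get("ua"), tags.get("us"));
```

Note that this sketch treats any label containing `--` as tagged, so a Punycode label such as `xn--...` would show up under the tag `xn`; a real implementation would probably want an allow-list of tags.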
-- 
View it on GitHub: https://github.com/whatwg/url/issues/883#issuecomment-3369134129
Received on Sunday, 5 October 2025 15:32:05 UTC