[whatwg/url] Basic URL parse requires stripping tabs before host state is entered, allowing bad hosts (Issue #829)

### What is the issue with the URL Standard?

In this document:

https://url.spec.whatwg.org/#concept-basic-url-parser

Item 3 says:

> Remove all [ASCII tab or newline](https://infra.spec.whatwg.org/#ascii-tab-or-newline) from input.

After this, it describes how each parser state is processed. In `host state`/`hostname state`, it states that a bad host must terminate parsing with failure (points 3 and 4):

> Let host be the result of [host parsing](https://url.spec.whatwg.org/#concept-host-parser) buffer with url [is not special](https://url.spec.whatwg.org/#is-not-special).

> If host is failure, then return failure.

In _host parsing_, it says that a forbidden code point should terminate parsing:

> If asciiDomain contains a [forbidden domain code point](https://url.spec.whatwg.org/#forbidden-domain-code-point), [domain-invalid-code-point](https://url.spec.whatwg.org/#domain-invalid-code-point) [validation error](https://url.spec.whatwg.org/#validation-error), return failure.

Finally, _forbidden domain code point_ includes every _forbidden host code point_, and that set lists U+0009 TAB as an invalid character. A tab in a host should therefore fail URL parsing; otherwise a manufactured host name is produced.

Because all tabs are stripped from the URL first and tabs are only rejected later, during host parsing, host names cannot be validated properly: the invalid characters are removed before they can be evaluated.
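The ordering problem can be sketched in a few lines of Python. This is only an illustration, not the spec's algorithm: the helper names `strip_tab_newline` and `has_forbidden_host_code_point` are hypothetical, and the host check is reduced to a membership test against the forbidden host code points.

```python
# Illustrative sketch only: hypothetical helper names, not the spec's algorithm text.
ASCII_TAB_OR_NEWLINE = {"\t", "\n", "\r"}

# Forbidden host code points: NULL, TAB, LF, CR, space, and # / : < > ? @ [ \ ] ^ |
FORBIDDEN_HOST_CODE_POINTS = set("\x00\t\n\r #/:<>?@[\\]^|")


def strip_tab_newline(text: str) -> str:
    """Mimic item 3 of the basic URL parser: drop every ASCII tab or newline."""
    return "".join(ch for ch in text if ch not in ASCII_TAB_OR_NEWLINE)


def has_forbidden_host_code_point(host: str) -> bool:
    """Simplified stand-in for the host parser's forbidden code point check."""
    return any(ch in FORBIDDEN_HOST_CODE_POINTS for ch in host)


raw_host = "abc\txyz.test"

# Checking the raw host would catch the tab.
checked_first = has_forbidden_host_code_point(raw_host)

# The spec's order strips first, so the later check never sees the tab.
stripped_host = strip_tab_newline(raw_host)
checked_after_strip = has_forbidden_host_code_point(stripped_host)

print(checked_first, repr(stripped_host), checked_after_strip)
```

With the strip applied first, the check reports nothing wrong and the host that reaches validation is already the manufactured one.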

This has an immediate effect on some current libraries. For example, Python's `urlsplit` takes `abc<tab>xyz.test` and manufactures the host name `abcxyz.test`, because it removes tabs from the URL before it has a chance to validate the host name.
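A minimal reproduction of the Python behavior might look like the following. It assumes a CPython release whose `urlsplit` strips tabs from the input, as described above; the `http://` scheme and trailing slash are added here only so that `urlsplit` treats the text after `//` as the network location.

```python
from urllib.parse import urlsplit

# The host portion contains a literal tab between "abc" and "xyz".
parts = urlsplit("http://abc\txyz.test/")

# On releases that strip tabs before splitting, the tab is silently gone.
print(repr(parts.hostname))
```

On such releases no error is raised and the hostname no longer contains the tab.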

Message ID: <whatwg/url/issues/829@github.com>

Received on Friday, 2 August 2024 21:08:27 UTC