Re: [whatwg/url] Add Unicode ToASCII fallback for ASCII domains (PR #914) from Henri Sivonen on 2026-06-25 (public-webapps-github@w3.org from June 2026)

From: Henri Sivonen <notifications@github.com>
Date: Thu, 25 Jun 2026 02:33:16 -0700
To: whatwg/url <url@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/url/pull/914/c4797799634@github.com>

hsivonen left a comment (whatwg/url#914)

> > It looks to me like domain to ASCII and domain to Unicode still reject different domains.
> 
> I don't really understand this feedback as these are quite different operations from the URL standard perspective. Domain to ASCII takes a string and turns it into failure or a domain, whereas domain to Unicode takes a domain and potentially transforms it into a Unicode form or keeps it as-is (and notably never rejects).

Is that kind of definition of 'domain to Unicode' good, though?

Before this PR, the URL Standard just said how to set the UTS 46 flags and provided an ASCII deny list that's not representable as UTS 46 flags. Arguably, it's a spec bug that "domain to Unicode" with _beStrict_ set to false currently doesn't enforce the ASCII deny list. (In the `idna` crate, it does.)

This means that, in the current spec, "domain to ASCII" and "domain to Unicode" accept/reject the same set of inputs, provided that two things hold: the spec bug of "domain to Unicode" with _beStrict_ set to false not enforcing the ASCII deny list is fixed, and there's a DoS-avoidance length limit on labels, which also has the effect of the Punycode encode step in UTS 46 ToASCII never failing. (AFAICT, in UTS 46, _VerifyDnsLength_ and overflows in Punycode encode are the possible sources of differences between "ToASCII" and "ToUnicode". When _beStrict_ is false, as it normally is, the URL Standard passes false for _VerifyDnsLength_, and the Punycode encode failure issue doesn't arise with reasonable label lengths.)

So it's currently the case that the processing that goes towards DNS resolution in a browser and the processing that goes towards the UI in a browser agree on what's rejected and what's accepted.

It seems to me that changing that opens the opportunity for bugs. I don't have an end-to-end scenario; breaking the identity just looks on its face like an opportunity for stuff to go wrong.

Even if the new thing that's easy to define in the "domain to ASCII" case is harder to define in an equivalent way in the "domain to Unicode" case, I think it's worthwhile to make the effort to define "domain to Unicode" in a way that makes both accept and reject the same set of inputs when _beStrict_ is false and the labels have reasonable length (exactly what counts as reasonable length is a bit fuzzy).
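
To make the property concrete, here is a rough sketch of the kind of differential check I have in mind. It is not spec text and not how the `idna` crate is implemented. The names `toy_uts46_ok`, `to_ascii_ok`, `to_unicode_lax_ok`, `to_unicode_fixed_ok` and `disagreements` are all made up for illustration, and `toy_uts46_ok` is a placeholder that accepts everything, standing in for UTS 46 processing with the URL Standard's flags. The deny list is the URL Standard's forbidden domain code points as I understand them, and for simplicity it is checked against the input rather than against the processed result.

```python
# Forbidden domain code points (the ASCII deny list), as I read the URL
# Standard: forbidden host code points, C0 controls, U+0025 (%), U+007F DELETE.
FORBIDDEN_DOMAIN_CODE_POINTS = (
    {chr(c) for c in range(0x00, 0x20)}  # C0 controls, incl. NUL, TAB, LF, CR
    | set(' #/:<>?@[\\]^|')              # remaining forbidden host code points
    | {'%', '\x7f'}
)


def has_forbidden_code_point(domain: str) -> bool:
    """True if the string contains any code point from the ASCII deny list."""
    return any(ch in FORBIDDEN_DOMAIN_CODE_POINTS for ch in domain)


def toy_uts46_ok(domain: str) -> bool:
    """Hypothetical placeholder for UTS 46 processing; accepts everything."""
    return True


def to_ascii_ok(domain: str) -> bool:
    """Stand-in for "domain to ASCII": UTS 46 plus the ASCII deny list."""
    return toy_uts46_ok(domain) and not has_forbidden_code_point(domain)


def to_unicode_lax_ok(domain: str) -> bool:
    """Stand-in for "domain to Unicode" without deny-list enforcement."""
    return toy_uts46_ok(domain)


def to_unicode_fixed_ok(domain: str) -> bool:
    """Stand-in for "domain to Unicode" with the deny list enforced."""
    return toy_uts46_ok(domain) and not has_forbidden_code_point(domain)


def disagreements(inputs, accepts_a, accepts_b):
    """Return the inputs on which the two accept/reject predicates differ."""
    return [s for s in inputs if accepts_a(s) != accepts_b(s)]


SAMPLES = ["example.com", "bücher.example", "exa|mple.com", "a%b.example"]

# Without deny-list enforcement on the Unicode side, the two disagree.
print(disagreements(SAMPLES, to_ascii_ok, to_unicode_lax_ok))
# With it enforced, the accepted and rejected sets match on these samples.
print(disagreements(SAMPLES, to_ascii_ok, to_unicode_fixed_ok))
```

With real implementations plugged in for the two predicates, an empty result over a large corpus (with reasonable label lengths) would be the identity I'm arguing is worth preserving.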

-- 
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/url/pull/914#issuecomment-4797799634

Received on Thursday, 25 June 2026 09:33:20 UTC