Re: [whatwg/url] Issues with UTS #46 tests (#341) from Karl on 2022-05-06 (public-webapps-github@w3.org from May 2022)

From: Karl <notifications@github.com>
Date: Thu, 05 May 2022 18:58:47 -0700
To: whatwg/url <url@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/url/issues/341/1119193904@github.com>

Hmm, I also found what I believe to be an ambiguity in UTS46, which is causing actual implementation divergence, so I sent a Unicode Error Report for that to be clarified.

I think it underscores why integrating these tests into the WPT is important, and aligns with the overall goal of minimising divergence and promoting interoperability. As we well know, the standards are not perfect, implementations are not perfect, but having everybody use the same test-suite and ensuring that they all run it is a good way to catch imperfections in both the standards and the implementations early.

The issues that I'm seeing make me think that perhaps not everybody is running the tests, or they are running older versions, or they have hacks to exclude certain buggy tests, or (and this one can be really subtle) they are testing a _different code-path_ (via a different set of flags) to what is actually used by the URL Standard. Integrating with the WPT would help guard against those issues, and make it simpler even for non-WPT implementations to match web behaviour.

Here's the report I sent on the (possible) UTS46 ambiguity:

<details>
<summary>> Expand: Unicode Error Report</summary>

UTS 46
Version 14.0.0
Date 2021-08-24
Revision 27
https://www.unicode.org/reports/tr46/

I only just started writing my own implementation of this recently, so apologies if I'm misunderstanding, but there are two locations where code-points are checked. Using the same format as the IdnaTestV2.txt file for describing those locations, they would be P1 and V6.

- P1 is applied to the entire domain, as given. So it may see (decoded) Unicode text, or Punycode. It takes the value of UseSTD3ASCIIRules into account, so a domain like "≠ᢙ≯.com" triggers the error at P1 only if UseSTD3ASCIIRules=true. "xn--jbf911clb.com" will never trigger the error at this location, regardless of UseSTD3ASCIIRules, because it is encoded as ASCII.

- V6 is applied to the result of Punycode-decoding a domain label, so it will only see decoded Unicode text. As written, it would appear **not** to take UseSTD3ASCIIRules into consideration, meaning that both "≠ᢙ≯.com" and "xn--jbf911clb.com" would trigger errors at this location, regardless of UseSTD3ASCIIRules.

Here is the text of Section 4.1, Validity Criteria (https://www.unicode.org/reports/tr46/#Validity_Criteria), Step 6:

> Each code point in the label must only have certain status values according to Section 5, IDNA Mapping Table:
> - For Transitional Processing, each value must be valid.
> - For Nontransitional Processing, each value must be either valid or deviation.

It is not clear whether these status values are supposed to take the value of UseSTD3ASCIIRules into account. As described above, if V6 does not consider UseSTD3ASCIIRules, "≠ᢙ≯.com" and "xn--jbf911clb.com" will always be invalid domains. It does not matter that P1 considers UseSTD3ASCIIRules, because the domain will be caught by V6 later anyway.
This leads me to believe that it **should** respect UseSTD3ASCIIRules (otherwise the parameter would be meaningless). But it's not clear.
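
To make the two readings of Step 6 concrete, here is a minimal Python sketch. It is not taken from any real implementation: `TOY_STATUS` is a hand-written excerpt standing in for the IDNA Mapping Table (I am assuming ≠ and ≯ carry the status `disallowed_STD3_valid` and ᢙ is `valid`), and the names `effective_status`, `v6_ok` and `respects_flag` are hypothetical.

```python
# Hypothetical, hand-written excerpt; NOT the real IdnaMappingTable.txt.
TOY_STATUS = {
    "\u2260": "disallowed_STD3_valid",  # NOT EQUAL TO (assumed status)
    "\u226F": "disallowed_STD3_valid",  # NOT GREATER-THAN (assumed status)
    "\u1899": "valid",                  # the Mongolian letter in the example label
}


def effective_status(ch, use_std3_ascii_rules):
    """Resolve the STD3-dependent status using the UseSTD3ASCIIRules flag."""
    status = TOY_STATUS.get(ch, "valid")  # toy default: unknown code points are valid
    if status == "disallowed_STD3_valid":
        return "disallowed" if use_std3_ascii_rules else "valid"
    return status


def v6_ok(label, use_std3_ascii_rules, respects_flag):
    """Validity Criteria Step 6 for a decoded label (Nontransitional Processing).

    respects_flag=False models the literal reading: only the raw table
    status counts, so disallowed_STD3_valid is never acceptable.
    respects_flag=True models the reading where the flag is applied first.
    """
    for ch in label:
        if respects_flag:
            status = effective_status(ch, use_std3_ascii_rules)
        else:
            status = TOY_STATUS.get(ch, "valid")
        if status not in ("valid", "deviation"):
            return False
    return True


LABEL = "\u2260\u1899\u226F"  # the first label of the example domain

for flag in (True, False):
    literal = v6_ok(LABEL, flag, respects_flag=False)
    flag_aware = v6_ok(LABEL, flag, respects_flag=True)
    print(f"UseSTD3ASCIIRules={flag}: literal={literal}, flag-aware={flag_aware}")
```

Under the literal reading the label fails V6 whatever the flag is set to, which is the situation where the parameter stops mattering for this domain. Under the flag-aware reading the label is only rejected when UseSTD3ASCIIRules=true.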

I'll have to apologise again because I am not very familiar with the codebases I am about to cite, but from what I can glean this is leading to divergence in the wild:

- Unicode-org implementation of IDNA does not appear to consider UseSTD3ASCIIRules here:
https://github.com/unicode-org/unicodetools/blob/main/unicodetools/src/main/java/org/unicode/idna/Uts46.java#L610-L625

- This seems to be confirmed by the IdnaTestV2 file. For example, Version 14.0.0 (Date: 2021-08-17, 19:34:01 GMT) lines 571 and 573:

```
[571] xn--jbf911clb.xn----p9j493ivi4l; ≠ᢙ≯.솣-ᡴⴀ; [V6]; xn--jbf911clb.xn----p9j493ivi4l; ; ; # ≠ᢙ≯.솣-ᡴⴀ
[573] xn--jbf911clb.xn----6zg521d196p; ≠ᢙ≯.솣-ᡴႠ; [V6]; xn--jbf911clb.xn----6zg521d196p; ; ; # ≠ᢙ≯.솣-ᡴႠ
```

"V6" is not an optional validation step tied to any parameter; implementations cannot decide whether or not it applies to them. It always applies, and these tests are saying these domains should always be considered invalid IIUC.
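
For anyone who wants to read those test lines field by field, here is a rough Python sketch of a parser. The column order and the "blank means inherit" rules are my reading of the header comments in IdnaTestV2.txt, so treat them as assumptions; `parse_idna_test_line` and `parse_status` are hypothetical names, and the sketch ignores the escape sequences the real file can contain. The `[571]` prefix above is only my line-number annotation, so it is left out of the input.

```python
def parse_status(field, inherited):
    """Turn '[V6]' or '[P1, V6]' into a set of codes; a blank field inherits."""
    if field == "":
        return inherited
    inner = field.strip("[]")
    return frozenset(code.strip() for code in inner.split(",") if code.strip())


def parse_idna_test_line(line):
    """Split one data line into its seven columns (assumed order)."""
    data = line.split("#", 1)[0]  # drop the trailing comment
    fields = [f.strip() for f in data.split(";")]
    source, to_unicode, uni_status, ascii_n, n_status, ascii_t, t_status = fields[:7]

    to_unicode = to_unicode or source          # blank: same as source
    to_unicode_status = parse_status(uni_status, frozenset())
    to_ascii_n = ascii_n or to_unicode         # blank: same as toUnicode
    to_ascii_n_status = parse_status(n_status, to_unicode_status)
    to_ascii_t = ascii_t or to_ascii_n         # blank: same as toAsciiN
    to_ascii_t_status = parse_status(t_status, to_ascii_n_status)

    return {
        "source": source,
        "to_unicode": to_unicode,
        "to_unicode_status": to_unicode_status,
        "to_ascii_n": to_ascii_n,
        "to_ascii_n_status": to_ascii_n_status,
        "to_ascii_t": to_ascii_t,
        "to_ascii_t_status": to_ascii_t_status,
    }


LINE_571 = (
    "xn--jbf911clb.xn----p9j493ivi4l; ≠ᢙ≯.솣-ᡴⴀ; [V6]; "
    "xn--jbf911clb.xn----p9j493ivi4l; ; ; # ≠ᢙ≯.솣-ᡴⴀ"
)

record = parse_idna_test_line(LINE_571)
print(sorted(record["to_unicode_status"]))
print(sorted(record["to_ascii_n_status"]))
print(sorted(record["to_ascii_t_status"]))
```

If the inheritance rules are as assumed, the single `[V6]` entry propagates to all three status columns, so the test expects the error for toUnicode and for both toASCII variants.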

- JSDOM implementation does consider UseSTD3ASCIIRules here, and thinks this domain is valid:
https://github.com/jsdom/tr46/blob/e937be8d9c04b7938707fc3701e50118b7c023a5/index.js#L100

- Browsers effectively do the same in URLs. Safari 15 and JSDOM both consider "http://≠ᢙ≯.com.xn--jbf911clb" to be a perfectly fine URL:
https://jsdom.github.io/whatwg-url/#url=aHR0cDovL+KJoOGimeKJry5jb20ueG4tLWpiZjkxMWNsYg==&base=YWJvdXQ6Ymxhbms=

</details>

--
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/url/issues/341#issuecomment-1119193904

Received on Friday, 6 May 2022 01:59:00 UTC