[whatwg/url] Issues with UTS #46 tests (#341) from Timothy Gu on 2017-08-16 (public-webapps-github@w3.org from August 2017)

From: Timothy Gu <notifications@github.com>
Date: Wed, 16 Aug 2017 02:03:19 +0000 (UTC)
To: whatwg/url <url@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/url/issues/341@github.com>
I wrote to Unicode through their [feedback form](http://www.unicode.org/reporting.html) on July 30 with the details. I haven't heard back, despite their promise that "You can expect an acknowledgement of your report within 2-3 business days."

```
Subject: Issues with UTS #46's conformance test file

To whom it may concern,

While developing a product conforming to "UTS #46: Unicode IDNA Compatibility
Processing, Version 10.0.0" [UTS46], we noticed a few issues with the provided
conformance testing file (IdnaTest.txt). These issues are preventing us from
implementing UTS #46 in tr46.js [TR46JS-ISSUE].

The IdnaTest.txt file is formatted as a list of semicolon-separated values. The
meanings of the specific columns are given in UTS #46 Section 8.1, an excerpt
of which is hereby reproduced [UTS46]:

> No Field      Description
> ...
> 3  toUnicode  The result of applying toUnicode to the source, using
>               "nontransitional".
>               A blank value means the same as the source value; a value in
>               [...] is a set of error codes.
> 4  toASCII    The result of applying toASCII to the source, using the
>               specified type: T, N, or B.
>               A blank value means the same as the toUnicode value; a value in
>               [...] is a set of error codes.
>
> ...
>
> An error in toUnicode or toASCII is indicated by an error list of the form
> [...]. In such a case, the contents of that list are error codes based on the
> step numbers in UTS46 and IDNA2008:
>
>     ...
>     An for Section 4.2 ToASCII, step n
>     ...

Given that "An" applies only to the ToASCII algorithm, not the ToUnicode
algorithm, it seems appropriate for field "toUnicode" in IdnaTest.txt to never
have an error code of form An. Yet, in the published IdnaTest.txt file
corresponding to version 10.0.0 [IDNA-TEST], there are 305 entries where an
"An" error code appears under "toUnicode". In particular, 36 entries have
_only_ an "An" error code under "toUnicode", which means that the only
justification given for ToUnicode erroring on those entries is a step that is
not actually part of ToUnicode.

This is particularly troubling: while the Standard allows for error cases in
ADDITION to those already specified in IdnaTest.txt, a product conforming
to UTS #46 must produce an error on ALL error cases in IdnaTest.txt, per lines
68-72 of IdnaTest.txt, again reproduced below:

> ... Thus to then verify conformance for the toASCII and toUnicode columns:
>
> - If the file indicates an error, the implementation must also have an error.
> - If the file does not indicate an error, then the implementation must either
>   have an error, or must have a matching result.

A close examination of the 36 entries mentioned above reveals that:

- 9 of the 36 entries have only an "[A3]" error code under ToUnicode, which
  corresponds to the Punycode-encoding step in ToASCII. The source domains all
  have one label with invalid Punycode encoding, though, so they would in fact
  have already recorded an error in step 4 of the Processing Steps, which
  ToUnicode calls as well. In other words, these entries merely have a faulty
  error code; ToUnicode would still record an error for these entries, just
  at a different step than advertised.

  Some samples from these 9 entries are:

  Line 313: B; xn--0.pt; [A3]; [A3]
  Line 315: B; xn--a-Ä.pt; [A3]; [A3]
  Line 316: B; xn--a-A\u0308.pt; [A3]; [A3]

- The other 27 entries have only a "[A4_2]" error code under ToUnicode, which
  corresponds to the DNS length verification step under ToASCII. Some of them
  are:

  Line 201: B; 。; [A4_2]; [A4_2]
  Line 202: B; .; [A4_2]; [A4_2]
  Line 434: B; a..c; [A4_2]; [A4_2]
  Line 439: B; ä.\u00AD.c; [A4_2]; [A4_2]

  While these domain names are all rather unlikely to be allowed by real-world
  UTS #46 implementations, most (if not all) of them are still strictly allowed
  by ToUnicode as defined in UTS #46.
  
  Take line 201, for example. Step 1 of ToUnicode calls into the Processing
  Steps, whose step 1 will map '。' to '.', and which will then pass through
  the rest of the Processing Steps without recording an error. Step 2 of
  ToUnicode will then produce a "converted Unicode string" of '.', and signal
  that there was no error.

The 27 entries in IdnaTest.txt with [A4_2] are the truly worrying ones, since
they seem to contradict the algorithms defined in UTS #46: an implementation
that follows those algorithms strictly cannot pass the Standard's own
conformance tests.

To resolve these issues, I would like to see the following:

- A clarification of whether the aforementioned 27 entries should record an
  error in ToUnicode.
- Corresponding changes to IdnaTest.txt or UTS #46 that accompany that
  clarification.
- No entry in IdnaTest.txt carrying a ToUnicode error code that points to a
  step in ToASCII.

Sincerely,

Timothy Gu

[UTS46]: http://www.unicode.org/reports/tr46/tr46-19.html
[IDNA-TEST]: http://www.unicode.org/Public/idna/10.0.0/IdnaTest.txt
[TR46JS-ISSUE]: https://github.com/Sebmaster/tr46.js/pull/13
```
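
The counts in the letter come from scanning the toUnicode column for "An" error codes. A minimal sketch of that scan is below, in Python rather than the JavaScript of tr46.js. The helper names (`parse_line`, `error_codes`, `classify`) are hypothetical, and the parsing rules (comments starting with `#`, codes separated by spaces or commas inside `[...]`) are assumptions about the file layout, not taken from UTS #46. The sample lines are the ones quoted in the letter.

```python
import re
from collections import Counter

# Error codes that refer to ToASCII steps, e.g. "A3" or "A4_2".
TOASCII_CODE = re.compile(r"A\d+(?:_\d+)?")


def parse_line(line):
    """Split one data line into (type, source, toUnicode, toASCII)."""
    data = line.split("#", 1)[0].strip()  # assumption: '#' starts a comment
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    fields += [""] * (4 - len(fields))  # blank trailing fields may be omitted
    return tuple(fields[:4])


def error_codes(field):
    """Return the set of error codes if the field is an error list."""
    if field.startswith("[") and field.endswith("]"):
        return {c for c in re.split(r"[,\s]+", field[1:-1]) if c}
    return set()


def classify(to_unicode_field):
    """'none', 'mixed', or 'only', by how "An" codes appear in toUnicode."""
    codes = error_codes(to_unicode_field)
    a_codes = {c for c in codes if TOASCII_CODE.fullmatch(c)}
    if not a_codes:
        return "none"
    return "only" if a_codes == codes else "mixed"


# Sample entries quoted in the letter (raw strings keep the \u escapes as-is).
samples = [
    r"B; xn--0.pt; [A3]; [A3]",
    r"B; xn--a-A\u0308.pt; [A3]; [A3]",
    r"B; 。; [A4_2]; [A4_2]",
    r"B; a..c; [A4_2]; [A4_2]",
]

tally = Counter(classify(parse_line(s)[2]) for s in samples)
print(tally)
```

Run against the full IdnaTest.txt, the "only" and "mixed" buckets together would correspond to the 305 entries mentioned in the letter, and "only" to the 36, provided the parsing assumptions hold.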

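The line 201 argument in the letter can be illustrated with a toy model. This is a sketch under heavy simplification, not an implementation of UTS #46: `toy_to_unicode` only maps U+3002 to '.' (the one mapping the letter mentions) and records no errors, `label_length_errors` is a hypothetical stand-in for the ToASCII label-length check behind "A4_2" that ignores any special handling of a root label, and `conforms` encodes the two conformance rules quoted from IdnaTest.txt.

```python
# Toy model only: the real UTS #46 mapping table is far larger.
DOT_VARIANTS = {"\u3002": "."}


def toy_to_unicode(domain):
    """Map U+3002 to '.' and record no errors (simplified Processing step 1)."""
    mapped = "".join(DOT_VARIANTS.get(ch, ch) for ch in domain)
    return mapped, []


def label_length_errors(ascii_domain):
    """Simplified label-length check: every label must be 1 to 63 long."""
    labels = ascii_domain.split(".")
    if any(not 1 <= len(label) <= 63 for label in labels):
        return ["A4_2"]
    return []


def conforms(file_has_error, file_value, impl_has_error, impl_value):
    """Conformance rules quoted from IdnaTest.txt for one column."""
    if file_has_error:
        # The file indicates an error: the implementation must also error.
        return impl_has_error
    # Otherwise the implementation must error or produce a matching result.
    return impl_has_error or impl_value == file_value


result, errors = toy_to_unicode("\u3002")
print(repr(result), errors)          # '.' []
print(label_length_errors(result))   # ['A4_2']

# IdnaTest.txt line 201 lists [A4_2] under toUnicode, so an implementation
# that returns '.' without an error does not conform.
print(conforms(True, None, bool(errors), result))  # False
```

In this model the only way to satisfy line 201 is to run the length check inside ToUnicode as well, which is the discrepancy the letter asks Unicode to clarify.
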
Received on Wednesday, 16 August 2017 02:04:03 UTC