- From: Timothy Gu <notifications@github.com>
- Date: Wed, 16 Aug 2017 02:03:19 +0000 (UTC)
- To: whatwg/url <url@noreply.github.com>
- Cc: Subscribed <subscribed@noreply.github.com>
- Message-ID: <whatwg/url/issues/341@github.com>
Wrote a letter to Unicode through their [feedback form](http://www.unicode.org/reporting.html) on July 30 with the details. Haven't heard back despite their promises that "You can expect an acknowledgement of your report within 2-3 business days."

```
Subject: Issues with UTS #46's conformance test file

To whoever it may concern,

While developing a product conforming to "UTS #46: Unicode IDNA Compatibility
Processing, Version 10.0.0" [UTS46], we noticed a few issues with the provided
conformance testing file (IdnaTest.txt). These issues are preventing us from
implementing UTS #46 in tr46.js [TR46JS-ISSUE].

The IdnaTest.txt file is formatted as a list of semicolon-separated values. The
meanings of the specific columns are given in UTS #46 Section 8.1, an excerpt
of which is hereby reproduced [UTS46]:

> No  Field      Description
> ...
> 3   toUnicode  The result of applying toUnicode to the source, using
>                "nontransitional".
>                A blank value means the same as the source value; a value in
>                [...] is a set of error codes.
> 4   toASCII    The result of applying toASCII to the source, using the
>                specified type: T, N, or B.
>                A blank value means the same as the toUnicode value; a value in
>                [...] is a set of error codes.
>
> ...
>
> An error in toUnicode or toASCII is indicated by an error list of the form
> [...]. In such a case, the contents of that list are error codes based on the
> step numbers in UTS46 and IDNA2008:
>
>   ...
>   An for Section 4.2 ToASCII, step n
>   ...

Given that "An" applies only to the ToASCII algorithm, not the ToUnicode
algorithm, it seems appropriate for field "toUnicode" in IdnaTest.txt to never
have an error code of form An. Yet, in the published IdnaTest.txt file
corresponding to version 10.0.0 [IDNA-TEST], there exist 305 entries in
IdnaTest.txt where an "An" error code appears under "toUnicode".

In particular, there exist 36 entries with _only_ an "An" error code under
"toUnicode" -- which, in other words, means that the only justification for
erroring on those entries from ToUnicode is not actually in ToUnicode. This is
particularly troubling, since while the Standard allows for error cases in
ADDITION to the ones already specified in IdnaTest.txt, a product conforming to
UTS #46 must produce an error on ALL error cases in IdnaTest.txt, per lines
68-72 of IdnaTest.txt, again reproduced below:

> ... Thus to then verify conformance for the toASCII and toUnicode columns:
>
> - If the file indicates an error, the implementation must also have an error.
> - If the file does not indicate an error, then the implementation must either
>   have an error, or must have a matching result.

A close examination of the 36 entries mentioned above reveals that:

- 9 of the 36 entries have only an "[A3]" error code under ToUnicode, which
  corresponds to the Punycode-encoding step in ToASCII. The source domains all
  have one label with invalid Punycode-encoding though, so they would in fact
  have already recorded an error in step 4 of Processing Steps, which is called
  upon by ToUnicode as well. In other words, these entries merely have a faulty
  error code; ToUnicode would still record an error for these entries, just one
  at a different step than advertised. Some samples from these 9 entries are:

    Line 313: B; xn--0.pt; [A3]; [A3]
    Line 315: B; xn--a-Ä.pt; [A3]; [A3]
    Line 316: B; xn--a-A\u0308.pt; [A3]; [A3]

- The other 27 entries have only an "[A4_2]" error code under ToUnicode, which
  corresponds to the DNS length verification step under ToASCII. Some of them
  are:

    Line 201: B; 。; [A4_2]; [A4_2]
    Line 202: B; .; [A4_2]; [A4_2]
    Line 434: B; a..c; [A4_2]; [A4_2]
    Line 439: B; ä.\u00AD.c; [A4_2]; [A4_2]

  While these domain names are all rather unlikely to be allowed by real-world
  UTS #46 implementations, most (if not all) of them are still strictly allowed
  by ToUnicode as defined in UTS #46.

Take line 201, for example. Step 1 of ToUnicode calls into the Processing
Steps, whose step 1 will map '。' to '.', and which will then pass through the
rest of Processing Steps without recording an error. Step 2 of ToUnicode will
then produce a "converted Unicode string" of '.', and signal there was no
error.

The 27 entries in IdnaTest.txt with [A4_2] are the real worrying ones, since
they seem to go against the algorithms defined in UTS #46, and prevent us from
creating a strict implementation of UTS #46 that also passes its own
conformance tests.

To resolve these issues, I would like to see the following:

- A clarification whether the aforementioned 27 entries should record an error
  in ToUnicode.
- Corresponding changes to IdnaTest.txt or UTS #46 that accompany that
  clarification.
- That there be no entries in IdnaTest.txt with a ToUnicode error code that
  points to a step in ToASCII.

Sincerely,

Timothy Gu

[UTS46]: http://www.unicode.org/reports/tr46/tr46-19.html
[IDNA-TEST]: http://www.unicode.org/Public/idna/10.0.0/IdnaTest.txt
[TR46JS-ISSUE]: https://github.com/Sebmaster/tr46.js/pull/13
```

--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/url/issues/341
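
The letter counts entries whose toUnicode column carries an "An" code, and entries where that column carries only "An" codes. One way to gather such a tally is sketched below in Python. This is an illustration only: the helper names (`error_codes`, `tally_toascii_codes`) are made up for this sketch, the separator between codes inside `[...]` and the handling of `#` comments are assumptions about the file format, and the script has not been checked against the 305 and 36 figures quoted in the letter.

```python
import re

# Codes of the form "An" or "An_m" (e.g. A3, A4_2) name ToASCII steps,
# following the error-code convention quoted in the letter.
TOASCII_CODE = re.compile(r"A\d+(?:_\d+)?")


def error_codes(field):
    """Return the codes in a '[...]' error list, or [] if the field is not one."""
    field = field.strip()
    if len(field) < 2 or field[0] != "[" or field[-1] != "]":
        return []
    # Assumption: codes are separated by whitespace and/or commas.
    return [c for c in re.split(r"[,\s]+", field[1:-1]) if c]


def tally_toascii_codes(lines):
    """Count entries whose toUnicode field has any, or only, ToASCII codes."""
    with_any = only = 0
    for line in lines:
        # Assumption: '#' starts a comment and never occurs inside a field.
        data = line.split("#", 1)[0].strip()
        fields = data.split(";")
        if len(fields) < 3:
            continue  # blank line, comment, or malformed entry
        codes = error_codes(fields[2])  # third column is toUnicode
        flags = [TOASCII_CODE.fullmatch(c) is not None for c in codes]
        if any(flags):
            with_any += 1
            if all(flags):
                only += 1
    return with_any, only


# Two entries quoted in the letter, plus one hypothetical mixed entry and one
# hypothetical error-free entry for contrast.
SAMPLE = [
    "B; xn--0.pt; [A3]; [A3]",
    "B; a..c; [A4_2]; [A4_2]",
    "B; example; [A3 V6]; [A3 V6]",  # hypothetical
    "B; abc; ; ",                    # hypothetical
]
print(tally_toascii_codes(SAMPLE))  # prints (3, 2)
```

To try it on the real file, pass an open file object instead of `SAMPLE`, for example `tally_toascii_codes(open("IdnaTest.txt", encoding="utf-8"))`, after downloading the file from the [IDNA-TEST] URL given in the letter.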
Received on Wednesday, 16 August 2017 02:04:03 UTC