- From: Andrew Raffensperger <notifications@github.com>
- Date: Thu, 15 Jun 2023 01:06:15 -0700
- To: whatwg/url <url@noreply.github.com>
- Cc: Subscribed <subscribed@noreply.github.com>
- Message-ID: <whatwg/url/issues/776/1592565775@github.com>
* There are 3K+ RGI emoji, and 1/3 of them involve ZWJ sequences. CheckJoiners exchanges a few exotic characters (that can easily be enforced at the registrar level) for 1350 emoji sequences that are used internationally by billions of people.
* RFC 5892 is both outdated (2010) and misguided. AFAICT it's trying to allow ZW(N)J for typographical reasons, yet I don't think there's any ambiguity with or without a joiner.
* Are there any registrars that allow both virama with and without ZWNJ as separate names? (No.)
* How many actual domains benefit from this rule?
* If you look across the internet, there are thousands of developer hours wasted on deciding these choices one way or another, but at the end of the day, CheckJoiners is just a convoluted way to disallow `200C` and `200D`.

---

For a concrete example: `1F468 200D 1F4BB`

![image](https://github.com/whatwg/url/assets/225900/91845412-aebd-42ce-a9e9-c41e8549ff9b)

* This emoji was released in 2016 (7 years ago).
* Major browsers don't agree on its validity: compare Chrome/Brave vs Safari/Firefox.
* The [punycode](https://adraffy.github.io/punycode.js/test/demo.html#u=%F0%9F%91%A8%E2%80%8D%F0%9F%92%BB) of this emoji is `xn--1ugz855pfha`.
* This emoji is invalid with CheckJoiners.
* In some browsers, this encodes as `xn--qq8hgf`, which is [wrong](https://adraffy.github.io/punycode.js/test/demo.html#p=xn--qq8hgf): `1F468 1F4BB` is not the same as `1F468 200D 1F4BB`.
* NodeJS recently switched to [Ada](https://github.com/ada-url/ada), which follows the WHATWG URL standard. This means that even if you correctly punycode the domain, a WHATWG URL implementation will prevent its use, even though the punycode is valid and the domain is DNS compatible.

![image](https://github.com/whatwg/url/assets/225900/cf302aea-5df2-46d5-9fc1-3675762b4ef4)

* In general, the validity of URLs seems to change randomly between browser releases as libraries are periodically replaced and the standards aren't clear.
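
The two encodings in the example above can be reproduced with a minimal sketch using Python's built-in `punycode` codec. This is raw RFC 3492 encoding with the `xn--` prefix added by hand; it performs no UTS-46 mapping or validation, and the helper name `to_a_label` is made up for illustration.

```python
# Code points from the example above: 1F468, 200D (ZWJ), 1F4BB.
with_zwj = "\U0001F468\u200D\U0001F4BB"
without_zwj = "\U0001F468\U0001F4BB"


def to_a_label(label: str) -> str:
    """Raw Punycode (RFC 3492) plus the ACE prefix.

    Hypothetical helper: no UTS-46 mapping, no CheckJoiners, no validation.
    """
    return "xn--" + label.encode("punycode").decode("ascii")


print(to_a_label(with_zwj))     # xn--1ugz855pfha
print(to_a_label(without_zwj))  # xn--qq8hgf

# Decoding the shorter label shows that the joiner is gone.
decoded = b"qq8hgf".decode("punycode")
print([f"{ord(ch):X}" for ch in decoded])  # ['1F468', '1F4BB']
```

Dropping the joiner before encoding therefore yields a different label, and so a different DNS name.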
---

**The simplest solution is that `CheckJoiners` should be `false`.**

* Any name with a joiner is already punycode.
* UTS-46 provides poor guidance regarding spoofs and confusables and has forced developers to implement various parts of UAX-39 and their own logic to decide when to display punycode as Unicode.
* UTS-46 advice about validating punycode is also strange because name validity is a registrar problem, not a resolution problem.
* This is a disaster for the end-user because the rules are constantly changing, yet at the same time, there are thousands of confusables and mixed-script spoofs that slip right through the implemented standards.

---

For reference, I recently implemented a [normalization standard](https://github.com/adraffy/ensip-15/blob/master/ens-improvement-proposals/ensip-15-normalization-standard.md) for the [Ethereum Name Service](https://ens.domains/) ecosystem. I used a combination of UTS-51 + UTS-46 + a significantly safer character set (banned punctuation, parens, brackets, vocalizations, obsolete, deprecated, ancient, reversed, turned, flipped, many ligatures, etc.) + an intelligent confusable system (that isn't just a warning system: e.g. `rn` is a footgun confusable).

[Demo](https://adraffy.github.io/ens-normalize.js/test/resolver.html) | [Github](https://github.com/adraffy/ens-normalize.js)

From my experience with the Unicode and RFC documentation, **the primary source of confusion and bugs is the documentation itself.** Many of these rules should be deprecated, and the rules that remain should be clarified and modernized.

I think WHATWG made the correct decision with `AllowHyphens` and finally broke away from archaic DNS rules. I think they should do the same with `CheckJoiners`.

If the WHATWG really wants to protect end-users, it should recommend UTS-51 RGI pre-processing and outright disallow ZW(N)J outside of emoji.
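
As a rough illustration of that last proposal (not the ENSIP-15 algorithm and not anything specified by WHATWG or UTS-46), the check could look like the sketch below. `RGI_ZWJ_SEQUENCES` is a hypothetical one-entry stand-in for the full RGI emoji ZWJ sequence list published in the Unicode emoji data files, and the function name is made up for this example.

```python
ZWNJ, ZWJ = "\u200C", "\u200D"

# Hypothetical stand-in: a real implementation would load the full set of
# RGI emoji ZWJ sequences instead of hard-coding one entry.
RGI_ZWJ_SEQUENCES = {
    "\U0001F468\u200D\U0001F4BB",  # 1F468 200D 1F4BB, the example from this thread
}


def joiners_only_in_emoji(label: str) -> bool:
    """Return True if every ZWJ/ZWNJ in `label` is part of a listed emoji sequence."""
    # Mask longer sequences first so a shorter one cannot split a longer one.
    for seq in sorted(RGI_ZWJ_SEQUENCES, key=len, reverse=True):
        # Use a placeholder so the characters around a match do not merge.
        label = label.replace(seq, "\uFFFC")
    # Any joiner still present sits outside a known emoji sequence.
    return ZWJ not in label and ZWNJ not in label


print(joiners_only_in_emoji("\U0001F468\u200D\U0001F4BB"))  # True
print(joiners_only_in_emoji("a\u200Db"))                    # False
```

Under this scheme a label is rejected only when a joiner survives the emoji pass, which removes the need for the script-specific context rules that CheckJoiners relies on.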
-- 
View it on GitHub: https://github.com/whatwg/url/issues/776#issuecomment-1592565775
Received on Thursday, 15 June 2023 08:06:22 UTC