- From: <bugzilla@jessica.w3.org>
- Date: Wed, 30 Mar 2016 09:38:47 +0000
- To: public-i18n-core@w3.org
https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489

--- Comment #40 from Martin Dürst <duerst@it.aoyama.ac.jp> ---
[sorry for the duplication of the last comment; there was some hiccup with cookies]

(In reply to Mathias Bynens from comment #36)
> (In reply to Anne from comment #35)
> > Mathias, any chance you could give us an updated regular expression?
>
> A generalized regex seems tricky to do, since each TLD can theoretically
> have its own set of allowed symbols.

Yes. And not only theoretically. In particular, country code TLDs usually only allow the symbols they are in one way or another familiar with. As an example, .jp restricts second-level IDNs to Japanese, which excludes dürst.jp. At lower levels, there may again be more or fewer restrictions. So the restriction at .jp would in no way make it impossible for me to set up a domain dürst.sw.it.aoyama.ac.jp, because I control sw.it.aoyama.ac.jp.

But I think that shows that the only thing we can sensibly do is check against the restrictions given by the underlying protocol. Even for ASCII addresses, checking whether the address works, including checking whether the domain name actually exists, is done on the server side.

> See
> https://www.verisign.com/en_US/channel-resources/domain-registry-products/idn/idn-policy/registration-rules/index.xhtml
> for more info.
>
> The default list of allowed IDN symbols as used by Verisign (i.e. applies to
> .com and then some) can be found here:
> https://www.verisign.com/assets/allowedcode/idn-allowed-code-points.html

They essentially list every single CJK ideograph on a separate line, a great waste of space and bandwidth. The same goes for other scripts, although the waste there isn't that big.

> Here’s a regex based on that:
> https://github.com/mathiasbynens/idn-allowed-code-points-regex/blob/master/index.js

I took a cursory look through that.
The main reason that it's long is that it eliminates upper-case letters, which in many areas of Unicode come in pairs with lower case, leading to bad aggregation. This raises the question of whether it might be a good idea to have the browser apply the mapping rules (mostly lowercasing, but also potentially other stuff such as half-width kana -> full-width kana, etc.) that it uses for domain names in the address bar.

BTW, it would be better to base your regexp on http://www.iana.org/assignments/idna-tables-6.3.0/idna-tables-6.3.0.xhtml#idna-tables-properties. I expect it to be mostly the same, but the latter is more official. It might be interesting to look at the differences.

Also, at the end of your regexp, I noted you used surrogate pairs explicitly. Ideally, we would write the regexp in terms of Unicode code points, but if this UTF-16-based notation is what is needed, I won't complain anymore.

One last comment: please note that all the above applies to the right-hand side of the e-mail address (i.e. the domain name part) only. On the left-hand side, for example, upper-case characters etc. are allowed.
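
To make the last two points concrete, here is a rough sketch of what "apply the mapping to the domain part only" could look like. It is only an illustration: the function name `map_domain_part` is made up, and NFKC normalization plus lowercasing is an assumed stand-in for whatever mapping a browser actually applies to domain names (it is not the real IDNA mapping, just close enough to show lowercasing and half-width -> full-width kana).

```python
import unicodedata


def map_domain_part(address: str) -> str:
    """Map only the right-hand side of an e-mail address (hypothetical helper)."""
    # Split at the last "@": everything before it is the local part,
    # which is left untouched (upper case etc. is allowed there).
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("expected a local part and a domain separated by '@'")
    # Assumption: NFKC folds compatibility forms such as half-width kana,
    # and lower() handles case. Real IDNA mapping rules are more involved.
    mapped = unicodedata.normalize("NFKC", domain).lower()
    return f"{local}@{mapped}"


if __name__ == "__main__":
    # The local part keeps its capital "M"; only the domain is lowercased.
    print(map_domain_part("Martin@D\u00dcRST.example"))
```

With such a mapping applied first, a validation regexp would only need to cover the lower-case (and otherwise mapped) code points, which should aggregate much better.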
Received on Wednesday, 30 March 2016 09:38:51 UTC