[Bug 15489] forms: <input type=email> validation needs to be updated for EAI

https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489

--- Comment #40 from Martin Dürst <duerst@it.aoyama.ac.jp> ---
[sorry for the duplication of the last comment; there was some hiccup with
cookies]

(In reply to Mathias Bynens from comment #36)
> (In reply to Anne from comment #35)
> > Mathias, any chance you could give us an updated regular expression?
> 
> A generalized regex seems tricky to do, since each TLD can theoretically
> have its own set of allowed symbols.

Yes. And not only theoretically. In particular, country code TLDs usually only
allow the symbols they are in one way or another familiar with. As an example,
.jp restricts second-level IDNs to Japanese, which excludes dürst.jp.

At lower levels, there may again be more or fewer restrictions. So the
restriction at .jp would in no way make it impossible for me to set up a domain
dürst.sw.it.aoyama.ac.jp, because I control sw.it.aoyama.ac.jp.

But I think that shows that the only thing we can do sensibly is check against
the restrictions given by the underlying protocol. Even for ASCII addresses,
checking whether the address works, including checking whether the domain name
actually exists, is done on the server side.

> See
> https://www.verisign.com/en_US/channel-resources/domain-registry-products/
> idn/idn-policy/registration-rules/index.xhtml for more info.
> 
> The default list of allowed IDN symbols as used by Verisign (i.e. applies to
> .com and then some) can be found here:
> https://www.verisign.com/assets/allowedcode/idn-allowed-code-points.html

They essentially list every single CJK ideograph on a separate line, a great
waste of space and bandwidth. The same goes for other scripts, although the
waste there isn't that big.

> Here’s a regex based on that:
> https://github.com/mathiasbynens/idn-allowed-code-points-regex/blob/master/
> index.js

I took a cursory look through that. The main reason that it's long is that it
eliminates upper-case letters, which in many areas of Unicode come in pairs
with lower case, leading to bad aggregation.
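
To illustrate the aggregation problem, here is a minimal sketch that counts how
many contiguous ranges are needed for one block with and without upper-case
letters. The block (Latin Extended-A) and the helper names countRuns and
isUpperCase are my own choices for illustration, and "lowercasing changes the
character" is only a rough stand-in for "is upper case"; none of this is taken
from the linked regexp.

```javascript
// Count maximal runs of consecutive code points in [start, end]
// for which keep(cp) is true. Each run is one range in a character class.
function countRuns(start, end, keep) {
  let runs = 0;
  let inRun = false;
  for (let cp = start; cp <= end; cp++) {
    const ok = keep(cp);
    if (ok && !inRun) runs++;
    inRun = ok;
  }
  return runs;
}

// Assumption: treat a code point as upper case if lowercasing changes it.
function isUpperCase(cp) {
  const ch = String.fromCodePoint(cp);
  return ch.toLowerCase() !== ch;
}

const START = 0x0100; // first code point of Latin Extended-A
const END = 0x017f;   // last code point of Latin Extended-A

const runsAll = countRuns(START, END, () => true);
const runsWithoutUpper = countRuns(START, END, (cp) => !isUpperCase(cp));

// The whole block is a single range; without upper case it falls apart
// into many short ranges because upper and lower case alternate.
console.log(runsAll, runsWithoutUpper);
```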

This raises the question of whether it might be a good idea to have the
browser apply the mapping rules (mostly lowercasing, but potentially also other
things such as half-width kana -> full-width kana) that it uses for domain
names in the address bar.
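
As a rough sketch of what such a mapping step could look like before the
regexp is applied: the function name mapDomain is hypothetical, and NFKC
normalization followed by lowercasing is only an approximation that I am
assuming here. The mapping a browser actually applies to domain names has
more rules than this.

```javascript
// Approximate mapping of a domain name before validation.
// Assumption: NFKC folds compatibility forms such as half-width kana
// and full-width Latin letters; toLowerCase() then removes case differences.
function mapDomain(domain) {
  return domain.normalize("NFKC").toLowerCase();
}

// Upper-case input maps to the lower-case form.
console.log(mapDomain("D\u00DCRST.Example"));

// Half-width katakana maps to full-width katakana.
console.log(mapDomain("\uFF76\uFF80\uFF76\uFF85.example"));
```

With such a step in front, the regexp would only have to describe the mapped
(lower-case) forms, which is what the allowed-code-point lists describe anyway.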

BTW, it would be better to base your regexp on
http://www.iana.org/assignments/idna-tables-6.3.0/idna-tables-6.3.0.xhtml#idna-tables-properties.
I expect it to be mostly the same, but the latter is more official. It might be
interesting to look at the differences.

Also, at the end of your regexp, I noted you used surrogate pairs explicitly.
Ideally, we would write the regexp in terms of Unicode code points, but if this
UTF-16-based notation is what is needed, I won't complain anymore.
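
For comparison, here is a small sketch of the two notations for one
supplementary-plane range. The range U+20000..U+2A6DF (CJK Extension B) is
picked only as an example and is not copied from the regexp under discussion;
the code-point form assumes an engine that supports the ES2015 `u` flag.

```javascript
// Code-point notation: needs the "u" flag so that \u{...} and the range
// are interpreted as whole code points.
const byCodePoint = /^[\u{20000}-\u{2A6DF}]+$/u;

// Equivalent UTF-16 notation: the same range spelled out as surrogate pairs.
// U+20000 is D840 DC00, U+2A6DF is D869 DEDF.
const bySurrogates = /^(?:[\uD840-\uD868][\uDC00-\uDFFF]|\uD869[\uDC00-\uDEDF])+$/;

const inside = String.fromCodePoint(0x20bb7);  // a code point within the range
const outside = String.fromCodePoint(0x2a6e0); // first code point after the range

console.log(byCodePoint.test(inside), bySurrogates.test(inside));
console.log(byCodePoint.test(outside), bySurrogates.test(outside));
```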

One last comment: please note that all the above applies to the right-hand side
of the e-mail address (i.e. the domain name part) only. On the left-hand side,
upper-case characters, for example, are allowed.
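
A minimal sketch of keeping the two sides apart: the names splitAddress and
normalizeDomainOnly are hypothetical, splitting at the last "@" is an
assumption on my part, and plain lowercasing stands in for whatever mapping
is eventually chosen for the domain part.

```javascript
// Split an address into local part and domain at the last "@".
// Returns null if either side would be empty or there is no "@".
function splitAddress(address) {
  const at = address.lastIndexOf("@");
  if (at <= 0 || at === address.length - 1) return null;
  return { local: address.slice(0, at), domain: address.slice(at + 1) };
}

// Apply the domain-side rule (here: lowercasing only) to the right-hand
// side and leave the left-hand side exactly as the user typed it.
function normalizeDomainOnly(address) {
  const parts = splitAddress(address);
  if (parts === null) return null;
  return parts.local + "@" + parts.domain.toLowerCase();
}

console.log(normalizeDomainOnly("Martin.Duerst@EXAMPLE.JP"));
```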

Received on Wednesday, 30 March 2016 09:38:51 UTC