[Bug 18162] New: IDN email addresses should be converted to Punycode before validating them

https://www.w3.org/Bugs/Public/show_bug.cgi?id=18162

           Summary: IDN email addresses should be converted to Punycode
                    before validating them
           Product: HTML WG
           Version: unspecified
          Platform: Other
               URL: http://www.whatwg.org/specs/web-apps/current-work/#e-mail-state-(type=email)
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P3
         Component: HTML5 spec
        AssignedTo: dave.null@w3.org
        ReportedBy: contributor@whatwg.org
         QAContact: public-html-bugzilla@w3.org
                CC: ian@hixie.ch, duerst@it.aoyama.ac.jp, mike@w3.org,
                    public-html-wg-issue-tracking@w3.org,
                    public-html@w3.org, w3.org@boblet.net,
                    public-i18n-core@w3.org, mathias@qiwi.be,
                    john.david.dalton@gmail.com, derek@websiteni.com,
                    steves_list@hotmail.com


This was cloned from bug 15489 as part of operation convergence.
Originally filed: 2012-01-10 05:35:00 +0000

================================================================================
 #0   contributor@whatwg.org                          2012-01-10 05:35:49 +0000 
--------------------------------------------------------------------------------
Specification:
http://www.whatwg.org/specs/web-apps/current-work/multipage/states-of-the-type-attribute.html
Multipage: http://www.whatwg.org/C#e-mail-state-(type=email)
Complete: http://www.whatwg.org/c#e-mail-state-(type=email)

Comment:
Email addresses should be converted from Punycode to ASCII before validating
them

Posted from: 78.20.165.163 by mathias@qiwi.be
User agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.16
(KHTML, like Gecko) Chrome/18.0.1000.0 Safari/535.16
================================================================================
 #1   Mathias Bynens                                  2012-01-10 05:43:28 +0000 
--------------------------------------------------------------------------------
The spec currently says:

> A valid e-mail address is a string that matches the ABNF production
> 1*( atext / "." ) "@" ldh-str *( "." ldh-str ) where atext is defined
> in RFC 5322 section 3.2.3, and ldh-str is defined in RFC 1034 section
> 3.5. [ABNF] [RFC5322] [RFC1034]

As of revision 6884 (http://html5.org/tools/web-apps-tracker?from=6883&to=6884)
it even includes an example regular expression:

> /^[a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$/

This makes IDN email addresses like `foo@mañana.com` invalid, even though its
ASCII-encoded counterpart `foo@xn--maana-pta.com` validates.

It’s probably not a good idea to force users to enter their IDN email addresses
in Punycode format. How about defining that UAs should convert any IDN email
address input to its Punycoded ASCII equivalent before validating email
addresses (by applying this regex, for example)?
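
A minimal sketch of the convert-then-validate order proposed above. It is not
what any browser does; it assumes Python and its built-in "idna" codec (which
follows the older IDNA 2003 rules), and the helper name convert_then_validate
is hypothetical. The regular expression is the example one quoted above.

```python
import re

# Example regular expression quoted in this comment (non-normative in the spec).
EMAIL_RE = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$"
)


def convert_then_validate(address: str) -> bool:
    """Hypothetical helper: ASCII-encode the domain, then apply the regex."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        return False  # no "@" at all
    try:
        # Built-in codec; implements IDNA 2003, raises UnicodeError on bad labels.
        ascii_domain = domain.encode("idna").decode("ascii")
    except UnicodeError:
        return False
    return EMAIL_RE.fullmatch(f"{local}@{ascii_domain}") is not None


# Validating the raw strings reproduces the mismatch described above.
raw_idn = EMAIL_RE.fullmatch("foo@mañana.com") is not None
raw_ascii = EMAIL_RE.fullmatch("foo@xn--maana-pta.com") is not None

# Converting the domain first lets the IDN form pass the same check.
converted = convert_then_validate("foo@mañana.com")
print(raw_idn, raw_ascii, converted)
```

Only the domain is converted here; the local part is passed to the regular
expression unchanged.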
================================================================================
 #2   Mathias Bynens                                  2012-01-10 05:53:53 +0000 
--------------------------------------------------------------------------------
Here’s a simple test case for how current browsers implement this:
http://jsbin.com/acomah

The first input field (1): <input type=email value=foo@mañana.com>
The second input field (2): <input type=email value=foo@xn--maana-pta.com>

In Chrome 16, 1 is invalid but 2 is valid. The raw value is displayed; no
Punycode conversion is done at all.
Safari 5.1.2 and Firefox 9 have the same behavior as Chrome 16.
In Opera 11.60, 1 is invalid, 2 is valid; but as soon as you focus 1, it
becomes valid. Opera does Punycode conversion in the background; both fields
display the value as “foo@mañana.com”.

Ideally, both fields would be marked as valid, as is the case in Opera after
you focus 1.
================================================================================
 #3   Derek Johnson                                   2012-01-10 10:42:16 +0000 
--------------------------------------------------------------------------------
(In reply to comment #2)

> In Chrome 16, 1 is invalid but 2 is valid. The raw value is displayed; no
> Punycode conversion is done at all.
> Safari 5.1.2 and Firefox 9 have the same behavior as Chrome 16.
> In Opera 11.60, 1 is invalid, 2 is valid; but as soon as you focus 1, it
> becomes valid. Opera does Punycode conversion in the background; both fields
> display the value as “foo@mañana.com”.

In IE10 1 is invalid and 2 is valid. 1 displays the value as “foo@mañana.com”,
2 displays it as "foo@xn--maana-pta.com"
================================================================================
 #4   Mathias Bynens                                  2012-01-10 10:43:34 +0000 
--------------------------------------------------------------------------------
(In reply to comment #3)
> In IE10 1 is invalid and 2 is valid. 1 displays the value as “foo@mañana.com”,
> 2 displays it as "foo@xn--maana-pta.com"


So IE10pre matches Safari 5.1.2, Firefox 9 and Chrome 16.
================================================================================
 #5   Michael[tm] Smith                               2012-01-10 14:14:21 +0000 
--------------------------------------------------------------------------------
As far as I can tell, many (most?) mail clients don't recognize IDN email
addresses and don't let you enter them into their UIs (e.g., into a To field) --
in particular, Web-based mail clients (Gmail for one).

Given that, it would maybe not be helpful to enable users to enter IDN email
addresses into validated form fields in Web apps until we are at the point
where more existing mail clients that are in common use actually also enable
that.
================================================================================
 #6   Michael[tm] Smith                               2012-01-12 01:49:17 +0000 
--------------------------------------------------------------------------------
Ignore my previous comment. Ms2ger pointed out to me on IRC that the spec
actually says, "User agents may transform the values for display and editing;
in particular, user agents should convert punycode in the value to IDN in the
display and vice versa."

So the spec is already stating what you want, right? That is, that IDN email
addresses should be converted to Punycode before validating them.
================================================================================
 #7   Mathias Bynens                                  2012-01-12 06:54:28 +0000 
--------------------------------------------------------------------------------
(In reply to comment #6)
> Ignore my previous comment. Ms2ger pointed out to me on IRC that the spec
> actually says, "User agents may transform the values for display and editing;
> in particular, user agents should convert punycode in the value to IDN in the
> display and vice versa."
> 
> So the spec is already stating what you want, right? That is, that IDN email
> addresses should be converted to Punycode before validating them.

The spec only mentions “for display and editing” (nothing about validation),
and uses “may” — not “must”.
================================================================================
 #8   Michael[tm] Smith                               2012-01-12 11:27:29 +0000 
--------------------------------------------------------------------------------
(In reply to comment #7)
> The spec only mentions “for display and editing” (nothing about validation),
> and uses “may” — not “must”.

Yeah, I also realize from discussion with Hixie on IRC that the IDN conversion
applies to user input only, and not to the contents of the "value" attribute.
That is, IDN e-mail addresses in the value attribute are invalid per the spec,
intentionally. For his rationale, see
http://krijnhoetmer.nl/irc-logs/whatwg/20120112#l-312
================================================================================
 #9   Mathias Bynens                                  2012-01-12 11:51:12 +0000 
--------------------------------------------------------------------------------
(In reply to comment #8)
> Yeah, I also realize from discussion with Hixie on IRC that the IDN conversion
> applies to user input only, and not to the contents of the "value" attribute.

That would explain Opera’s behavior in the above test case; when focusing the
input field, the state changes to the “user input” state, so the email address
becomes valid.

> That is, IDN e-mail addresses in the value attribute are invalid per the spec,
> intentionally. For his rationale, see
> http://krijnhoetmer.nl/irc-logs/whatwg/20120112#l-312

> [08:08] <Hixie> what's the use case? the value in the database would be punycoded
> [08:09] <Hixie> since that's all the client will ever send to the server

Why is that? Because IDN email addresses are considered to be invalid?
================================================================================
 #10  Ian 'Hixie' Hickson                             2012-02-03 06:44:37 +0000 
--------------------------------------------------------------------------------
(In reply to comment #0)
>
> Email addresses should be converted from Punycode to ASCII before validating
> them

Assuming you mean user input, that's what the spec says to do.


(In reply to comment #1)
> The spec currently says:
> 
> > A valid e-mail address is a string that matches the ABNF production
> > 1*( atext / "." ) "@" ldh-str *( "." ldh-str ) where atext is defined
> > in RFC 5322 section 3.2.3, and ldh-str is defined in RFC 1034 section
> > 3.5. [ABNF] [RFC5322] [RFC1034]
> 
> As of revision 6884 (http://html5.org/tools/web-apps-tracker?from=6883&to=6884)
> it even includes an example regular expression:
> 
> > /^[a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$/
> 
> This makes IDN email addresses like `foo@mañana.com` invalid, even though its
> ASCII-encoded counterpart `foo@xn--maana-pta.com` validates.

Yes. Note that the regular expression is irrelevant here, it's not normative.
IDN e-mail addresses have always been invalid here. This shouldn't affect
users, since any IDN e-mail addresses they enter should get converted to ASCII
before being used as the new value (which is what is validated).


> It’s probably not a good idea to force users to enter their IDN email addresses
> in Punycode format.

Agreed. The spec doesn't ask them to.


> How about defining that UAs should convert any IDN email
> address input to its Punycoded ASCII equivalent before validating email
> addresses (by applying this regex, for example)?

That's already what the spec suggests browsers do.


(In reply to comment #9)
> 
> > [08:08] <Hixie> what's the use case? the value in the database would be punycoded
> > [08:09] <Hixie> since that's all the client will ever send to the server
> 
> Why is that?

At the wire level, e-mails are sent using punycoded addresses. IDN addresses
are only a rendering-level thing.


> Because IDN email addresses are considered to be invalid?

I'm not sure what this means. Invalid by whom, in what context?
================================================================================
 #11  Mathias Bynens                                  2012-02-03 09:30:47 +0000 
--------------------------------------------------------------------------------
So what should happen when markup like this is used:

    <input type=email value=foo@mañana.com>

Should this value be considered invalid until the user focuses the control
(i.e., until it becomes “user input”)? That seems weird.

> [08:08] <Hixie> what's the use case? the value in the database would be punycoded
> [08:09] <Hixie> since that's all the client will ever send to the server

Let’s say Page A has the following markup. After submission the input is
inserted into a database.

    <input type=text name=email>
    <!-- or even a typo, which makes it fall back to type=text… -->
    <input type=e-mail name=email>

Page B uses type=email, and reads the value from the database:

    <input type=email value=foo@mañana.com>

Alternatively, the un-Punycoded email address may already be stored in the
database for a variety of reasons.
================================================================================
 #12  Ian 'Hixie' Hickson                             2012-02-08 23:08:27 +0000 
--------------------------------------------------------------------------------
(In reply to comment #11)
> So what should happen when markup like this is used:
> 
>     <input type=email value=foo@mañana.com>
> 
> Should this value be considered invalid until the user focuses the control
> (i.e., until it becomes “user input”)?

The markup is invalid, regardless of what the user does.

The form control itself initially has an invalid state. What happens after that
is up to the user agent. A user agent could pretend that the user had changed
the value, setting the internal value to "foo@xn--maana-pta.com". Or it could
wait for the user to actually make a change to the value. Or it could never
support IDN.


> That seems weird.
> 
> > [08:08] <Hixie> what's the use case? the value in the database would be punycoded
> > [08:09] <Hixie> since that's all the client will ever send to the server
> 
> Let’s say Page A has the following markup. After submission the input is
> inserted into a database.
> 
>     <input type=text name=email>
>     <!-- or even a typo, which makes it fall back to type=text… -->
>     <input type=e-mail name=email>

Then, if the user enters an IDN address, and the server doesn't validate its
input (!), the server will be in a state where if it tries to send mail, it
will fail.


> Page B uses type=email, and reads the value from the database:
> 
>     <input type=email value=foo@mañana.com>

This means the server is non-conforming, as it outputs invalid HTML.


> Alternatively, the un-Punycoded email address may already be stored in the
> database for a variety of reasons.

Like what?
================================================================================
 #13  Mathias Bynens                                  2012-02-09 09:50:09 +0000 
--------------------------------------------------------------------------------
> The markup is invalid, regardless of what the user does.

Note to self:
http://www.whatwg.org/specs/web-apps/current-work/multipage/states-of-the-type-attribute.html#e-mail-state-(type=email)
(this was new to me)

> > Let’s say Page A has the following markup. After submission the input is
> > inserted into a database.
> > 
> >     <input type=text name=email>
> >     <!-- or even a typo, which makes it fall back to type=text… -->
> >     <input type=e-mail name=email>
> 
> Then, if the user enters an IDN address, and the server doesn't validate its
> input (!), the server will be in a state where if it tries to send mail, it
> will fail.

This assumes that the mail server / client can’t handle IDN email addresses.

> > Page B uses type=email, and reads the value from the database:
> > 
> >     <input type=email value=foo@mañana.com>
> 
> This means the server is non-conforming, as it outputs invalid HTML.

This bug is about making it conforming.

> > Alternatively, the un-Punycoded email address may already be stored in the
> > database for a variety of reasons.
> 
> Like what?

You could have imported a database (say, contact details of all your clients)
from a desktop app that allowed IDN emails.

This restriction in the spec forces web developers to implement their own
Punycode encoder on the back-end, even though browsers already have one
built-in. By lifting this restriction, authors would only need to validate the
email addresses on input in the back-end (as is the case anyway).
================================================================================
 #14  Ian 'Hixie' Hickson                             2012-02-09 19:46:54 +0000 
--------------------------------------------------------------------------------
Punycode encoders are available off-the-shelf, that's really not a big problem.

You'll need one anyway before you can send mail, since SMTP isn't IDN-aware.

IDN is only a rendering-level/UI-level feature.
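
As one illustration of an off-the-shelf encoder on the back-end, here is a
sketch that assumes Python's built-in "idna" codec (IDNA 2003 rules). The
function names email_for_value_attribute and render_email_input are
hypothetical, as is the name=email attribute borrowed from the markup quoted
earlier in this bug. It ASCII-encodes the domain of a stored address before
writing it into the value attribute.

```python
import html


def email_for_value_attribute(stored: str) -> str:
    """Hypothetical helper: ASCII-encode the domain of a stored address."""
    local, sep, domain = stored.rpartition("@")
    if not sep:
        raise ValueError("not an e-mail address: missing '@'")
    # Built-in codec; implements IDNA 2003, raises UnicodeError on bad labels.
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"{local}@{ascii_domain}"


def render_email_input(stored: str) -> str:
    """Hypothetical helper: emit the control with an ASCII-only value."""
    value = html.escape(email_for_value_attribute(stored), quote=True)
    return f'<input type="email" name="email" value="{value}">'


markup = render_email_input("foo@mañana.com")
print(markup)
```

The local part is left untouched, so this does not by itself guarantee that the
resulting value is a valid e-mail address per the spec.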
================================================================================
 #15  Norbert Lindenberg                              2012-05-14 23:26:58 +0000 
--------------------------------------------------------------------------------
I don't agree with the statement "IDN is only a rendering-level/UI-level
feature", and think that internationalized domain names should be allowed in
email addresses in the value attribute of <input> elements.

IDNA (its full name, with the "A" standing for "applications") was designed to
enable the use of full Unicode in domain names within applications, while
providing a mapping to an ASCII form for use with older protocols that aren't
IDNA-aware (e.g., DNS and SMTP).

Applications generally benefit from using the plain Unicode form of strings
wherever possible. Older protocols and file formats require a variety of
ASCII-based transformations of Unicode - e.g., the string "中国" might show up as
"xn--fiqs8s", "%E4%B8%AD%E5%9B%BD", "\u4E2D\u56FD", "&#20013;&#22269;". Keeping
these around and storing them in databases tends to cause problems - searching
and sorting don't work properly because comparison functions don't know that
"xn--fiqs8s" and "%E4%B8%AD%E5%9B%BD" mean the same, and duplicate or missing
decoding later on can lead to mojibake. To maintain sanity, applications are
better off converting text to plain Unicode when they receive it, and
converting it to the appropriate ASCII-based transformations only when passing
it on to a service that doesn't support Unicode (such as addresses for SMTP).
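
The four ASCII forms listed above can be reproduced with a short sketch. It
assumes Python's standard library only; the variable names are illustrative,
and the built-in "idna" codec follows the older IDNA 2003 rules.

```python
from urllib.parse import quote, unquote

text = "中国"

# Four ASCII-based transformations of the same two characters.
a_label = text.encode("idna").decode("ascii")        # IDNA ASCII form
percent = quote(text)                                # percent-encoded UTF-8
escapes = "".join(f"\\u{ord(c):04X}" for c in text)  # \uXXXX escapes (BMP only)
char_refs = "".join(f"&#{ord(c)};" for c in text)    # HTML numeric references

print(a_label, percent, escapes, char_refs)

# Compared as stored strings, the forms do not match each other...
forms_differ = a_label != percent

# ...but they agree once each is decoded back to plain Unicode.
decoded_equal = a_label.encode("ascii").decode("idna") == unquote(percent) == text
print(forms_differ, decoded_equal)
```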

The question here then is whether the email address in the value attribute of
the <input> element with type=email should be part of the Unicode-aware
application world, or part of the dumb ASCII-only protocol world. In a similar
situation, it's already been decided that the URLs in the href attribute of the
<a> and <link> elements, as well as the src attributes of the <script> and
<img> elements, can be IRIs and thus include internationalized domain name
labels.

I don't see why the same shouldn't be allowed for the value attribute of the
<input> element with type=email.

As a consequence, user agents then *must* convert email addresses that contain
IDN labels to the equivalent ASCII form before validating the addresses based
on their ASCII form specification.

Note also that the usage of the word "punycode" in the spec is wrong - Punycode
is just one function of several used in the conversion from a U-label to an
A-label:
http://tools.ietf.org/html/rfc5890#section-2.3.4
================================================================================
 #16  Martin D                                        2012-05-15 08:37:50 +0000 
--------------------------------------------------------------------------------
The discussion up to now seems to completely ignore the fact that Internet mail
is moving to UTF-8 throughout, including the left-hand side (LHS), and
including SMTP on the wire. See the work of the IETF EAI WG, in particular
http://tools.ietf.org/html/rfc6530, http://tools.ietf.org/html/rfc6531, and
http://tools.ietf.org/html/rfc6532.

That means that while the U-Label in www.mañana.com, when resolved as a domain
name, has to be converted at some point (as close as possible or inside the
actual resolver library) to an A-Label (punycode), an email address such as
résumés@mañana.com will go to an SMTP submission server AS SUCH, in UTF-8.

[At some point in the relay chain of course an SMTP server will have to look up
MX,... records for mañana.com, and there, a DNS packet will contain
xn--maana-pta rather than mañana, but there is no equivalent of punycode or
A-Label for the LHS whatsoever.]
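
To make the asymmetry concrete, a sketch assuming Python's built-in "idna"
codec (IDNA 2003 rules); the variable names are illustrative, and the regular
expression is the example one quoted in comment #1. Only the domain has an
ASCII form, so an address with a non-ASCII LHS still fails that ASCII-only
check after the domain is converted.

```python
import re

# Example regular expression quoted in comment #1 (non-normative in the spec).
EMAIL_RE = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$"
)

address = "résumés@mañana.com"
local, _, domain = address.rpartition("@")

# The domain has an ASCII form, which is what a DNS lookup would carry.
dns_name = domain.encode("idna").decode("ascii")

# The LHS has no such form, so it stays as it is.
half_converted = f"{local}@{dns_name}"
still_rejected = EMAIL_RE.fullmatch(half_converted) is None

print(dns_name, half_converted, still_rejected)
```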

While this will still take some time for implementation and deployment, and
this is expected to happen faster in some areas of the world than others, it
would be quite smart and helpful if HTML came up with a solution that deals
with non-ASCII in the LHS, too, and that wouldn't look totally antiquated in 5
or 10 years (or maybe even earlier; even the infamous Sendmail these days is
8-bit clean, which means that implementing EAI is rather straightforward).
================================================================================

Received on Wednesday, 18 July 2012 17:29:55 UTC