[whatwg] Comments on the definition of a valid e-mail address from Aryeh Gregor on 2009-08-24 (public-whatwg-archive@w3.org from August 2009)

From: Aryeh Gregor <Simetrical+w3c@gmail.com>
Date: Mon, 24 Aug 2009 15:24:41 -0400
Message-ID: <7c2a12e20908241224i4f104686w22d5a6d6b6f960a2@mail.gmail.com>
On Mon, Aug 24, 2009 at 10:11 AM, Tab Atkins Jr. <jackalmage at gmail.com> wrote:
> Do these still have a normal TLD identifier before the trailing dot?
> Or are they just *really* weird?

None of the addresses had more than one thing wrong with it.  These
looked like perfectly normal addresses but with a trailing dot, like
"foo at example.com.".  I assume mailers just drop the trailing dot
here.  "example.com." is generally treated the same as "example.com" by
everything except the actual DNS protocol, AFAIK -- if you resolve
"example.com" the resolver will usually *append* the dot when it
actually makes the query.

> It seems that these are indeed valid in the wild, and so the algorithm
> should be loosened to allow these.

But the RFC forbids them.  If we're even going to allow things that
sort of work but which the RFC forbids, we may as well allow almost
anything, because who knows whether it might work on some software?

> We need to see if these are actually deliverable.

I'd assume so.  In theory all of these should be deliverable.  The
ones without @ obviously aren't, but those all look to have been
confirmed back in 2006, so maybe there was a bug back then.  Addresses
with two or more consecutive dots have been confirmed as recently as
May 2009.

> What do you mean by this?  Is it just fluff that doesn't affect the
> actual routing of the mail?  If so, I'm fine with keeping them
> flagged, even if it is allowed by RFC.

I mean things like

bobsmith at example.com (use for new groups only)

If I'm reading the RFC correctly, the parenthesized part is a comment,
and is ignored (like whitespace).
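
To make the comment case concrete, here is a rough Python sketch of
stripping parenthesized comments before validating.  The function name
strip_comments is invented for this example, the address is written
with a literal "@" rather than the archive's " at ", and quoted
strings that happen to contain parentheses are deliberately not
handled, so treat it as an illustration rather than an RFC parser.

```python
def strip_comments(text: str) -> str:
    """Remove parenthesized comments, honoring nesting and backslash escapes.

    Simplification (assumption): quoted strings containing "(" or ")"
    are not treated specially.
    """
    out = []
    depth = 0  # current comment nesting level
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # A backslash quotes the next character; keep the pair only
            # when we are outside any comment.
            if depth == 0:
                out.append(text[i:i + 2])
            i += 2
            continue
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
        i += 1
    return "".join(out).strip()


print(strip_comments("bobsmith@example.com (use for new groups only)"))
```

Nested comments such as "(a (b) c)" are removed as a unit, and an
unbalanced ")" outside any comment is left in place so that a later
validation step can still flag it.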

On Mon, Aug 24, 2009 at 10:42 AM, Smylers <Smylers at stripey.com> wrote:
> For PHP I Googled "email validation Pear" and found the following as the
> top hit.  I haven't tried it, but it claims to comply to RFC822, and I'd
> have more faith in it than the average home-rolled attempt:
>
>  http://pear.php.net/package/Validate/

I stand corrected, assuming that's usable for people with only FTP
access.  (It looks like it is, at a glance, since it's seemingly pure
PHP.)  Given this, I'm not clear why there's a need to deviate from
the RFCs here.  I assume the burden on UA implementors wouldn't be all
that much.  Granted, many web developers seem not to be using these
validation libraries server-side, but I don't see how using different
standards for <input type=email> helps that.

> Yup.  If it is deliverable then surely it's an alias to the same address
> without the trailing dot, in which case a browser could choose to remove
> it.

Yes, for practical purposes "example.com." can't mean anything
different from "example.com".  (Strictly speaking they do mean
something different in DNS, but "example.com." means the same thing as
what "example.com" is normally used to mean.  Moreover, the other
meaning of "example.com" in DNS is basically nonsense for web apps
processing user-submitted e-mail addresses.  At least, as far as I
understand it; I don't know too much about DNS.)
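
As a rough illustration of the clean-up a browser could choose to do
for this case, the following Python sketch drops a single trailing dot
from the domain.  The function name strip_trailing_dot is made up for
this example, and it assumes the address is written with a literal "@".

```python
def strip_trailing_dot(address: str) -> str:
    """Drop one trailing dot from the domain part, if there is one."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        # No "@" at all: leave the input alone rather than guess.
        return address
    # Only remove a single trailing dot; "example.com.." is left
    # untouched so that validation can still reject it.
    if len(domain) > 1 and domain.endswith(".") and not domain.endswith(".."):
        domain = domain[:-1]
    return f"{local}@{domain}"


print(strip_trailing_dot("foo@example.com."))
```

Splitting on the last "@" keeps any odd characters in the local part
out of the decision; only the domain is ever modified.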

> Discussed previously.  This seems to be the problematic category.

I wouldn't rule out the existence of other problematic categories that
happen not to have cropped up on the English Wikipedia.

> If you mean the ".."s are in the local part then yes, it sounds likely
> that would get delivered, and a quick non-exhaustive trial seemed to
> show this can work.
>
> (If they're in the hostname then I'd be amazed if it's deliverable, but
> surely it'd be to the same address that's reached by replacing sequences
> of dots to a single dot.)

Agreed.  Of course, they're all in the local part.

> They don't sound deliverable, or if they are would also be with
> superfluous punctuation stripped.  And I'm not sure single cases are
> worth fretting about.  If HTML 5 validation rejected one of the above it
> seems very likely the user would be able to provide an alternative
> address (or alternatively punctuated address) which is valid.

The one with a leading dot might be legitimate.  I'd imagine the
others are errors.

> There are two categories of input which could be a working e-mail
> address yet violate the RFCs:
>
>  1 A valid e-mail address with extra 'stuff' in it or surrounding it
>    (spaces, comments, trailing punctuation characters, etc).  As you
>    suggested, browsers can clean up the user's input, so what servers
>    receive is a valid e-mail address.
>
>  2 A working e-mail address which contains something the RFCs say it
>    shouldn't but needs that in order to function; attempting to clean
>    it up would transform it to a different e-mail address, which
>    possibly delivers somewhere differently from the original.
>
> Analysis of your detailed breakdown suggests the only addresses in
> category 2 are those with dots in odd places in the local part.
>
> So it may be the only change required to allow all working real-world
> e-mail addresses is a willful violation that permits dots anywhere in
> the local part (even immediately after another . or before the @).
>
> That change would appear to cover the cases in your data, but others may
> have data which shows there are additional cases.

I might also be able to obtain more data.  I only analyzed the English
Wikipedia, not the several hundred other sites run by the Wikimedia
Foundation in >200 languages.  I'll see if I can get more info.
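
For reference, the suggested relaxation (dots anywhere in the local
part) could be sketched roughly like this in Python.  The names ATEXT,
STRICT_LOCAL, LENIENT_LOCAL and local_part_ok are invented for the
example; the character set is my reading of the RFC's "atext", and
quoted-string local parts are left out entirely.

```python
import re

# Characters allowed in an unquoted local part besides the dot
# (assumed to match the RFC's "atext").  The "-" is kept last so it is
# a literal inside a character class.
ATEXT = r"A-Za-z0-9!#$%&'*+/=?^_`{|}~-"

# Strict form: runs of atext separated by single dots, with no leading
# or trailing dot.
STRICT_LOCAL = re.compile("[" + ATEXT + "]+(?:\\.[" + ATEXT + "]+)*")

# Lenient form: the dot is treated as just another permitted character.
LENIENT_LOCAL = re.compile("[." + ATEXT + "]+")


def local_part_ok(local: str, lenient: bool) -> bool:
    """Check a local part against the strict or the lenient pattern."""
    pattern = LENIENT_LOCAL if lenient else STRICT_LOCAL
    return pattern.fullmatch(local) is not None


for candidate in ["bob.smith", "bob..smith", ".bob", "bob."]:
    print(candidate, local_part_ok(candidate, False), local_part_ok(candidate, True))
```

The only difference between the two patterns is where the dot may
appear, which is what makes the change look like a small and contained
violation.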


Anyway, as far as I can think of, there are two use cases for <input
type=email> validation:

1) To detect typos or other errors on the part of the user that will,
in practice, stop the address from working.  In this case, it would be
good to have immediate feedback so the user doesn't submit the info,
navigate away, and get confused when the site is unable to contact
them because the address is wrong.  For this purpose, we'd prefer to
call funny-looking addresses invalid even if technically they might
not be, just to be on the safe side.  However, there's no reason we
have to do more than warn the user for this use-case.

2) To help enforce uniformity.  We don't want e-mail addresses to work
in some places and not others, because that presents interoperability
problems.  For this purpose, we should outright reject bad addresses,
and should reject exactly the addresses that the RFCs prohibit (unless
de facto standards exist that are different).
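
One way to picture how the two use cases could coexist is a check with
three outcomes instead of two.  The Python sketch below is only an
illustration: the name classify, the "ok"/"warn"/"reject" labels, and
the particular notion of "funny-looking" (a single-label domain, or
local-part characters outside a conservative set) are all assumptions
of mine, and quoted local parts and address literals are ignored.

```python
import re

ATEXT = r"A-Za-z0-9!#$%&'*+/=?^_`{|}~-"
# Simplified RFC-style local part: atext runs separated by single dots.
DOT_ATOM = re.compile("[" + ATEXT + "]+(?:\\.[" + ATEXT + "]+)*")
# A hostname label: letters, digits, and interior hyphens.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")
# Conservative set of local-part characters that rarely indicate a typo.
COMMON_LOCAL = re.compile(r"[A-Za-z0-9._+-]+")


def classify(address: str) -> str:
    """Return "reject", "warn", or "ok" for a candidate address."""
    local, sep, domain = address.rpartition("@")
    # Use case (2): outright reject what the simplified grammar forbids.
    if not sep or not DOT_ATOM.fullmatch(local):
        return "reject"
    labels = domain.split(".")
    if not all(LABEL.fullmatch(label) for label in labels):
        return "reject"
    # Use case (1): allowed but unusual, so warn and let the user override.
    if len(labels) < 2 or not COMMON_LOCAL.fullmatch(local):
        return "warn"
    return "ok"


for candidate in ["foo@example.com", "foo@localhost", "foo..bar@example.com"]:
    print(candidate, classify(candidate))
```

A "warn" result would let the user confirm and submit anyway, while
"reject" is reserved for addresses the grammar does not allow.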

Encouraging authors to stop using broken JS (or pattern="") validation
will be served by figuring out what purpose the validation is supposed
to serve, and making sure that <input type=email> meets that purpose.
I think that existing client-side JS validation is meant to address
use case (1) above, and if HTML 5 addresses that, JS validation will
become unnecessary.

I still don't see any reason from an author perspective to want any
RFC-compliant address to be rejected without the option for the user
to override it.
Received on Monday, 24 August 2009 12:24:41 UTC