[whatwg] Comments on the definition of a valid e-mail address from Smylers on 2009-08-24 (public-whatwg-archive@w3.org from August 2009)

From: Smylers <Smylers@stripey.com>
Date: Mon, 24 Aug 2009 15:42:53 +0100
Message-ID: <20090824144253.GA4286@stripey.com>

Aryeh Gregor writes:

> On Mon, Aug 24, 2009 at 4:36 AM, Smylers<Smylers at stripey.com> wrote:
> 
> > It's too complicated for most developers to roll their own
> > validation, but there are standard libraries available which get it
> > right.
> 
> Standard libraries available for all major languages?

I'd be surprised if they weren't.

> As far as I can tell from a quick search, the PHP standard library
> contains no e-mail validation routines before 5.2.0

Sorry, I meant there is a "library" (meaning additional to the core
language) available in a "standard" place (wherever that language's
libraries are typically found); I wasn't intending to claim that "the"
"standard library" of functionality which is part of a language's core
distribution would include it.

For PHP I Googled "email validation Pear" and found the following as the
top hit.  I haven't tried it, but it claims to comply with RFC 822, and
I'd have more faith in it than the average home-rolled attempt:

  http://pear.php.net/package/Validate/

> > Forms on websites capturing users' e-mail addresses typically want
> > just the address part, prompting for the human-readable name in a
> > separate box, so I think HTML 5's <input type=email> not allowing
> > the above is helpful.
> 
> It might be more helpful if they stripped the part outside the angle
> brackets, but I agree that it's reasonable to just reject these.

Good point.  And that's largely a UI matter: either way the web server
doesn't receive a value with the outside clutter in it.

> The breakdown of the 202 is as follows.

Thanks for providing this.

> * Single trailing dot in domain part: 100 (prohibited by RFC but
>   plausibly deliverable)

Yup.  If it is deliverable then surely it's an alias to the same address
without the trailing dot, in which case a browser could choose to remove
it.

> * Single trailing dot in local part: 40 (prohibited by RFC but
>   plausibly deliverable)

Discussed previously.  This seems to be the problematic category.

> * Valid address in angle brackets (with other junk around it): 21
> (permitted by RFC, kind of, and plausibly deliverable)

Discussed above.

> * Multiple consecutive dots: 20 (prohibited by RFC but plausibly
>   deliverable)

If you mean the ".."s are in the local part then yes, it sounds likely
that would get delivered, and a quick non-exhaustive trial seemed to
show this can work.

(If they're in the hostname then I'd be amazed if it's deliverable, but
surely it'd be to the same address that's reached by replacing each
sequence of dots with a single dot.)
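
A minimal sketch of that hostname clean-up, in Python; the function name
is made up for illustration, and it deliberately leaves the local part
alone, since dots there may be significant:

```python
import re

def normalise_domain_dots(address: str) -> str:
    """Collapse runs of dots in the domain and drop a trailing dot.

    The local part (everything before the last "@") is left untouched.
    """
    local, sep, domain = address.rpartition("@")
    if not sep:
        return address  # no "@" at all: nothing sensible to normalise
    # "example..com" -> "example.com", then "example.com." -> "example.com"
    domain = re.sub(r"\.{2,}", ".", domain).rstrip(".")
    return f"{local}@{domain}"
```

So "user@example..com" and "user@example.com." would both come out as
"user@example.com", while "a..b@example.com" is passed through unchanged.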

> * No @: 9 (unlikely to be deliverable)

Indeed.

> * Comment: 3 (permitted by RFC and plausibly deliverable)

Equivalent to the angle bracket case above -- the address without the
comment could be extracted.

> * Miscellaneous: 9 (one containing [NO]@[SPAM], two with trailing >,
>   one in "quotes", one with single leading dot in local part, two with
>   single leading comma in local part, one with leading ": ", one with
>   leading "\")

They don't sound deliverable, or if they are, they would also be
deliverable with the superfluous punctuation stripped.  And I'm not sure
single cases are worth fretting about.  If HTML 5 validation rejected
one of the above it seems very likely the user would be able to provide
an alternative address (or alternatively punctuated address) which is
valid.

> > So it may actually be that there isn't a general problem here of
> > lots of real-world e-mail addresses which work but don't comply with
> > the RFCs; it may simply be the one case of ".@"?
> 
> No, that was just the example I chose because I knew that person
> personally, and so was able to confirm that the address actually
> worked.

There are two categories of input which could be a working e-mail
address yet violate the RFCs:

  1 A valid e-mail address with extra 'stuff' in it or surrounding it
    (spaces, comments, trailing punctuation characters, etc).  As you
    suggested, browsers can clean up the user's input, so what servers
    receive is a valid e-mail address.  

  2 A working e-mail address which contains something the RFCs say it
    shouldn't but needs that in order to function; attempting to clean
    it up would transform it to a different e-mail address, which
    possibly delivers somewhere different from the original.
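
As a rough illustration of the category 1 clean-up, here is a sketch in
Python.  The function name and the exact set of punctuation stripped are
my own assumptions rather than anything a spec or browser defines, and
it handles only simple, non-nested comments:

```python
import re

def clean_address(raw: str) -> str:
    """Strip surrounding clutter without altering the address itself."""
    text = raw.strip()
    # 'Fred <fred@example.com> junk' -> 'fred@example.com'
    match = re.search(r"<([^<>]*)>", text)
    if match:
        text = match.group(1)
    # Drop parenthesised comments: 'fred@example.com (home)'
    text = re.sub(r"\([^()]*\)", "", text)
    # Remove whitespace and stray punctuation at either end.  Dots are
    # NOT stripped, because a dot may be part of a category 2 address.
    return text.strip(" \t<>,;:\"'\\")
```

Note that "fred.@example.com" comes back untouched: removing that dot
would turn it into a different address, which is exactly the category 2
problem.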

Analysis of your detailed breakdown suggests the only addresses in
category 2 are those with dots in odd places in the local part.

So it may be the only change required to allow all working real-world
e-mail addresses is a willful violation that permits dots anywhere in
the local part (even immediately after another . or before the @).

That change would appear to cover the cases in your data, but others may
have data which shows there are additional cases.
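
To make the proposed relaxation concrete, here is a sketch in Python
comparing a strict dot-atom local part with one that permits dots
anywhere.  The names are hypothetical, the domain pattern is
deliberately simplistic, and quoted local parts are ignored; it is meant
to show the difference between the two rules, not to be a complete
validator:

```python
import re

# Characters allowed in an unquoted local part, other than the dot.
ATEXT = r"A-Za-z0-9!#$%&'*+/=?^_`{|}~-"

# Strict: dots only between non-empty runs of ATEXT characters.
LOCAL_STRICT = re.compile(rf"[{ATEXT}]+(?:\.[{ATEXT}]+)*")
# Relaxed: dots permitted anywhere, including ".." and just before "@".
LOCAL_RELAXED = re.compile(rf"[.{ATEXT}]+")
# Simplified domain: dot-separated labels of letters, digits and hyphens.
DOMAIN = re.compile(r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*")

def is_acceptable(address: str, relaxed: bool = True) -> bool:
    """Check an address against the strict or relaxed local-part rule."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        return False  # no "@": unlikely to be deliverable
    local_re = LOCAL_RELAXED if relaxed else LOCAL_STRICT
    return bool(local_re.fullmatch(local)) and bool(DOMAIN.fullmatch(domain))
```

With the relaxed rule, "fred.@example.com" and "a..b@example.com" are
accepted; with relaxed=False both are rejected, while "fred@example.com"
passes either way.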

Smylers
Received on Monday, 24 August 2009 07:42:53 UTC