[whatwg] Comments on the definition of a valid e-mail address from Aryeh Gregor on 2009-08-24 (public-whatwg-archive@w3.org from August 2009)

From: Aryeh Gregor <Simetrical+w3c@gmail.com>
Date: Sun, 23 Aug 2009 23:12:00 -0400
Message-ID: <7c2a12e20908232012g4952b70bj41071ac7ddf23fe6@mail.gmail.com>
On Sun, Aug 23, 2009 at 10:23 PM, Peter Kasting<pkasting at google.com> wrote:
> I think telling user agents to strip leading and trailing whitespace is a
> good idea.  I'm not as sure about stripping whitespace in the middle.

It seems like (some?) mail agents will do that already, if it's around
a dot or the @.  If I do echo 'test' | mail '  simetrical  @  gmail .
com  ', I get the mail just fine, with the whitespace stripped from
the To: header.
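
As a rough sketch of that behavior (my own guess at the rule, not what
any particular mail agent actually implements), the stripping could
look like this in Python.  The function name is made up for
illustration:

```python
import re


def strip_address_whitespace(raw: str) -> str:
    """Trim both ends, then drop whitespace adjacent to '.' or '@'.

    Assumption: whitespace elsewhere in the middle is left alone,
    since it is not clear that stripping it is safe.
    """
    trimmed = raw.strip()
    # Collapse any run of whitespace that touches a dot or the @ sign.
    return re.sub(r"\s*([.@])\s*", r"\1", trimmed)
```

Whitespace that is not next to a dot or the @ is deliberately kept, so
something like 'foo bar@example.com' comes back unchanged.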

> You said there were 202 rows total in this group.  How many of those 202 are
> "ones that possibly work"?

I count 9 out of the 202 as missing an @ sign.  The other 193 look
like a more or less sensible address could be extracted from them.
Note that this isn't a fair sample of what users actually entered,
since in theory any address without an @ sign should have been
rejected on the server side for the past few years (but all other
addresses were allowed).

I should also reiterate that this isn't necessarily a representative
sample.  In particular, I wouldn't be surprised if some types of
invalidity (like use of non-ASCII characters, if that even slightly
works -- I haven't tested) were common in particular
non-English-speaking subsets of the Internet.

> I ask because if it is significantly less than 202, then the failure rate
> (if we strip whitespace) is noticeably less than 0.007% of your sample.  I
> am not as firmly on the side of "never reject anything conceivably valid",
> probably because I think there's more of a chance of type=email obsoleting
> silly JS-based validators if we do it right.

I can definitely see the value in that.  On the other hand, if you're
one of the people with a weird e-mail address, it would be a pain.  I
know one of the people whose local part ends in a ., as I mentioned.
I've been in that position myself with +-addressing.  It would be
great if we had some sane standard for what e-mail addresses actually
worked, but I'm not sure it's a great idea for HTML 5 to effectively
mandate that a subset of addresses are invalid unless we can get all
the people writing e-mail-related tools to go along.  (Which we
can't.)  *Some* people are being issued and are using these invalid
addresses, whether we like it or not.

> One notable datum missing from your otherwise useful analysis is how many
> _invalid_ email addresses not allowed by the current definition would be
> allowed by this.  I suspect the number is large.  I would be willing to
> trade a tiny number (<0.007%?) of false negatives to avoid a large number of
> false positives, especially since I suspect that if the check were weakened
> this far authors would be more likely to continue with their (currently
> lousy) hand-written validators.

One problem is that apparently some addresses are effectively usable
even though all the standards say they're wrong.  As I say, one of the
addresses was like <foo.@example.com>, with a trailing . in the local
part.  It's prohibited by the RFC and the GNU "mail" utility rejects
it, but the user with that address confirmed that he used it just fine
for a long time, and he received mail that I sent to that address with
Gmail.  Someone else I talked with about it found that two mail
servers he tested supported addresses like <"quoted
string"@example.com>, but a third didn't.
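
To make the mismatch concrete, here is a minimal sketch of the
unquoted ("dot-atom") local-part syntax as I read RFC 5322.  It
deliberately ignores the quoted-string form, and the names are mine,
so treat it as an illustration rather than a full validator:

```python
import re

# Characters RFC 5322 allows in an unquoted atom ("atext").
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
# dot-atom: one or more atoms joined by single dots, no dot at either end.
DOT_ATOM = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")


def is_dot_atom(local_part: str) -> bool:
    """Return True if local_part matches the unquoted dot-atom syntax."""
    return DOT_ATOM.fullmatch(local_part) is not None
```

Under this check "foo." fails because of the trailing dot, even though
that address demonstrably received mail, while a +-address like
"simetrical+w3c" passes.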

So it looks to me like there *is* no clear distinction between what's
usable as an e-mail address and what's not, in practice.  Some stuff
that the RFCs prohibit mostly works, and some stuff that they allow
doesn't reliably work.  Given that, the only reliable way to tell
whether an e-mail address is usable in practice is to just try it.
HTML 5 can't possibly distinguish between a working address and a
non-working address if it depends on what specific mail software the
parties happen to be using.

So given that either false negatives or false positives will
necessarily occur, you either lock out some users or you permit some
gibberish.  If the only reason to be strict is to encourage authors to
drop extremely broken JS checks in favor of slightly broken in-browser
checks, that doesn't strike me as very compelling, to be honest.
(Especially since I don't think it will necessarily work.)  The only
other reason I can think of is to help users avoid typos, but that's
something that overridable warnings are suited to, not outright
prohibitions.
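
Roughly what I have in mind, as a hypothetical sketch: hard-reject
only what cannot possibly be an address (here, a missing @ or an empty
side, which is my assumption about where to draw that line), and
downgrade everything merely suspicious to a warning the user can
override.  The names and the specific warning conditions are invented
for illustration:

```python
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    WARN = "warn"      # looks odd; let the user confirm and submit anyway
    REJECT = "reject"  # cannot be an address at all


def check_address(value: str) -> Verdict:
    """Classify an address as fine, suspicious, or unusable."""
    address = value.strip()
    local, at, domain = address.rpartition("@")
    if not at or not local or not domain:
        return Verdict.REJECT
    suspicious = (
        any(ch.isspace() for ch in address)   # whitespace in the middle
        or local.startswith(".")
        or local.endswith(".")                # prohibited, yet seen working
        or ".." in local
        or "." not in domain                  # likely a typo'd domain
    )
    return Verdict.WARN if suspicious else Verdict.OK
```

With this, "foo.@example.com" only produces a warning instead of
locking the user out, while input with no @ at all is still refused.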

> I don't think this is a very valuable option because I don't
> think a UA can make good UX out of it (I speak as a member
> of the Chromium team who works on UX).

What would the problem be here from a UX perspective?  I can see
problems from other perspectives, like how this creates a whole new
category of not-quite-valid input values that would have to be
specially treated in the spec.
Received on Sunday, 23 August 2009 20:12:00 UTC