[whatwg] Comments on the definition of a valid e-mail address

On Sun, 23 Aug 2009, Aryeh Gregor wrote:
>
> Section 4.10.4.1.5 defines a valid e-mail address as follows:
> 
> "A valid e-mail address is a string that matches the production 
> dot-atom-text "@" dot-atom-text where dot-atom-text is defined in RFC 
> 5322 section 3.2.3. [RFC5322]"
> 
> This is much more restrictive than the full range of e-mail addresses 
> allowed by RFC 5322 et al.  I've been considering whether to use <input 
> type=email> in MediaWiki, and whether to change our server-side e-mail 
> address validation to match.  Historically, MediaWiki has mostly just 
> required that an @ symbol be present in the address. Originally we used 
> a simplistic regex, but when users complained, we looked into the RFCs 
> and decided it was too complicated to bother with validation beyond 
> checking for an @ sign.
> 
> So before switching us over, I decided to do some research on how many 
> users' addresses would be invalidated.  I used the database for the 
> English Wikipedia.  Over all registered users, I found 3,088,880 
> confirmed addresses, not necessarily all distinct.  ("Confirmed" here 
> means that in theory, modulo bugs, the user followed a confirmation link 
> in the e-mail they received, so the address probably works in practice.)  
> Of those, 3,255 (~0.1%) failed HTML 5 validation, as determined using 
> the following regex-based database query:
> 
> root@rosemary:enwiki> SELECT COUNT(*) FROM user WHERE
> user_email_authenticated IS NOT NULL AND user_email NOT REGEXP
> '^[-a-zA-Z0-9!#$%&\'*+/=?^_`{|}~]+(\.[-a-zA-Z0-9!#$%&\'*+/=?^_`{|}~]+)*@[-a-zA-Z0-9!#$%&\'*+/=?^_`{|}~]+(\.[-a-zA-Z0-9!#$%&\'*+/=?^_`{|}~]+)*$'
> AND user_email != '';
> +----------+
> | COUNT(*) |
> +----------+
> |     3255 |
> +----------+
> 1 row in set (16 min 10.80 sec)

Thanks for this research, this is exactly the kind of hard data that is 
most useful when writing the spec.


> (Someone please tell me if my regex doesn't match HTML 5 here.)

If we let 

   X = [-a-zA-Z0-9!#$%&\'*+/=?^_`{|}~]+

...then the regexp is:

   ^X(\.X)*@X(\.X)*$

I believe this is correct, yes.
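
As a rough illustration of the substitution above, the same pattern can be 
assembled from a single X fragment. This is only a sketch: the names X, 
DOT_ATOM_EMAIL and matches_dot_atom_email are made up here, and it uses 
Python's re module rather than MySQL's REGEXP, so edge-case behaviour may 
differ from the query.

```python
import re

# X is the run of RFC 5322 atext characters from the regex above.
X = r"[-a-zA-Z0-9!#$%&'*+/=?^_`{|}~]+"

# X(\.X)*@X(\.X)*, i.e. dot-atom-text "@" dot-atom-text.
# fullmatch() stands in for the ^ and $ anchors.
DOT_ATOM_EMAIL = re.compile(rf"{X}(\.{X})*@{X}(\.{X})*")


def matches_dot_atom_email(address: str) -> bool:
    """Return True if the whole string matches dot-atom-text "@" dot-atom-text."""
    return DOT_ATOM_EMAIL.fullmatch(address) is not None


if __name__ == "__main__":
    for candidate in ["user@example.com", "foo.@example.com", "user @example.com"]:
        print(repr(candidate), matches_dot_atom_email(candidate))
```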


> Inspection showed that the overwhelming majority of the failures were 
> due to the presence of excess whitespace, often a single trailing space, 
> or a space inserted before or after the @ sign.  When I adjusted the 
> regex to ignore those failures, I got a smaller list, 202 (about 0.007% 
> of the total): [...]
> 
> Some of these were clearly wrong, and shouldn't have been confirmed to 
> begin with.  Some even didn't have an @ sign, so probably were submitted 
> in some window when we did no validation at all (and I have no idea how 
> they got confirmed).  Of the ones that possibly work, I identified two 
> major categories:
> 
> 1) Addresses in the form "foo <bar@baz.example>", or similar.  These 
> mostly match RFC 5322's name-addr production instead of addr-spec (some 
> have trailing semicolons, or are missing the initial <, etc.). I assume 
> these were copy-pasted from a mail application.

These are intentionally not allowed, since it is expected that the name 
will be taken from elsewhere, and the e-mail address will then be pasted 
into a template along the lines of "$name <$email>".
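
A minimal sketch of that kind of template, assuming the name and the bare 
address arrive as separate form fields. ADDRESS_TEMPLATE and 
format_recipient are hypothetical names for this example only.

```python
from string import Template

# The "$name <$email>" template mentioned above.
ADDRESS_TEMPLATE = Template("$name <$email>")


def format_recipient(name: str, email: str) -> str:
    """Combine a separately collected name with a bare addr-spec."""
    return ADDRESS_TEMPLATE.substitute(name=name, email=email)


if __name__ == "__main__":
    print(format_recipient("Foo", "bar@baz.example"))
```

A real application would also need to quote or escape names containing 
special characters; Python's email.utils.formataddr is one way to do that.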


> 2) Addresses with dots in incorrect places, in either the local part
> or the domain name part.  For instance, multiple consecutive dots, or
> leading/trailing dots.  These don't match RFC 5322 at all AFAICT, but
> I asked one of the users with an invalid address of the form
> <foo.@example.com>, and he said it worked fine for him.  GNU mail gave
> a syntax error when I tried to send mail to that address, but Gmail
> sent it without complaint, and the user received it successfully.

I've changed the grammar to allow a trailing dot in the username part.


> I should also note that this was only the English Wikipedia, and it 
> might be that speakers of other languages are more prone to use other 
> types of addresses that don't meet HTML 5's specification.  When looking 
> at the Swedish and German databases, for instance, I found one or two 
> addresses that had apparently been confirmed but contained non-ASCII 
> characters.  I didn't know the users with those addresses, and I didn't 
> want to send them unsolicited mail, so I wasn't able to establish 
> whether those addresses actually worked or the confirmation was bogus.

I'll leave it as requiring ASCII for now; I expect UAs to do IDNA 
processing on the UI end for the domain side. I'm not sure what is 
supposed to happen on the username side.


> Conclusions: At a minimum, I suggest that HTML 5 require that user 
> agents strip all whitespace from e-mails, not just newlines.  Roughly 
> 0.1% of the addresses from my sample were valid except for extraneous 
> whitespace.  It's a small additional change that would cut the number of 
> illegitimately invalid addresses in my sample by a factor of more than 
> ten.

This is a UI issue -- if the user enters whitespace, the user agent is 
allowed to trim it. The form won't submit while the value contains 
whitespace, so user agents are likely to want to do this.
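
For illustration, here is roughly what the two clean-up options could look 
like before validation. Both function names are invented for this sketch, 
and neither is something the spec mandates: the first trims only the ends, 
the second removes all whitespace as suggested in the quoted text.

```python
import re


def trim_ends(value: str) -> str:
    """Remove leading and trailing whitespace only."""
    return value.strip()


def strip_all_whitespace(value: str) -> str:
    """Remove every whitespace character, including ones around the @ sign."""
    return re.sub(r"\s+", "", value)


if __name__ == "__main__":
    raw = " foo @example.com "
    print(repr(trim_ends(raw)))
    print(repr(strip_all_whitespace(raw)))
```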


> Beyond that, although it's safe to say that quoted-string or 
> domain-literal or even entirely invalid addresses are extraordinarily 
> rare, there are *some* real people who do use them.  Unless something is 
> so completely invalid that it's obviously impossible that any mail 
> server would even try to send it anywhere, you're probably going to be 
> cutting out some small number of users.

Do you have any more details on what types of addresses we need to allow?


> So why not have the spec say that in the case of e-mail addresses, the 
> browser may warn the user, but should permit them to submit the address 
> anyway?  If the user is willing to override the warning, then it's 
> likely that they personally know that the e-mail address works, e.g., 
> because they use it.

I dunno; your data had a number of "obviously wrong" e-mail addresses. I 
would expect users to just click through warnings without checking.


> Alternatively, you could just loosen the restrictions even further, and 
> only ban input that doesn't contain an @ sign.  (Or that doesn't match 
> ^[^@]+@[^@]+\.[^@]+$, or whatever.)  Or just don't ban anything at all, 
> like with type=tel.  type=email differs from most of the other types 
> with validity constraints (like month, number, etc.) in that the 
> difference between valid and invalid values is a purely pragmatic 
> question (what will actually work?) that the user can often answer 
> better than the application.  It doesn't seem like a good idea for the 
> standard to tell users that the e-mail addresses they've actually been 
> using are invalid.

I'm not quite ready to give up yet!
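
For comparison, a sketch of the much looser check suggested in the quoted 
text, ^[^@]+@[^@]+\.[^@]+$. The names LOOSE_EMAIL and matches_loose are 
hypothetical, and Python's fullmatch() stands in for the ^ and $ anchors.

```python
import re

# Exactly one "@", with at least one "." somewhere after it.
LOOSE_EMAIL = re.compile(r"[^@]+@[^@]+\.[^@]+")


def matches_loose(address: str) -> bool:
    """Return True if the string has the shape something@something.something."""
    return LOOSE_EMAIL.fullmatch(address) is not None


if __name__ == "__main__":
    for candidate in ["foo.@example.com", "no-at-sign", "a@b"]:
        print(repr(candidate), matches_loose(candidate))
```

Note that this accepts whitespace and angle brackets, so the name-addr 
style entries discussed earlier would pass it.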


On Sun, 23 Aug 2009, Aryeh Gregor wrote:

> . . . and I should add that I think it might be useful to have a note 
> recommending that application authors not do any validation beyond what 
> the spec ends up mandating as required (preferably almost nothing).  
> I've had a lot of problems with sites that think + isn't valid in e-mail 
> addresses, including pretty major sites that should know better.  You 
> don't really know if it will work anyway until you try actually sending 
> mail to it -- maybe the local part was mistyped or invented -- so why 
> not just do that?

This is basically why I want the spec to define how you check for a valid 
e-mail address -- so that the authors won't do anything more than basic 
sanity checking.


On Sun, 23 Aug 2009, Tab Atkins Jr. wrote:
> 
> Unless you avoid validating *entirely*, there's virtually always going 
> to be some subset of theoretically valid addresses that you'll flag as 
> invalid, though.

I think it's more the theoretically invalid ones (that work anyway) that 
we're worried about.


On Mon, 24 Aug 2009, Aryeh Gregor wrote:
>
> The breakdown of the 202 is as follows.
> 
> * Single trailing dot in domain part: 100 (prohibited by RFC but
> plausibly deliverable)

Raising an error on these seems ok, the user almost certainly didn't mean 
the dot and can just remove it.


> * Single trailing dot in local part: 40 (prohibited by RFC but
> plausibly deliverable)

Now allowed.


> * Valid address in angle brackets (with other junk around it): 21
> (permitted by RFC, kind of, and plausibly deliverable)

Intentionally not allowed.


> * Multiple consecutive dots: 20 (prohibited by RFC but plausibly deliverable)

I've changed the grammar to allow multiple dots in the username part.


> * No @: 9 (unlikely to be deliverable)
> * Comment: 3 (permitted by RFC and plausibly deliverable)

Intentionally not allowed.


> * Miscellaneous: 9 (one containing [NO]@[SPAM], two with trailing >,
> one in "quotes", one with single leading dot in local part, two with
> single leading comma in local part, one with leading ": ", one with
> leading "\")

All but the one with a "." are intentionally disallowed. The one with a 
leading "." is now allowed.

So I think that the spec is good now.


On Tue, 25 Aug 2009, TAMURA, Kent wrote:
> 
> http://www.whatwg.org/specs/web-apps/current-work/#e-mail-state
> > A valid e-mail address is a string that matches the production
> > dot-atom-text "@" dot-atom-text
> > where dot-atom-text is defined in RFC 5322 section 3.2.3.
> > [RFC5322]<http://www.whatwg.org/specs/web-apps/current-work/#refsRFC5322>
> 
> I'd like a stricter rule for it. e.g.
> dot-atom-text "@" 1*(ALPHA / DIGIT) 1*("." 1*(ALPHA / DIGIT))
> 
> I understand the current production, dot-atom-text "@" dot-atom-text, is 
> a subset of addr-spec of RFC 5322.  However dot-atom-text for the 
> domain-part is not practical.  The production accepts apparently 
> unusable email address like "tkent@!!!!"

I've restricted the text after the "@" to domain label syntax only.
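
A hedged sketch of how the changes described in this message might combine 
into one pattern. The exact spec production is not quoted here, so 
everything below is an assumption: the username is taken to be any 
non-empty mix of atext characters and dots, and "domain label syntax" is 
taken to mean dot-separated labels of letters, digits and hyphens that 
neither start nor end with a hyphen (length limits are ignored). The names 
LOCAL, LABEL, RELAXED_EMAIL and matches_relaxed are invented for the 
example.

```python
import re

# Assumption: username is one or more atext characters or dots, in any order,
# so leading, trailing and consecutive dots are all accepted.
LOCAL = r"[-a-zA-Z0-9!#$%&'*+/=?^_`{|}~.]+"

# Assumption: a label is letters, digits and hyphens, with no hyphen at
# either end. The 63-character label limit is not checked.
LABEL = r"[a-zA-Z0-9](?:[-a-zA-Z0-9]*[a-zA-Z0-9])?"

RELAXED_EMAIL = re.compile(rf"{LOCAL}@{LABEL}(?:\.{LABEL})*")


def matches_relaxed(address: str) -> bool:
    """Return True if the whole string matches the assumed relaxed grammar."""
    return RELAXED_EMAIL.fullmatch(address) is not None


if __name__ == "__main__":
    for candidate in ["foo.@example.com", "tkent@!!!!", "user@example.com."]:
        print(repr(candidate), matches_relaxed(candidate))
```

Under these assumptions a trailing dot in the domain is still rejected, 
which matches the decision above to raise an error for those addresses.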

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Sunday, 30 August 2009 22:53:47 UTC