- From: Ian Hickson <ian@hixie.ch>
- Date: Mon, 31 Aug 2009 05:53:47 +0000 (UTC)
On Sun, 23 Aug 2009, Aryeh Gregor wrote:
>
> Section 4.10.4.1.5 defines a valid e-mail address as follows:
>
> "A valid e-mail address is a string that matches the production
> dot-atom-text "@" dot-atom-text where dot-atom-text is defined in RFC
> 5322 section 3.2.3. [RFC5322]"
>
> This is much more restrictive than the full range of e-mail addresses
> allowed by RFC 5322 et al. I've been considering whether to use <input
> type=email> in MediaWiki, and whether to change our server-side e-mail
> address validation to match. Historically, MediaWiki has mostly just
> required that an @ symbol be present in the address. Originally we used
> a simplistic regex, but when users complained, we looked into the RFCs
> and decided it was too complicated to bother with validation beyond
> checking for an @ sign.
>
> So before switching us over, I decided to do some research on how many
> users' addresses would be invalidated. I used the database for the
> English Wikipedia. Over all registered users, I found 3,088,880
> confirmed addresses, not necessarily all distinct. ("Confirmed" here
> means that in theory, modulo bugs, the user followed a confirmation link
> in the e-mail they received, so the address probably works in practice.)
> Of those, 3,255 (~0.1%) failed HTML 5 validation, as determined using
> the following regex-based database query:
>
> root@rosemary:enwiki> SELECT COUNT(*) FROM user WHERE
> user_email_authenticated IS NOT NULL AND user_email NOT REGEXP
> '^[-a-zA-Z0-9!#$%&\'*+/=?^_`{|}~]+(\.[-a-zA-Z0-9!#$%&\'*+/=?^_`{|}~]+)*@[-a-zA-Z0-9!#$%&\'*+/=?^_`{|}~]+(\.[-a-zA-Z0-9!#$%&\'*+/=?^_`{|}~]+)*$'
> AND user_email != '';
> +----------+
> | COUNT(*) |
> +----------+
> |     3255 |
> +----------+
> 1 row in set (16 min 10.80 sec)

Thanks for this research, this is exactly the kind of hard data that is
most useful when writing the spec.

> (Someone please tell me if my regex doesn't match HTML 5 here.)

If we let

   X = [-a-zA-Z0-9!#$%&\'*+/=?^_`{|}~]+

...then the regexp is:

   ^X(\.X)*@X(\.X)*$

I believe this is correct, yes.

> Inspection showed that the overwhelming majority of the failures were
> due to the presence of excess whitespace, often a single trailing space,
> or a space inserted before or after the @ sign. When I adjusted the
> regex to ignore those failures, I got a smaller list, 202 (about 0.007%
> of the total): [...]
>
> Some of these were clearly wrong, and shouldn't have been confirmed to
> begin with. Some even didn't have an @ sign, so probably were submitted
> in some window when we did no validation at all (and I have no idea how
> they got confirmed). Of the ones that possibly work, I identified two
> major categories:
>
> 1) Addresses in the form "foo <bar@baz.example>", or similar. These
> mostly match RFC 5322's name-addr production instead of addr-spec (some
> have trailing semicolons, or are missing the initial <, etc.). I assume
> these were copy-pasted from a mail application.

These are intentionally not allowed, since it is expected that the name
will be taken from elsewhere, and the e-mail address will then be pasted
into a template along the lines of "$name <$email>".

> 2) Addresses with dots in incorrect places, in either the local part
> or the domain name part. For instance, multiple consecutive dots, or
> leading/trailing dots. These don't match RFC 5322 at all AFAICT, but
> I asked one of the users with an invalid address of the form
> <foo.@example.com>, and he said it worked fine for him. GNU mail gave
> a syntax error when I tried to send mail to that address, but Gmail
> sent it without complaint, and the user received it successfully.

I've changed the grammar to allow a trailing dot in the username part.

> I should also note that this was only the English Wikipedia, and it
> might be that speakers of other languages are more prone to use other
> types of addresses that don't meet HTML 5's specification.
> When looking at the Swedish and German databases, for instance, I
> found one or two addresses that had apparently been confirmed but
> contained non-ASCII characters. I didn't know the users with those
> addresses, and I didn't want to send them unsolicited mail, so I wasn't
> able to establish whether those addresses actually worked or the
> confirmation was bogus.

I'll leave it as requiring ASCII for now; I expect UAs to do IDNA
processing on the UI end for the domain side. I'm not sure what is
supposed to happen on the username side.

> Conclusions: At a minimum, I suggest that HTML 5 require that user
> agents strip all whitespace from e-mails, not just newlines. Roughly
> 0.1% of the addresses from my sample were valid except for extraneous
> whitespace. It's a small additional change that would cut the number of
> illegitimately invalid addresses in my sample by a factor of more than
> ten.

This is a UI issue -- if the user enters whitespace, the user agent is
allowed to trim it. It won't submit with whitespace, so user agents are
likely to want to do this.

> Beyond that, although it's safe to say that quoted-string or
> domain-literal or even entirely invalid addresses are extraordinarily
> rare, there are *some* real people who do use them. Unless something is
> so completely invalid that it's obviously impossible that any mail
> server would even try to send it anywhere, you're probably going to be
> cutting out some small number of users.

Do you have any more details on what types of addresses we need to allow?

> So why not have the spec say that in the case of e-mail addresses, the
> browser may warn the user, but should permit them to submit the address
> anyway? If the user is willing to override the warning, then it's
> likely that they personally know that the e-mail address works, e.g.,
> because they use it.

I dunno; your data had a number of "obviously wrong" e-mail addresses. I
would expect users to just click through warnings without checking.
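
As a rough illustration of the discussion so far, the following Python sketch applies the dot-atom-text "@" dot-atom-text check from the query above, with an optional whitespace-stripping step of the kind Aryeh proposes. The function and constant names are hypothetical, and Python's re module is swapped in for MySQL's REGEXP, so edge-case behaviour may differ slightly from the original query.

```python
import re

# atext characters from RFC 5322 section 3.2.3, as written in the query above.
X = r"[-a-zA-Z0-9!#$%&'*+/=?^_`{|}~]+"

# X(\.X)*@X(\.X)* ; fullmatch() plays the role of the ^ and $ anchors.
ORIGINAL_GRAMMAR = re.compile(rf"{X}(?:\.{X})*@{X}(?:\.{X})*")


def strip_whitespace(value: str) -> str:
    """Remove every whitespace character (hypothetical user-agent clean-up)."""
    return "".join(value.split())


def is_valid_original(value: str) -> bool:
    """Check a string against the dot-atom-text "@" dot-atom-text production."""
    return ORIGINAL_GRAMMAR.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["user@example.com", "user @example.com ", "foo.@example.com"]:
        cleaned = strip_whitespace(candidate)
        print(repr(candidate), is_valid_original(candidate), is_valid_original(cleaned))
```

Under this sketch an address with a stray space fails as typed but passes once stripped, while an address with a trailing dot in the local part fails either way, which corresponds to the two failure groups in the data.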

> Alternatively, you could just loosen the restrictions even further, and
> only ban input that doesn't contain an @ sign. (Or that doesn't match
> ^[^@]+@[^@]+\.[^@]+$, or whatever.) Or just don't ban anything at all,
> like with type=tel. type=email differs from most of the other types
> with validity constraints (like month, number, etc.) in that the
> difference between valid and invalid values is a purely pragmatic
> question (what will actually work?) that the user can often answer
> better than the application. It doesn't seem like a good idea for the
> standard to tell users that the e-mail addresses they've actually been
> using are invalid.

I'm not quite ready to give up yet!


On Sun, 23 Aug 2009, Aryeh Gregor wrote:
> . . . and I should add that I think it might be useful to have an note
> recommending that application authors not do any validation beyond what
> the spec ends up mandating as required (preferably almost nothing).
> I've had a lot of problems with sites that think + isn't valid in e-mail
> addresses, including pretty major sites that should know better. You
> don't really know if it will work anyway until you try actually sending
> mail to it -- maybe the local part was mistyped or invented -- so why
> not just do that?

This is basically why I want the spec to define how you check for a valid
e-mail address -- so that the authors won't do anything more than basic
sanity checking.


On Sun, 23 Aug 2009, Tab Atkins Jr. wrote:
>
> Unless you avoid validating *entirely*, there's virtually always going
> to be some subset of theoretically valid addresses that you'll flag as
> invalid, though.

I think it's more the theoretically invalid ones (that work anyway) that
we're worried about.


On Mon, 24 Aug 2009, Aryeh Gregor wrote:
>
> The breakdown of the 202 is as follows.
>
> * Single trailing dot in domain part: 100 (prohibited by RFC but
> plausibly deliverable)

Raising an error on these seems ok, the user almost certainly didn't mean
the dot and can just remove it.

> * Single trailing dot in local part: 40 (prohibited by RFC but
> plausibly deliverable)

Now allowed.

> * Valid address in angle brackets (with other junk around it): 21
> (permitted by RFC, kind of, and plausibly deliverable)

Intentionally not allowed.

> * Multiple consecutive dots: 20 (prohibited by RFC but plausibly deliverable)

I've changed the grammar to allow multiple dots in the username part.

> * No @: 9 (unlikely to be deliverable)
> * Comment: 3 (permitted by RFC and plausibly deliverable)

Intentionally not allowed.

> * Miscellaneous: 9 (one containing [NO]@[SPAM], two with trailing >,
> one in "quotes", one with single leading dot in local part, two with
> single leading comma in local part, one with leading ": ", one with
> leading "\")

All but the one with a "." are intentionally disallowed. The one with a
leading "." is now allowed.

So I think that the spec is good now.


On Tue, 25 Aug 2009, TAMURA, Kent wrote:
>
> http://www.whatwg.org/specs/web-apps/current-work/#e-mail-state
> > A valid e-mail address is a string that matches the production
> > dot-atom-text "@" dot-atom-text
> > where dot-atom-text is defined in RFC 5322 section 3.2.3.
> > [RFC5322]<http://www.whatwg.org/specs/web-apps/current-work/#refsRFC5322>
>
> I'd like stricter rule for it. e.g.
> dot-atom-text "@" 1*(ALPHA / DIGIT) 1*("." 1*(ALPHA / DIGIT))
>
> I understand the current production, dot-atom-text "@" dot-atom-text, is
> a subset of addr-spec of RFC 5322. However dot-atom-text for the
> domain-part is not practical. The production accepts apparently
> unusable email address like "tkent@!!!!"

I've restricted the text after the "@" to domain label syntax only.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Sunday, 30 August 2009 22:53:47 UTC
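
For readers who want to experiment with the outcome of this thread, here is one possible reading of the grammar after the changes described above, sketched in Python. It rests on two assumptions that the message does not spell out: that the username part becomes any non-empty run of atext characters and dots (which would cover the leading, trailing, and consecutive dots now allowed), and that "domain label syntax" means dot-separated labels of ASCII letters, digits, and hyphens that neither start nor end with a hyphen. The names are hypothetical and the spec text itself may differ.

```python
import re

# atext characters from RFC 5322 section 3.2.3 (a single character here).
ATEXT = r"[-a-zA-Z0-9!#$%&'*+/=?^_`{|}~]"

# Assumption: username part is one or more atext characters or dots, in any order.
LOCAL_PART = rf"(?:{ATEXT}|\.)+"

# Assumption: a label is letters, digits, and hyphens, with no hyphen at either end.
LABEL = r"[a-zA-Z0-9](?:[-a-zA-Z0-9]*[a-zA-Z0-9])?"
DOMAIN = rf"{LABEL}(?:\.{LABEL})*"

REVISED_GRAMMAR = re.compile(rf"{LOCAL_PART}@{DOMAIN}")


def is_valid_revised(value: str) -> bool:
    """Check a string against this sketch of the revised grammar."""
    return REVISED_GRAMMAR.fullmatch(value) is not None


if __name__ == "__main__":
    samples = [
        "foo.@example.com",       # trailing dot in username
        "foo..bar@example.com",   # consecutive dots in username
        ".foo@example.com",       # leading dot in username
        "foo@example.com.",       # trailing dot in domain
        "foo <bar@baz.example>",  # name-addr form
        "tkent@!!!!",             # non-label domain
    ]
    for sample in samples:
        print(f"{sample!r}: {is_valid_revised(sample)}")
```

In this reading the first three samples are accepted and the last three are rejected, matching the per-category decisions given in the replies above.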