- From: John C Klensin <john-ietf@jck.com>
- Date: Wed, 19 Nov 2014 05:37:40 -0500
- To: Steven Pemberton <Steven.Pemberton@cwi.nl>, www-international@w3.org
- cc: Forms WG <public-forms@w3.org>
--On Wednesday, November 19, 2014 11:00 +0100 Steven Pemberton
<Steven.Pemberton@cwi.nl> wrote:

> Dear i18n people,
>
> XForms 1.1 has a regexp that is intended to help people
> inputting an email address by warning them if it is
> syntactically incorrect.
> http://www.w3.org/TR/xforms/#dt-email
>
> We want to update it, and in particular include
> "international" email addresses. (Can't we replace
> "internationalization" with "deparochialization"?)

In this case, no, if only because, while local-script email
addresses are likely to bring significant benefits, they are
likely to increase what I'm sure you intend by "parochialism".

> The XForms regexp is based on RFC 2822 "Internet Message
> Format", http://www.ietf.org/rfc/rfc2822.txt, the latest
> version of which http://tools.ietf.org/html/rfc5322, still
> only defines ascii addresses.

Yes. You should also note that the address requirements in RFC
5321 (SMTP) are (intentionally) a little different from those in
5322. If you are trying to do a user-helpful syntax check for use
on the public Internet, you should be looking at the intersection
of the 5321 and 5322 requirements.

> "Internationalized Email Headers"
> http://tools.ietf.org/html/rfc6532 updates rfc5322, apparently
> by adding all non-ascii UTF8 characters to the set allowed to
> be used in an email "atom"
> (http://tools.ietf.org/html/rfc6532#section-3.2,
> http://tools.ietf.org/html/rfc5322#section-3.2.3)

I don't have time to carefully check the text, so will leave
responding to that to the authors of 6532 and others. I do
believe the text is clear, but have learned to not try to answer
such questions from memory. But see below.

> So as far as I can see, an internationalised email address is:
>
> address: atom-list "@" atom-list.
> atom-list: atom ( "." atom )*
> atom: C+
> C: any character in the world EXCEPT (),.:;<>@[\]

No, although it depends a bit on the level (or comprehensiveness)
at which you want to do your check.
If you want maximum precision so that, e.g., you "pass" the
minimum number of addresses that other software will then reject
as invalid, then:

(i) You need to distinguish between the local-part and the
domain-part of the address, because the rules are different.

(ii) The domain-part must be either a valid, fully-qualified
domain corresponding to the "preferred syntax" of RFC 1034/1035
or the syntax rules of RFC 5321 (they are the same unless I or
the relevant WG screwed up badly) or must be a valid IDN-style
domain name in which all non-ASCII labels are valid U-labels as
defined in RFC 5890ff. The "valid U-label" requirement goes
beyond simple syntax that can be reduced to a regular expression.

(iii) For the conventional domain part, some of the characters in
your exclusion list are allowed even if quoting is needed. And
"." is just about required.

(iv) The rules for the local-part are quite different from those
of the domain part. Independent of the comments above about
non-ASCII characters, most or all of the characters on your
exclusion list above are allowed, although several of them must
be quoted. Note that your list would exclude very common
constructions like joe.blow@example.com.

(v) Many of the combinations that are allowed represent bad
judgment. Consequently, if you are going to make syntax tests,
it would be wise to devise different checks for, e.g., creation
of an email address (where "you could do that, but it would be
stupid and might prevent your getting mail from any but the most
careful of implementations" is an appropriate answer) and systems
preparing mail for sending (where the user should be able to
provide any target email address she has been told to use by the
potential recipient).

> a) Do you agree?

No. See above.

> b) It was really hard to find this out. The internet is rife
> with people asking and getting bad answers. Please help the
> internet by being definitive.
The problem is not lack of definitiveness but two other problems:

* The system has evolved over time. Attempts to fit newer
  systems (starting with the DNS and evolving forward to these
  non-ASCII addresses) into older systems often leave rough
  edges, edge cases, and rules whose motivations may not be
  clear unless one knows the history. The separate "accept" and
  "produce" syntaxes in RFC 5322 are symptomatic of that problem.

* SMTP, in its present form and including RFC 6531, is really
  applicable only to intersystem use on the Internet and largely
  to the public Internet at that. A number of decisions have
  been made about RFCs 5322, 6532, etc. to preserve their
  "gateway-friendly" role -- i.e., compatibility with systems
  that are not the public Internet and that have different
  constraints -- and for use inside MUAs rather than only
  between systems. See RFC 6055 for a discussion of some of
  those issues as related to domain names.

regards,
john
(not speaking in any official capacity)
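Points (i) through (iv) above can be sketched as a two-stage check:
split the address into local-part and domain-part, then apply a
different rule to each. What follows is only a minimal illustration
under stated assumptions, not a definitive validator. The names
(check_address, ATEXT, DOT_ATOM, QUOTED, LDH_LABEL) are hypothetical;
the quoted-string pattern is simplified; domain literals and comments
are not handled; and non-ASCII domain labels are passed through
unchecked, since U-label validity per RFC 5890 cannot be reduced to a
regular expression.

```python
import re

# ASCII "atext" from RFC 5322 section 3.2.3, plus any non-ASCII character
# as a rough stand-in for the RFC 6532 extension (an assumption of this sketch).
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~\-\u0080-\U0010FFFF]"
DOT_ATOM = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")
# Simplified quoted-string: no control characters, backslash escapes allowed.
QUOTED = re.compile(r'"(?:[^"\\\x00-\x1f\x7f]|\\[\x20-\x7e])*"')
# Letter-digit-hyphen label: 1 to 63 characters, no leading or trailing hyphen.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def check_address(address: str) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    # Split at the LAST "@" so that a quoted local-part may itself contain "@".
    local, at, domain = address.rpartition("@")
    if not at or not local or not domain:
        return ["expected local-part@domain"]

    problems = []

    # Local-part: either a dot-atom or a (simplified) quoted-string.
    if not (DOT_ATOM.fullmatch(local) or QUOTED.fullmatch(local)):
        problems.append("local-part is neither a dot-atom nor a quoted string")

    # Domain-part: dot-separated labels, at least two of them.
    labels = domain.split(".")
    if len(labels) < 2:
        problems.append("domain is not fully qualified")
    for label in labels:
        if label.isascii():
            if not LDH_LABEL.fullmatch(label):
                problems.append(f"bad ASCII label: {label!r}")
        # Non-ASCII labels are NOT validated here; a real check would need
        # an IDNA2008 implementation to confirm each one is a valid U-label.

    return problems


if __name__ == "__main__":
    for candidate in ["joe.blow@example.com", '"joe@home"@example.com',
                      "joe..blow@example.com", "joe@localhost"]:
        print(candidate, "->", check_address(candidate) or "no problems found")
```

Returning a list of problems rather than a yes/no answer fits the
distinction in point (v): a form that creates addresses can treat any
problem as a warning to show, while a form that collects an existing
address can choose to accept it anyway.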
Received on Wednesday, 19 November 2014 10:38:12 UTC