Re: "International" email addresses

--On Wednesday, November 19, 2014 11:00 +0100 Steven Pemberton
<Steven.Pemberton@cwi.nl> wrote:

> Dear i18n people,
> 
> XForms 1.1 has a regexp that is intended to help people
> inputting an email address by warning them if it is
> syntactically incorrect.
> http://www.w3.org/TR/xforms/#dt-email
> 
> We want to update it, and in particular include
> "international" email addresses. (Can't we replace
> "internationalization" with "deparochialization"?)

In this case, no, if only because, while local-script email
addresses are likely to bring significant benefits, they are
also likely to increase what I'm sure you intend by "parochialism".

> The XForms regexp is based on RFC 2822 "Internet Message
> Format", http://www.ietf.org/rfc/rfc2822.txt, the latest
> version of which http://tools.ietf.org/html/rfc5322, still
> only defines ascii addresses.

Yes.  You should also note that the address requirements in RFC
5321 (SMTP) are (intentionally) a little different from those in
5322.  If you are trying to do a user-helpful syntax check for
use on the public Internet, you should be looking at the
intersection of the 5321 and 5322 requirements.
 
> "Internationalized Email Headers"
> http://tools.ietf.org/html/rfc6532 updates rfc5322, apparently
> by adding all non-ascii UTF8 characters to the set allowed to
> be used in an email "atom"
> (http://tools.ietf.org/html/rfc6532#section-3.2,
> http://tools.ietf.org/html/rfc5322#section-3.2.3)

I don't have time to carefully check the text, so will leave
responding to that to the authors of 6532 and others.  I do
believe the text is clear, but have learned to not try to answer
such questions from memory.  But see below.

> So as far as I can see, an internationalised email address is:
> 
>   address: atom-list "@" atom-list.
>   atom-list: atom ( "." atom )*
>   atom: C+
>   C: any character in the world EXCEPT (),.:;<>@[\]

No, although it depends a bit on the level (or
comprehensiveness) at which you want to do your check.  If you
want maximum precision so that, e.g., you "pass" the minimum
number of addresses that other software will then reject as
invalid, then:

(i) You need to distinguish between the local-part and the
domain-part of the address, because the rules are different.

(ii) The domain-part must be either a valid, fully-qualified
domain corresponding to the "preferred syntax" of RFC 1034/1035
or the syntax rules of RFC 5321 (they are the same unless I or
the relevant WG screwed up badly) or must be a valid IDN-style
domain name in which all non-ASCII labels are valid U-labels as
defined in RFC 5890ff.  The "valid U-label" requirement goes
beyond simple syntax that can be reduced to a regular expression.

(iii) For the conventional domain part, some of the characters
in your exclusion list are allowed even if quoting is needed.
And "." is just about required.

(iv) The rules for the local-part are quite different from those
of the domain part.  Independent of the comments above about
non-ASCII characters, most or all of the characters on your
exclusion list above are allowed although several of them must
be quoted.  Note that your list would exclude very common
constructions like joe.blow@example.com.

(v) Many of the combinations that are allowed represent bad
judgment.  Consequently, if you are going to make syntax tests,
it would be wise to devise different checks for, e.g., creation
of an email address (where "you could do that, but it would be
stupid and might prevent your getting mail from any but the most
careful of implementations" is an appropriate answer) and
systems preparing mail for sending (where the user should be
able to provide any target email address she has been told to
use by the potential recipient).
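
As a rough illustration of points (i) through (v), here is a minimal
Python sketch, not a definitive implementation.  The names
check_address, creating, and the four result strings are hypothetical.
It assumes the ASCII "atext" set of RFC 5322, treats any non-ASCII
character as acceptable in an unquoted local-part, does not handle
address literals such as [192.0.2.1], and does not attempt the
U-label validity test, which needs a real IDNA2008 implementation.

```python
import re

# ASCII "atext" from RFC 5322, usable unquoted in a local-part.
ATEXT = r"A-Za-z0-9!#$%&'*+/=?^_`{|}~\-"
# Assumption: any non-ASCII code point is also allowed (RFC 6532 style).
UTEXT = ATEXT + r"\u0080-\U0010FFFF"

DOT_ATOM = re.compile(rf"[{ATEXT}]+(?:\.[{ATEXT}]+)*")
DOT_ATOM_UTF8 = re.compile(rf"[{UTEXT}]+(?:\.[{UTEXT}]+)*")
# Quoted local-part, ASCII only in this sketch; backslash escapes a character.
QUOTED = re.compile(r'"(?:[\x20\x21\x23-\x5b\x5d-\x7e]|\\[\x20-\x7e])*"')

# "Preferred syntax" label: letters, digits, hyphen; no leading/trailing hyphen.
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
LDH_DOMAIN = re.compile(rf"{LABEL}(?:\.{LABEL})*")


def check_address(addr, creating=False):
    """Return 'ok', 'discouraged', 'needs-idna-check', or 'invalid'.

    creating=True models the stricter "creating a new address" case;
    the default models "sending to an address someone gave me".
    """
    # (i) Split at the LAST "@": a quoted local-part may itself contain "@".
    local, sep, domain = addr.rpartition("@")
    if not sep or not local or not domain:
        return "invalid"
    # Octet limits from RFC 5321: 64 for local-part, 255 for domain.
    if len(local.encode("utf-8")) > 64 or len(domain.encode("utf-8")) > 255:
        return "invalid"

    # (iv) Local-part: dot-atom (ASCII or UTF-8) or quoted string.
    if DOT_ATOM.fullmatch(local) or DOT_ATOM_UTF8.fullmatch(local):
        quoted = False
    elif QUOTED.fullmatch(local):
        quoted = True
    else:
        return "invalid"

    # (ii) Domain-part: ASCII names get an LDH syntax check; non-ASCII
    # names are only checked for empty labels and flagged for IDNA checking.
    if domain.isascii():
        if not LDH_DOMAIN.fullmatch(domain):
            return "invalid"
    elif "" in domain.split("."):
        return "invalid"
    else:
        return "needs-idna-check"

    # (v) Legal but unwise forms are flagged only when creating an address.
    return "discouraged" if (creating and quoted) else "ok"
```

With this sketch, check_address("joe.blow@example.com") passes, while
a quoted local-part such as "joe blow" passes for sending but is
flagged when creating=True.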

> a) Do you agree?

No.  See above.

> b) It was really hard to find this out. The internet is rife
> with people asking and getting bad answers. Please help the
> internet by being definitive.

The difficulty is not a lack of definitiveness but two other problems:

* The system has evolved over time.  Attempts to fit newer
systems (starting with the DNS and evolving forward to these
non-ASCII addresses) into older systems often leave rough
edges, edge cases, and rules whose motivations may not be clear
unless one knows the history.  The separate "accept" and
"produce" syntaxes in RFC 5322 are symptomatic of that problem.

* SMTP, in its present form and including RFC 6531, is really
applicable only to intersystem use on the Internet and largely
to the public Internet at that.  A number of decisions have been
made about RFCs 5322, 6532, etc. to preserve their
"gateway-friendly" role -- i.e., compatibility with systems that
are not the public Internet and that have different constraints
-- and for use inside MUAs rather than only between systems.
See RFC 6055 for a discussion of some of those issues as related
to domain names.

regards,
    john
   (not speaking in any official capacity)

Received on Wednesday, 19 November 2014 10:38:11 UTC