RE: "International" email addresses [I18N-ACTION-374] from Shawn Steele on 2014-11-27 (public-forms@w3.org from November 2014)

From: Shawn Steele <Shawn.Steele@microsoft.com>
Date: Thu, 27 Nov 2014 06:42:27 +0000
To: "Phillips, Addison" <addison@lab126.com>, Steven Pemberton <Steven.Pemberton@cwi.nl>
CC: "www-international@w3.org" <www-international@w3.org>, Forms WG <public-forms@w3.org>
Message-ID: <CY1PR0301MB07315820748053E50823C1A982710@CY1PR0301MB0731.namprd03.prod.outlook.>
The EAI RFCs don’t say anything about the local part making sense in Unicode, so the first part, though nonsense, is permitted.  Presumably whoever is assigning mailboxes in that domain would use wiser rules though…

Presuming that the @ sign is actually interpreted as a delimiter per the RFCs despite the unexpected combining mark, the domain part is not a valid IDN, so that would be clear.

-Shawn

From: Phillips, Addison [mailto:addison@lab126.com]
Sent: Wednesday, November 26, 2014 2:20 PM
To: Steven Pemberton
Cc: www-international@w3.org; Forms WG
Subject: RE: "International" email addresses [I18N-ACTION-374]

Hi Steven,

I understand about the desire to limit yourself to the lexical space (which is something you can reasonably address and which has the most utility for what you’re working on).

I do have some concerns about your suggested syntax. While it certainly is consistent with what RFC 6532 says, I’d be concerned that, for example, ‘atom’ can start with combining marks or consist solely of non-starting Unicode code points or other values that would be problematic. These are the sorts of problems described in [1] and [2]. That is, I’m pretty sure that the following Unicode code point sequence isn’t ever a valid email address, notwithstanding its apparent “lexical validity”:

U+0300 U+0301 U+FE0F U+0040 U+09C4 U+002E U+0063 U+006F U+006D

(that’s two combining accents, a variation selector, the @ sign, a Bengali combining vowel marker, “dot com”)

So I’d suggest that ‘atom’ at least always starts with a Unicode code point with a combining class of 0 (or possibly an unassigned code point for a given version of Unicode that might later be assigned a non-zero combining value).
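
A minimal sketch of that suggestion, assuming Python's standard `unicodedata` module. The helper names are hypothetical, and the second helper (a general-category test) is an added assumption for comparison, not something proposed in this message:

```python
import unicodedata


def starts_with_ccc_zero(atom: str) -> bool:
    """True if the atom is non-empty and its first code point has combining class 0."""
    return bool(atom) and unicodedata.combining(atom[0]) == 0


def starts_with_mark(atom: str) -> bool:
    """True if the first code point is in a Mark category (Mn, Mc or Me)."""
    return bool(atom) and unicodedata.category(atom[0]).startswith("M")


# The code point sequence quoted above.
sequence = "\u0300\u0301\uFE0F\u0040\u09C4\u002E\u0063\u006F\u006D"
local_part, _, domain = sequence.partition("@")

for atom in local_part.split(".") + domain.split("."):
    print(
        [f"U+{ord(c):04X}" for c in atom],
        "starts with ccc 0:", starts_with_ccc_zero(atom),
        "starts with a mark:", starts_with_mark(atom),
    )
```

The local part is rejected by the combining-class test because U+0300 has a non-zero combining class. U+09C4 is a nonspacing mark whose combining class is 0, so a test on combining class alone would accept the domain atom; the general-category test is one possible stricter alternative.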

Addison

[1] http://www.unicode.org/reports/tr31/

[2] http://www.w3.org/TR/charmod-norm/#unicodeNormalization


From: Steven Pemberton [mailto:Steven.Pemberton@cwi.nl]
Sent: Wednesday, November 26, 2014 1:50 PM
To: Steven Pemberton; Phillips, Addison
Cc: www-international@w3.org; Forms WG
Subject: Re: "International" email addresses [I18N-ACTION-374]

Addison, I18N group,

Many thanks for the discussion so far, and for creating an issue for this topic.

To add to the discussion, I would like to point out the several dimensions to this issue which have been exposed:

1. Syntax, Static Semantics, Dynamic Semantics

To draw an analogy with programming languages, there are several properties of an identifier that can be validated:
Things that can be checked at compile time:
   1. Syntax: Can this thing be an identifier?
   2. Static semantics: Has it been declared? (etc)
Things that can be checked at run-time:
   1. Does it have a value? (etc)

With respect to validating email addresses, there are several comparable properties:
   1. Syntax: Could this string imaginably be a valid email address (regardless of specific details, for instance for particular zones or available TLDs)?
   2. Static Semantics: Is this string an allowable email address, taking into account current rules for zones, which TLDs there are, etc.
   3. Dynamic semantics: Does the domain really exist? Does the email address really work?

There is another dimension too, that XML Schema distinguishes as "lexical space" and "value space"[1]:
   1. Lexical space: in this case, what the user thinks of, and types in, as a valid email address.
   2. Value space: in this case the email address as it might go over the wire, which may include Punycode processing.
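
A minimal sketch of the lexical/value distinction, assuming Python's built-in `idna` codec (which implements the older IDNA 2003 rules) as a stand-in for the Punycode step. The function name and the address `user@bücher.example` are hypothetical:

```python
def to_value_space(address: str) -> str:
    """Convert the domain of a typed-in address to its ASCII (A-label) form.

    Only the domain is converted; the local part is left as typed.
    """
    local_part, domain = address.rsplit("@", 1)
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"{local_part}@{ascii_domain}"


lexical = "user@bücher.example"      # what the user types
value = to_value_space(lexical)      # what might go over the wire
print(lexical, "->", value)          # user@bücher.example -> user@xn--bcher-kva.example
```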

It is noticeable that many answers across the internet to the vexing question of what is a valid international email address mix these things up in lots of interesting ways, without properly distinguishing them.

In this case, the XForms group is only interested in the Syntax of the Lexical Space. We are not interested, at the level of processing that we are now talking about, in whether it is a valid domain, if the zone parts follow the rules for that zone, or whether the email address really exists. The user may be typing in an address that represents a future address for a domain that doesn't yet exist, or for a TLD that doesn't yet exist.

As a result, I still believe that my original message was more or less right on this point: a syntactically correct email address is defined by RFC 5322 as modified by RFC 6532:

   address: atom-list "@" atom-list.
   atom-list: atom ( "." atom )*
   atom: C+
   C: any character in the world EXCEPT (),.:;<>@[\]

with the added exclusion of control characters in the list for C.
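
The grammar above can be sketched as a regular expression. This is only an illustration, assuming Python's `re` module and taking "control characters" to mean U+0000 to U+001F and U+007F to U+009F; the names `ADDRESS` and `is_lexically_valid` are hypothetical. It follows the grammar exactly as written, so it also accepts characters such as space:

```python
import re

# Characters excluded from C by the grammar, plus control characters.
EXCLUDED = "(),.:;<>@[\\]"
C = "[^" + re.escape(EXCLUDED) + r"\x00-\x1f\x7f-\x9f]"

ATOM = C + "+"                          # atom: C+
ATOM_LIST = ATOM + r"(?:\." + ATOM + ")*"  # atom-list: atom ( "." atom )*
ADDRESS = re.compile(ATOM_LIST + "@" + ATOM_LIST)


def is_lexically_valid(text: str) -> bool:
    """True if the whole string matches: atom-list "@" atom-list."""
    return ADDRESS.fullmatch(text) is not None


for candidate in ["user@example.com", "用户@例子.广告", "a..b@example.com", "a@b@c"]:
    print(repr(candidate), is_lexically_valid(candidate))
```

This checks syntax in the lexical space only: it says nothing about whether the domain exists or follows any zone's rules.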

[1] http://www.w3.org/TR/xmlschema-2/#value-space


Best wishes,

Steven Pemberton
For the Forms WG

On Thu, 20 Nov 2014 17:37:23 +0100, Phillips, Addison <addison@lab126.com> wrote:

Dear Steven and XForms,

Firstly, the WG *very much* welcomes further discussion from any and all on this list: this is how we find stuff out. (Thanks to Anne, JcK, Jungshik, and Shawn for contributions so far)

This is just a note to let you know that the Internationalization WG has taken up a discussion of this topic, which has, obviously, some interesting issues associated with it. We’re aware that, although “EAI” (email address internationalization) has been slow to mature and gain traction, there are serious efforts from vendors and in various countries to bring non-ASCII mail addresses into the mainstream.

This doesn’t play well with the current description in HTML (cited by Anne) or various other places. As Shawn and John note, a regex description of IDNA is probably impossible. At best, such a regex would be an approximation.

The Internationalization WG is creating a discussion page to capture the issues [1]. We have not had a chance to discuss the issue in greater depth yet, but the WG’s consensus is that this is an interesting problem needing further investigation and documentation. Please note that, owing to the Thanksgiving holiday in the USA, the Internationalization WG is unlikely to make much more of a response for a couple of weeks.

Regards (for I18N),

Addison

[1] https://www.w3.org/International/wiki/EAI_Address_Issues



From: Shawn Steele [mailto:Shawn.Steele@microsoft.com]
Sent: Wednesday, November 19, 2014 11:37 AM
To: Jungshik SHIN (신정식)
Cc: Anne van Kesteren; Steven Pemberton; www-international@w3.org; Forms WG
Subject: RE: "International" email addresses

Validating the IDN part is much more complicated than validating the local part, because you need to know the IDN rules.  Which means it probably isn’t just a “simple” regex.

So maybe the rule should allow Unicode in the domain part and encourage complete IDN validation as an additional step?
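
A rough sketch of such a two-step check, assuming a deliberately permissive syntax rule and Python's built-in `idna` codec for the second step. That codec implements IDNA 2003, so it is only an approximation of complete IDN validation; the pattern and function name are hypothetical:

```python
import re

# Step 1 (hypothetical rule): anything without "@" or whitespace on either side.
PERMISSIVE = re.compile(r"[^@\s]+@[^@\s]+")


def check_address(address: str) -> str:
    """Run the permissive syntax rule, then a separate IDN check on the domain."""
    if PERMISSIVE.fullmatch(address) is None:
        return "rejected: syntax"
    domain = address.rsplit("@", 1)[1]
    try:
        # Step 2: the codec raises UnicodeError for empty or over-long labels
        # and for code points its (IDNA 2003) rules prohibit.
        domain.encode("idna")
    except UnicodeError:
        return "syntax ok, domain failed IDN check"
    return "ok"


for candidate in ["user@bücher.example", "no-at-sign", "user@a..b"]:
    print(repr(candidate), "->", check_address(candidate))
```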

-Shawn

From: jshin1987@gmail.com [mailto:jshin1987@gmail.com] On Behalf Of Jungshik SHIN (신정식)
Sent: Wednesday, November 19, 2014 10:53 AM
To: Shawn Steele
Cc: Anne van Kesteren; Steven Pemberton; www-international@w3.org; Forms WG
Subject: Re: "International" email addresses

https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489 deals with it (EAI support in email form validation) although the summary is a bit misleading (it only talks about IDN).

Jungshik

On Wed, Nov 19, 2014 at 10:07 AM, Shawn Steele <Shawn.Steele@microsoft.com> wrote:
Updating that to support EAI would be good.

-----Original Message-----
From: annevankesteren@gmail.com [mailto:annevankesteren@gmail.com] On Behalf Of Anne van Kesteren
Sent: Wednesday, November 19, 2014 2:07 AM
To: Steven Pemberton
Cc: www-international@w3.org; Forms WG
Subject: Re: "International" email addresses

On Wed, Nov 19, 2014 at 11:00 AM, Steven Pemberton <Steven.Pemberton@cwi.nl> wrote:
> So as far as I can see, an internationalised email address is:
>
>  address: atom-list "@" atom-list.
>  atom-list: atom ( "." atom )*
>  atom: C+
>  C: any character in the world EXCEPT (),.:;<>@[\]
>
> a) Do you agree?
> b) It was really hard to find this out. The internet is rife with
> people asking and getting bad answers. Please help the internet by
> being definitive.

I recommend matching HTML's definition:

https://html.spec.whatwg.org/multipage/forms.html#valid-e-mail-address



--
https://annevankesteren.nl/
Received on Thursday, 27 November 2014 06:42:58 UTC