Re: "International" email addresses [I18N-ACTION-374]

I mostly agree about the 3 distinctions that Steven is drawing. It is certainly
important to distinguish between "well-formed" (syntax) and "valid" (would
actually work at runtime). However, the syntactic distinction could be
tighter than what he suggests, but looser than
https://html.spec.whatwg.org/multipage/forms.html#valid-e-mail-address,
since the latter doesn't take into account either EAI or IDNA. I'd suggest
something like the following:

email         = local-part "@" host
local-part    = 1*atext2 *("." 1*atext2)
host          = < as defined in https://url.spec.whatwg.org/#host-parsing >

atext2        = atext | utext
atext         = < as defined in http://tools.ietf.org/html/rfc5322#section-3.2.3 >
utext         = XID_Start 1*XID_Continue
XID_Start     = < as defined in http://www.unicode.org/reports/tr31 >
XID_Continue  = < as defined in http://www.unicode.org/reports/tr31 >


Additional conditions:
 atext2 must be in NFC, as defined in http://www.unicode.org/reports/tr15

Notes:
 * for local-part, see dot-atom-text in rfc5322 section 3.2.3
 * the above doesn't provide for quoted email addresses; the syntax
would have to be enhanced to allow for those.
 * the restriction to NFC is recommended in
http://tools.ietf.org/html/rfc6530#section-10.1, but not required
there. (I'd prefer NFKC over NFC.)
 * the restriction to a Unicode identifier is not in rfc6530, but
helps to prevent bizarre email addresses. However, it could be made
more lenient, e.g. to allow symbols (if you want your email address, for
example, to be an emoji).
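
To make the local-part rule concrete, here is a rough Python sketch of it. It
is only an illustration under some assumptions, not a reference
implementation: the function names are made up for this example; I am
assuming that Python's str.isidentifier() is a close enough stand-in for
XID_Start/XID_Continue (it also accepts "_" as a first character, so that is
filtered out); NFC is checked on the whole local-part for simplicity; and the
host is not checked at all, since that belongs to the URL host parser. Note
that, with utext written as XID_Start 1*XID_Continue, a lone non-ASCII letter
does not match.

```python
import unicodedata

# atext from RFC 5322 section 3.2.3: ASCII letters, digits, and these symbols.
ATEXT = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "!#$%&'*+-/=?^_`{|}~"
)


def is_xid_start(ch):
    # Assumption: str.isidentifier() tracks XID_Start/XID_Continue, but it
    # also accepts "_" as a first character, so "_" is excluded here.
    return ch != "_" and ch.isidentifier()


def is_xid_continue(ch):
    # Prefix a known XID_Start character so only ch's continue status matters.
    return ("a" + ch).isidentifier()


def is_run_of_atext2(segment):
    """True if segment splits into one or more atext / utext tokens."""
    n = len(segment)
    if n == 0:
        return False
    # reachable[i] is True when segment[:i] can be fully tokenized.
    reachable = [False] * (n + 1)
    reachable[0] = True
    for i in range(n):
        if not reachable[i]:
            continue
        # Token choice 1: a single atext character.
        if segment[i] in ATEXT:
            reachable[i + 1] = True
        # Token choice 2: utext = XID_Start followed by 1 or more XID_Continue.
        if is_xid_start(segment[i]):
            j = i + 1
            while j < n and is_xid_continue(segment[j]):
                reachable[j + 1] = True
                j += 1
    return reachable[n]


def is_wellformed_local_part(local_part):
    """Check local-part = 1*atext2 *("." 1*atext2), plus the NFC condition."""
    if unicodedata.normalize("NFC", local_part) != local_part:
        return False
    return all(is_run_of_atext2(seg) for seg in local_part.split("."))
```

With this sketch, is_wellformed_local_part("john.smith") is True, while the
leading-combining-mark sequence "\u0300\u0301\ufe0f" and the empty segment in
"a..b" are both rejected.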



Mark <https://google.com/+MarkDavis>

*— Il meglio è l’inimico del bene —*

On Wed, Nov 26, 2014 at 11:19 PM, Phillips, Addison <addison@lab126.com>
wrote:

>  Hi Steven,
>
>
>
> I understand about the desire to limit yourself to the lexical space
> (which is something you can reasonably address and which has the most
> utility for what you’re working on).
>
>
>
> I do have some concerns about your suggested syntax. While it certainly is
> consistent with what RFC6532 says, I’d be concerned that, for example,
> ‘atom’ can start with combining marks or consist solely of non-starting
> Unicode code points or other values that would be problematic. These are
> the sorts of problems described in [1] and [2]. That is, I’m pretty sure
> that the following Unicode code point sequence isn’t ever a valid email
> address, notwithstanding its apparent “lexical validity”:
>
>
>
> U+0300 U+0301 U+FE0F U+0040 U+09C4 U+002E U+0063 U+006F U+006D
>
>
>
> (that’s two combining accents, a variation selector, the @ sign, a Bengali
> combining vowel marker, “dot com”)
>
>
>
> So I’d suggest that ‘atom’ at least always starts with a Unicode code
> point with a combining class of 0 (or possibly an unassigned code point for
> a given version of Unicode that might later be assigned a non-zero
> combining value).
>
>
>
> Addison
>
>
>
> [1] http://www.unicode.org/reports/tr31/
>
> [2] http://www.w3.org/TR/charmod-norm/#unicodeNormalization
>
>
>
> *From:* Steven Pemberton [mailto:Steven.Pemberton@cwi.nl]
> *Sent:* Wednesday, November 26, 2014 1:50 PM
> *To:* Steven Pemberton; Phillips, Addison
> *Cc:* www-international@w3.org; Forms WG
> *Subject:* Re: "International" email addresses [I18N-ACTION-374]
>
>
>
> Addison, I18N group,
>
>
>
> Many thanks for the discussion so far, and for creating an issue for this
> topic.
>
>
>
> To add to the discussion, I would like to point out the several dimensions
> to this issue which have been exposed:
>
>
>
> 1. Syntax, Static Semantics, Dynamic Semantics
>
>
>
> To draw an analogy with programming languages, there are several
> properties of an identifier that can be validated:
>
> Things that can be checked at compile time:
>
>    1. Syntax: Can this thing be an identifier?
>
>    2. Static semantics: Has it been declared? (etc)
>
> Things that can be checked at run-time:
>
>    1. Does it have a value? (etc)
>
>
>
> With respect to validating email addresses, there are several comparable
> properties:
>
>    1. Syntax: Could this string imaginably be a valid email address
> (regardless of specific details for instance for particular zones, or
> available TLDs).
>
>    2. Static Semantics: Is this string an allowable email address, taking
> into account current rules for zones, which TLDs there are, etc.
>
>    3. Dynamic semantics: Does the domain really exist? Does the email
> address really work?
>
>
>
> There is another dimension too, that XML Schema distinguishes as "lexical
> space" and "value space"[1]:
>
>    1. Lexical space: in this case, what the user thinks of, and types in,
> as a valid email address.
>
>    2. Value space: in this case the email address as it might go over the
> wire, which may include Punycode processing.
>
>
>
> It is noticeable that many answers across the internet to the vexing
> question of what is a valid international email address mix these things up
> in lots of interesting ways, without properly distinguishing them.
>
>
>
> In this case, the XForms group is only interested in the Syntax of the
> Lexical Space. We are not interested, at the level of processing that we
> are now talking about, in whether it is a valid domain, if the zone parts
> follow the rules for that zone, or whether the email address really exists.
> The user may be typing in an address that represents a future address for a
> domain that doesn't yet exist, or for a TLD that doesn't yet exist.
>
>
>
> As a result, I still believe that my original message was more or less
> right on this point: a syntactically correct email address is defined by
> rfc5322 as modified by rfc6532:
>
>
>    address: atom-list "@" atom-list.
>    atom-list: atom ( "." atom )*
>    atom: C+
>    C: any character in the world EXCEPT (),.:;<>@[\]
>
>
>
> with the added exclusion of control characters in the list for C.
>
>
>
> [1] http://www.w3.org/TR/xmlschema-2/#value-space
>
>
>
> Best wishes,
>
>
>
> Steven Pemberton
>
> For the Forms WG
>
>
>
> On Thu, 20 Nov 2014 17:37:23 +0100, Phillips, Addison <addison@lab126.com>
> wrote:
>
>
>
> Dear Steven and XForms,
>
>
>
> Firstly, the WG **very much** welcomes further discussion from any and
> all on this list: this is how we find stuff out. (Thanks to Anne, JcK,
> Jungshik, and Shawn for contributions so far)
>
>
>
> This is just a note to let you know that the Internationalization WG has
> taken up a discussion of this topic, which has, obviously, some interesting
> issues associated with it. We’re aware that, although “EAI” (email address
> internationalization) has been slow to mature and gain traction, there are
> serious efforts from vendors and in various countries to bring non-ASCII
> mail addresses into the mainstream.
>
>
>
> This doesn’t play well with the current description in HTML (cited by
> Anne) or various other places. As Shawn and John note, a regex description
> of IDNA is probably impossible. At best, such a regex would be an
> approximation.
>
>
>
> The Internationalization WG is creating a discussion page to capture the
> issues [1]. We have not had a chance to discuss the issue in greater depth
> yet, but the WG’s consensus is that this is an interesting problem needing
> further investigation and documentation. Please note that, owing to the
> Thanksgiving holiday in the USA, the Internationalization WG is unlikely to
> make much more of a response for a couple of weeks.
>
>
>
> Regards (for I18N),
>
>
>
> Addison
>
>
>
> [1] https://www.w3.org/International/wiki/EAI_Address_Issues
>
>
>
>
>
> *From:* Shawn Steele [mailto:Shawn.Steele@microsoft.com]
> *Sent:* Wednesday, November 19, 2014 11:37 AM
> *To:* Jungshik SHIN (신정식)
> *Cc:* Anne van Kesteren; Steven Pemberton; www-international@w3.org;
> Forms WG
> *Subject:* RE: "International" email addresses
>
>
>
> Validating the IDN part is much more complicated than validating the local
> part, because you need to know the IDN rules.  Which means it probably
> isn’t just a “simple” regex.
>
>
>
> So maybe the rule should allow Unicode in the domain part and encourage
> complete IDN validation as an additional step?
>
>
>
> -Shawn
>
>
>
> *From:* jshin1987@gmail.com [mailto:jshin1987@gmail.com] *On Behalf Of* Jungshik SHIN (신정식)
> *Sent:* Wednesday, November 19, 2014 10:53 AM
> *To:* Shawn Steele
> *Cc:* Anne van Kesteren; Steven Pemberton; www-international@w3.org;
> Forms WG
> *Subject:* Re: "International" email addresses
>
>
>
> https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489 deals with it (EAI
> support in email form validation) although the summary is a bit misleading
> (it only talks about IDN).
>
>
>
> Jungshik
>
>
>
> On Wed, Nov 19, 2014 at 10:07 AM, Shawn Steele <Shawn.Steele@microsoft.com>
> wrote:
>
> Updating that to support EAI would be good.
>
>
> -----Original Message-----
> From: annevankesteren@gmail.com [mailto:annevankesteren@gmail.com] On
> Behalf Of Anne van Kesteren
> Sent: Wednesday, November 19, 2014 2:07 AM
> To: Steven Pemberton
> Cc: www-international@w3.org; Forms WG
> Subject: Re: "International" email addresses
>
> On Wed, Nov 19, 2014 at 11:00 AM, Steven Pemberton <
> Steven.Pemberton@cwi.nl> wrote:
> > So as far as I can see, an internationalised email address is:
> >
> >  address: atom-list "@" atom-list.
> >  atom-list: atom ( "." atom )*
> >  atom: C+
> >  C: any character in the world EXCEPT (),.:;<>@[\]
> >
> > a) Do you agree?
> > b) It was really hard to find this out. The internet is rife with
> > people asking and getting bad answers. Please help the internet by
> > being definitive.
>
> I recommend matching HTML's definition:
>
> https://html.spec.whatwg.org/multipage/forms.html#valid-e-mail-address
>
>
> --
> https://annevankesteren.nl/
>
>
>
>
>
>

Received on Thursday, 27 November 2014 07:59:20 UTC