Re: "International" email addresses [I18N-ACTION-374] from John C Klensin on 2014-11-20 (www-international@w3.org from October to December 2014)

From: John C Klensin <john+w3c@jck.com>
Date: Thu, 20 Nov 2014 16:00:42 -0500
To: Asmus Freytag <asmusf@ix.netcom.com>, "Phillips, Addison" <addison@lab126.com>, Steven Pemberton <Steven.Pemberton@cwi.nl>
cc: www-international@w3.org, Forms WG <public-forms@w3.org>
Message-ID: <5EA2585AE33DC0CC64EA3573@JcK-HP8200.jck.com>
--On Thursday, November 20, 2014 09:05 -0800 Asmus Freytag
<asmusf@ix.netcom.com> wrote:

> On 11/20/2014 8:37 AM, Phillips, Addison wrote:
>> As Shawn and John note, a regex description of IDNA is
>> probably  impossible. At best, such a regex would be an
>> approximation.
>> 
>> 
> The problem is not that it's impossible to do a rigorous
> description (essentially a regex) of the IDN rules for a given
> zone, but that the description varies along the tree, and that
> the knowledge of the rules that apply at each level is
> imperfect.
>...

Asmus,

(Reluctantly putting on four separate virtual hats here: editor
of RFC 5321 (SMTP), co-author of RFC 6055 (the IAB Domain Name
Encoding spec);  co-editor of RFC 5890, 5891, and 5894 (the
framework and definitions specification,  protocol
specification, and background and rationale document for
IDNA2008) and contributor to most of the rest of the IDNA
documents; and EAI WG co-chair and co-editor of RFC 6530 (the
overview and definitions document for the EAI work) and
contributor to most of the other email address and header
internationalization documents.)

Basically I agree with the above, but I would have said "but, in
addition..." rather than "but that".

To be less cryptic and in the hope of not having the discussion
deteriorate into the confusion that has characterized parts of
the web-email interface for the last 15 or 20 years, there are
several separate issues, and they are not just about levels.  I
know you know most, if not all, of this but, in the hope of
drawing things together:

(1) The IDNA specifications (RFCs 5890ff) provide a set of
processing rules that, together, define the protocol-level
validity of IDN labels.    There was never any intent that those
rules be completely described by syntax alone and, despite the
XML effort you describe (excerpt quoted below), I don't believe
a complete syntax-based description is possible.  In addition,
while the rules are intended to be constant across versions of
Unicode, the list of permitted code points changes with Unicode
versions and, because per-version exception processing in the
IETF is allowed for, the relationship is expected to be mostly
deterministic but may not be entirely so.   "Protocol-level
validity" effectively determines only the set of labels (and
code points, but the two are not the same) that cannot be used
in any zone.  They do not determine what labels _can_ be used in
a zone.  Per-zone criteria are required to specify only labels
that IDNA2008 allows on a protocol basis, but zones are expected
to control their own subsets and label repertoire.  That
expectation is an explicit requirement of IDNA2008.   

IDNA2008 also requires some very specific validity checks of any
applications doing DNS lookups.  It would be, IMO, unwise to not
follow and enforce those requirements.  But, as discussed in
more detail below, there are many possible labels, and even more
fully-qualified domain names, that are valid under IDNA2008 but
not valid in practice (whether actually registered or not).

(2) At the TLD level (entries in the root zone), ICANN processes
control the valid names.  The rules are expected to be very
conservative to, among other things, eliminate any plausible
chance of confusion among names.  At present, the decisions
about what is permitted and what is not are controlled by two
separate processes --one for ccTLDs and one for new gTLDs-- that
use different methods and criteria.  A new system has been
developed for, at least, new gTLDs (I do not believe it is yet
clear how it will apply to country-based TLDs), but it is
not applicable to existing TLDs or applications now in the
system and has therefore not been tested in practice.

(3) At the second level (names appearing within TLDs), policies
differ by zone, with a range from "no IDN labels" to "almost any
IDN label allowed by IDNA2008" with "only specific characters
drawn from specific scripts or languages" lying somewhere in
between, with some zones applying additional restrictions
prohibiting specific strings and types of strings (just as has
been the case for non-IDN labels).  The long-standing ICANN policy
has been that each zone is allowed to develop its own rules
although they are invited (sometimes expected) to report the
characters they allow to IANA.  As I understand it, the new
ideas represented by:

> As I mentioned, there's an effort underway to define an XML
> format that allows one to capture any known descriptions in
> (essentially) a regex-like format expressed in XML that can be
> parsed and evaluated by a common engine.

are of importance to those second-level tables because they should
be a considerable improvement over simple lists of code points.
However, anyone expecting to use that format should understand
that, unless ICANN makes major changes in policy, using the
mechanism will require that one identify the TLD, identify the
second-level rule set associated with that TLD (if there is one),
and then apply that rule set.    I note in passing that the
ccTLD community has very strongly rejected ICANN's ability to
require that they submit such tables and even more strongly
rejected ICANN's authority to tell them what the registration
rules should be.   Also, if a zone prohibits particular labels
for moral, religious, aesthetic, political, or other reasons, it
is not clear whether any regex-like algorithm will be of
significant use in recording those rules (which also tend to be
moderately volatile).
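
The three steps just described (identify the TLD, find its
second-level rule set if one exists, apply it) can be sketched
roughly as follows.  This is only an illustration under stated
assumptions: the table contents, the names HYPOTHETICAL_TABLES and
check_second_level, and the three-valued result are all invented
here and do not reflect the draft XML format or any real registry.

```python
from typing import Callable, Dict, Optional

# A rule set decides whether one second-level label is acceptable in a zone.
RuleSet = Callable[[str], bool]


def _letters_only(label: str) -> bool:
    # Invented rule for illustration: non-empty, ASCII a-z plus three letters.
    return bool(label) and all("a" <= ch <= "z" or ch in "äöü" for ch in label)


# Hypothetical tables keyed by TLD; real ones would be loaded from
# whatever each zone actually publishes, if it publishes anything.
HYPOTHETICAL_TABLES: Dict[str, RuleSet] = {"example": _letters_only}


def check_second_level(domain: str, tables: Dict[str, RuleSet]) -> Optional[bool]:
    """Return True/False when a rule set exists, None when nothing is known."""
    labels = domain.rstrip(".").split(".")
    if len(labels) < 2:
        return None
    # ASCII TLDs are assumed in this sketch, so lower() is safe here.
    tld, second_level = labels[-1].lower(), labels[-2]
    rule = tables.get(tld)
    if rule is None:
        return None  # no table published: do not guess
    return rule(second_level)
```

Returning None rather than False when no table is known keeps the
caller from treating "no published rules" as "invalid".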

> If/when IANA's registry gets converted to this format, you
> should be able to do IDN validation, down to the second level
> at least, to any level of desired accuracy by querying the
> correct tables (or able to build approximate regexes with
> known degrees of accuracy - because you could then test them
> against any published full specifications).
 
> Anyway, you find a draft here:
> https://datatracker.ietf.org/doc/draft-davies-idntables/

Indeed, subject to the comments above.  Note that, because there
is no requirement for uniform policies among TLDs, for
second-level domains one may well have to deal with at least a
thousand or two separate sets of rules in the near future.

(4) At the third level and below, labels are essentially open
season, constrained only by IDNA2008 and whatever sense of
propriety and user protection exists for a particular zone.
Even if individual zones were to publish their rules, we are
talking about many millions of zones, each potentially with its
own rules.  An SLD zone could try to restrict the names its
delegated zones use by contract, but such strategies have not
proven successful in the past, especially for subdomains of
those subdomains and below, and one could even argue that the
DNS was designed to make enforcement of such rules difficult.

So, it is possible to make syntax-based rules that will tell you
what domains (or labels) are clearly invalid.   One can push
those rules further by investing additional effort and
processing time in identifying a larger population of invalid
labels as invalid.    The one thing that, IMO, one should be
careful about is to not adopt rules, or extrapolate rules and
restrictions from one domain to another that would identify
perfectly valid and registered domain names as invalid.   We
have had that problem fairly extensively already due, for
example, to browsers, forms, or other software at the web-email
boundary deciding that such characters as "/", "+", and even "."
are invalid in email addresses.  Doing so makes legitimate
addresses inaccessible and causes a great deal of unhappiness
among users and those who use the web, especially those who
treat email addresses as personal identifiers.
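
A minimal sketch of the "reject only what is clearly invalid"
approach is below.  It assumes hostname-style domains as used in
email addresses and checks only structural limits that apply in
every zone (empty labels, leading or trailing hyphens, the 63-octet
label limit, the 253-character name limit).  The function name is
hypothetical, Python's built-in "punycode" codec is used only to
estimate the A-label length, and none of this implements the full
IDNA2008 checks.

```python
def clearly_invalid(domain: str) -> bool:
    """Return True only for names that no zone could validly contain."""
    name = domain[:-1] if domain.endswith(".") else domain
    if not name:
        return True
    total = 0
    for label in name.split("."):
        if not label:
            return True  # empty label, e.g. "a..b"
        if label.startswith("-") or label.endswith("-"):
            return True  # not allowed for hostnames or IDN labels
        if label.isascii():
            a_label = label
        else:
            try:
                # Approximate the A-label form to measure its length.
                a_label = "xn--" + label.encode("punycode").decode("ascii")
            except UnicodeError:
                return True
        if len(a_label) > 63:
            return True
        total += len(a_label) + 1
    # Subtract the separator counted after the final label.
    return total - 1 > 253
```

Anything this function does not flag is left alone; it makes no
claim that the name is registered or permitted by a given zone.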

As those who are concerned about making domains containing IDN
labels more consistently accessible are fond of pointing out
(somewhat unfairly given the history), we used to have a common
practice of applications "knowing" all the TLD names or at least
the rules by which TLD names were formed.  When new TLDs, and
TLDs not conforming to those historical naming rules, were
introduced, they were inaccessible from those applications until
they were upgraded (sometimes years or longer).  That is
generally a bad situation.

It is perhaps even worth pointing out that national and cultural
sensitivity about IDNs runs extremely deep.  If a browser vendor,
or user of a form system, wanted to experiment with whether a
particular country could be provoked far enough to make use or
importation of a particular browser, product, or web site
illegal, making it impossible to use valid domains that the
country considered culturally or strategically important would
probably make a good test case for such an experiment.

There are tradeoffs about load on the root servers, but I have
some sympathy for enforcement of the IDNA2008 rules only and
otherwise following the principle that Shawn Steele advocates:

--On Thursday, November 20, 2014 19:38 +0000 Shawn Steele
<Shawn.Steele@microsoft.com> wrote:

> Personally I don't much see the point.  If it resolves
> it's valid.  If it doesn't then most apps could care less
> if it's well formed.

All of the above is strictly about the domain part of an email
address.  As I and many others have noted, the issues with the
local part are entirely different and should really be discussed
separately.   But I do feel a need to comment on one suggestion:


--On Thursday, November 20, 2014 17:54 +0100 Mark Davis ☕️
<mark@macchiato.com> wrote:

> The only change requiring a bit of work is the local-part. For
> that, I tend to agree with Anne that the EAI spec is overly
> broad (for compatibility's sake), and that the HTML spec can
> be somewhat tighter.

That "broadness" of the EAI specs is due to two things: a need
to be consistent with SMTP and a need to reflect actual
practices in email address usage.  I note in particular that
SMTP requires that upper and lower case local parts be treated
as distinct, even in ASCII.  Equivalencing or aliasing of
strings that differ only by case (or in other ways) is
explicitly permitted and mail server operators have been advised
for decades to not have such strings identify separate mailboxes
unless they have very specific reasons to do so.  Applied to the
EAI environment, that rule saves a world of pain (pain we have
experienced with IDNs and continue to experience) because the
decision as to whether one string is the upper or lower case
equivalent of another is determined entirely in the context of
the server supporting the mailbox -- sending or intermediate
systems are not allowed to assume the equivalence.  Those
supposedly "over broad" rules therefore protect us from
culturally-unpleasant arguments about the various case folding
edge cases.   
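
As a rough illustration of what that rule means for software that
is not the final delivery server, the sketch below compares two
addresses the way a sender or intermediate system might: the local
part is compared exactly, and only ASCII letters in the domain are
case-folded.  The function names are hypothetical, and the ASCII-only
folding of the domain is a simplifying assumption of this sketch,
not a statement of what any specification requires.

```python
from typing import Tuple


def _ascii_lower(text: str) -> str:
    # Fold only ASCII letters; leave every other character untouched.
    return "".join(ch.lower() if ch.isascii() else ch for ch in text)


def split_address(address: str) -> Tuple[str, str]:
    """Split at the last "@" so quoted local parts containing "@" survive."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("not of the form local-part@domain")
    return local, domain


def sender_may_treat_as_same(a: str, b: str) -> bool:
    """True only if a non-delivering system may assume the same mailbox."""
    local_a, domain_a = split_address(a)
    local_b, domain_b = split_address(b)
    # The local part is opaque: no case mapping, no normalization.
    return local_a == local_b and _ascii_lower(domain_a) == _ascii_lower(domain_b)
```

Whether "John" and "john" reach the same mailbox is then left to
the server that hosts it, which is the only party that knows.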

Please don't break that by deciding to impose "tighter" rules.
Also understand that, if HTML or the web-email interface makes
up its own set of rules, you folks will be taking responsibility
for telling the owners and users of email systems what email
local parts they can use and, given the number of environments
in which personal names are used as part of local parts, what
names they can have or give their children.  I wouldn't want to
go there.  YMMD.

best,
    john
Received on Thursday, 20 November 2014 21:01:20 UTC