- From: John C Klensin <john+w3c@jck.com>
- Date: Thu, 20 Nov 2014 16:00:42 -0500
- To: Asmus Freytag <asmusf@ix.netcom.com>, "Phillips, Addison" <addison@lab126.com>, Steven Pemberton <Steven.Pemberton@cwi.nl>
- cc: www-international@w3.org, Forms WG <public-forms@w3.org>
--On Thursday, November 20, 2014 09:05 -0800 Asmus Freytag <asmusf@ix.netcom.com> wrote:

> On 11/20/2014 8:37 AM, Phillips, Addison wrote:
>> As Shawn and John note, a regex description of IDNA is
>> probably impossible. At best, such a regex would be an
>> approximation.
>>
>>
> The problem is not that it's impossible to do a rigorous
> description (essentially a regex) of the IDN rules for a given
> zone, but that the description varies along the tree, and that
> the knowledge of the rules that apply at each level is
> imperfect.
>...

Asmus,

(Reluctantly putting on four separate virtual hats here: editor of RFC 5321 (SMTP); co-author of RFC 6055 (the IAB Domain Name Encoding spec); co-editor of RFCs 5890, 5891, and 5894 (the framework and definitions specification, the protocol specification, and the background and rationale document for IDNA2008) and contributor to most of the rest of the IDNA documents; and EAI WG co-chair, co-editor of RFC 6530 (the overview and definitions document for the EAI work), and contributor to most of the other email address and header internationalization documents.)

Basically I agree with the above, but I would have said "but, in addition..." rather than "but that". To be less cryptic, and in the hope of not having the discussion deteriorate into the confusion that has characterized parts of the web-email interface for the last 15 or 20 years: there are several separate issues, and they are not just about levels. I know you know most, if not all, of this but, in the hope of drawing things together:

(1) The IDNA specifications (RFCs 5890ff) provide a set of processing rules that, together, define the protocol-level validity of IDN labels. There was never any intent that those rules be completely described by syntax alone and, despite the XML effort you describe (excerpt quoted below), I don't believe a complete syntax-based description is possible.
In addition, while the rules are intended to be constant across versions of Unicode, the list of permitted code points changes with Unicode versions and, because per-version exception processing in the IETF is allowed for, the relationship is expected to be mostly deterministic but may not be entirely so.

"Protocol-level validity" effectively determines only the set of labels (and code points, but the two are not the same) that cannot be used in any zone. It does not determine what labels _can_ be used in a zone. Per-zone criteria are required to specify only labels that IDNA2008 allows on a protocol basis, but zones are expected to control their own subsets and label repertoire. That expectation is an explicit requirement of IDNA2008.

IDNA2008 also requires some very specific validity checks of any applications doing DNS lookups. It would be, IMO, unwise to not follow and enforce those requirements. But, as discussed in more detail below, there are many possible labels, and even more fully-qualified domain names, that are valid under IDNA2008 but not valid in practice (whether actually registered or not).

(2) At the TLD level (entries in the root zone), ICANN processes control the valid names. The rules are expected to be very conservative to, among other things, eliminate any plausible chance of confusion among names. At present, the decisions about what is permitted and what is not are controlled by two separate processes --one for ccTLDs and one for new gTLDs-- that use different methods and criteria. A new system has been developed for, at least, new gTLDs (I do not believe it is yet clear how it will apply to country-based TLDs), but it is not applicable to existing TLDs or applications now in the system and has therefore not been tested in practice.
(3) At the second level (names appearing within TLDs), policies differ by zone. They range from "no IDN labels" to "almost any IDN label allowed by IDNA2008", with "only specific characters drawn from specific scripts or languages" lying somewhere in between. Some zones apply additional restrictions prohibiting specific strings and types of strings (just as has been the case for non-IDN labels). The long-time ICANN policy has been that each zone is allowed to develop its own rules, although zones are invited (sometimes expected) to report the characters they allow to IANA. As I understand it, the new ideas represented by:

> As I mentioned, there's an effort underway to define an XML
> format that allows one to capture any known descriptions in
> (essentially) a regex-like format expressed in XML that can be
> parsed and evaluated by a common engine.

are of importance to those second-level tables because the format should be a considerable improvement over simple lists of code points. However, anyone expecting to use that format should understand that, unless ICANN makes major changes in policy, using the mechanism will require that one identify the TLD, identify the second-level rule set associated with that TLD (if there is one), and then apply that rule set. I note in passing that the ccTLD community has very strongly rejected ICANN's ability to require that they submit such tables and even more strongly rejected ICANN's authority to tell them what the registration rules should be. Also, if a zone prohibits particular labels for moral, religious, aesthetic, political, or other reasons, it is not clear whether any regex-like algorithm will be of significant use in recording those rules (which also tend to be moderately volatile).
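
As a rough illustration of the three steps under (3) (identify the TLD, find its second-level rule set if one exists, then apply it), a minimal Python sketch might look like the following. Everything named here is hypothetical: the `RULES_BY_TLD` table, the `latin_and_german_letters` rule, and the `example` TLD entry are invented for illustration and do not correspond to any published registry table. The sketch also assumes labels arrive already normalized as lower-case U-labels and that IDNA2008 protocol-level checks happen elsewhere. Note that a missing rule set yields "unknown" rather than "invalid".

```python
from typing import Callable, Dict, Optional

# A rule set is modelled as a predicate over one second-level label.
RuleSet = Callable[[str], bool]


def latin_and_german_letters(label: str) -> bool:
    """Hypothetical zone policy: ASCII letters, digits, hyphen, and a few
    German letters. Invented for illustration only."""
    allowed = set("abcdefghijklmnopqrstuvwxyz0123456789-äöüß")
    return all(ch in allowed for ch in label)


# Hypothetical registry of per-TLD second-level rule sets.
RULES_BY_TLD: Dict[str, RuleSet] = {
    "example": latin_and_german_letters,
}


def check_second_level(domain: str) -> Optional[bool]:
    """Return True/False if the TLD has a known rule set, None if unknown.

    None means "no basis for rejecting this name", not "invalid".
    """
    labels = domain.rstrip(".").split(".")
    if len(labels) < 2:
        return None  # no second-level label to check
    tld, second_level = labels[-1], labels[-2]   # step 1: identify the TLD
    rule = RULES_BY_TLD.get(tld)                 # step 2: find its rule set
    if rule is None:
        return None
    return rule(second_level)                    # step 3: apply it
```

Returning `None` for TLDs without a known table is a deliberate choice in this sketch: it keeps a caller from treating the absence of a published rule set as grounds for rejecting a name.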
> If/when IANA's registry gets converted to this format, you
> should be able to do IDN validation, down to the second level
> at least, to any level of desired accuracy by querying the
> correct tables (or able to build approximate regexes with
> known degrees of accuracy - because you could then test them
> against any published full specifications).
> Anyway, you find a draft here:
> https://datatracker.ietf.org/doc/draft-davies-idntables/

Indeed, subject to the comments above. Note that, because there is no requirement for uniform policies among TLDs, for second-level domains one may well have to deal with at least a thousand or two separate sets of rules in the near future.

(4) At the third level and below, labels are essentially open season, constrained only by IDNA2008 and whatever sense of propriety and user protection exists for a particular zone. Even if individual zones were to publish their rules, we are talking about many millions of zones, each potentially with its own rules. An SLD zone could try to restrict the names its delegated zones use by contract, but such strategies have not proven successful in the past, especially for subdomains of those subdomains and below, and one could even argue that the DNS was designed to make enforcement of such rules difficult.

So, it is possible to make syntax-based rules that will tell you what domains (or labels) are clearly invalid. One can push those rules further by investing additional effort and processing time in identifying a larger population of invalid labels as invalid. The one thing that, IMO, one should be careful about is to not adopt rules, or extrapolate rules and restrictions from one domain to another, that would identify perfectly valid and registered domain names as invalid. We have had that problem fairly extensively already due, for example, to browsers, forms, or other software at the web-email boundary deciding that such characters as "/", "+", and even "." are invalid in email addresses.
Doing so makes legitimate addresses inaccessible and causes a great deal of unhappiness among users and those who use the web, especially those who treat email addresses as personal identifiers.

As those who are concerned about making domains containing IDN labels more consistently accessible are fond of pointing out (somewhat unfairly given the history), we used to have a common practice of applications "knowing" all the TLD names or at least the rules by which TLD names were formed. When new TLDs, and TLDs not conforming to those historical naming rules, were introduced, they were inaccessible from those applications until the applications were upgraded (sometimes years or longer). That is generally a bad situation.

It is perhaps even worth pointing out that national and cultural sensitivities about IDNs run extremely deep. If a browser vendor, or user of a form system, wanted to experiment with whether a particular country could be provoked far enough to make use or importation of a particular browser, product, or web site illegal, making it impossible to use valid domains that the country considered culturally or strategically important would probably make a good test case for such an experiment.

There are tradeoffs about load on the root servers, but I have some sympathy for enforcement of the IDNA2008 rules only and otherwise following the principle that Shawn Steele advocates:

--On Thursday, November 20, 2014 19:38 +0000 Shawn Steele <Shawn.Steele@microsoft.com> wrote:

> Personally I don't much see the point. If it resolves
> it's valid. If it doesn't then most apps could care less
> if it's well formed.

All of the above is strictly about the domain part of an email address. As I and many others have noted, the issues with the local part are entirely different and should really be discussed separately.
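
To make the domain/local-part separation concrete, here is a minimal Python sketch of a form-side helper that splits an address and leaves the local part untouched. The function name `split_address` is hypothetical. The sketch assumes only that the domain is whatever follows the last "@" (a domain name cannot itself contain "@", while a quoted local part can). It does not attempt to validate either half, and in particular it does not reject "/", "+", or "." in the local part.

```python
from typing import Tuple


def split_address(address: str) -> Tuple[str, str]:
    """Split an address into (local part, domain) at the last "@".

    The local part is returned exactly as given: no character filtering
    and no case changes. Any checking of the domain is left to the caller.
    """
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not of the form local-part@domain: {address!r}")
    return local, domain


# Local parts containing "+", "/", and "." pass through unchanged.
print(split_address("user+tag@example.com"))
print(split_address("dept/group.name@example.org"))
```

Any domain-side check (for example, the IDNA2008 lookup checks discussed above) would then be applied to the second element only.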
But I do feel a need to comment on one suggestion:

--On Thursday, November 20, 2014 17:54 +0100 Mark Davis ☕️ <mark@macchiato.com> wrote:

> The only change requiring a bit of work is the local-part. For
> that, I tend to agree with Anne that the EAI spec is overly
> broad (for compatibility's sake), and that the HTML spec can
> be somewhat tighter.

That "broadness" of the EAI specs is due to two things: a need to be consistent with SMTP and a need to reflect actual practices in email address usage. I note in particular that SMTP requires that upper and lower case local parts be treated as distinct, even in ASCII. Equivalencing or aliasing of strings that differ only by case (or in other ways) is explicitly permitted, and mail server operators have been advised for decades to not have such strings identify separate mailboxes unless they have very specific reasons to do so. Applied to the EAI environment, that rule saves a world of pain (pain we have experienced with IDNs and continue to experience) because the decision as to whether one string is the upper or lower case equivalent of another is determined entirely in the context of the server supporting the mailbox -- sending or intermediate systems are not allowed to assume the equivalence. Those supposedly "over broad" rules therefore protect us from culturally-unpleasant arguments about the various case folding edge cases. Please don't break that by deciding to impose "tighter" rules.

Also understand that, if HTML or the web-email interface makes up its own set of rules, you folks will be taking responsibility for telling the owners and users of email systems what email local parts they can use and, given the number of environments in which personal names are used as part of local parts, what names they can have or give their children. I wouldn't want to go there. YMMD.

best,
john
Received on Thursday, 20 November 2014 21:01:20 UTC