Forwarded message 1
All,
these are my comments on the Unicode Consortium's Draft UTR#36 document,
Revision 1.16 (2005/05/09). They are posted here since this document is on
the review radar of the i18n-core wg and apparently I'm not allowed to
post to public-i18n-core@w3.org.
All in all it is a good document (modulo certain recommendations, IMHO,
I'll address that later), but structure is sometimes not respected: For
instance, though there's a whole section on IDNs (2.1), IDN issues keep
popping up throughout the rest of the doc. I am sorry if the following list
is somewhat of a mixture of core issues and editorial nits:
Section 1, 4th paragraph: "; and according to what you see it is". Is
part of the sentence missing there?
Section 1, 8th paragraph: "While some browsers prevent this spoof by
lowercasing domain names, but others don't". I am not a native speaker,
but I guess it should be "domain names, others don't".
Section 2.1, 2nd paragraph: It's not actually about IDNs so it shouldn't
be placed here. Maybe directly under Section 2.
Section 2.1, 3rd paragraph: "using a process called compatibility
normalization (NFKC)". I guess that a direct reference to RFC 3491
(Nameprep) would be better placed here, since Nameprep = NFKC + a little
bit of something else.
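
To illustrate what that "little bit of something else" amounts to, here is
a minimal Python sketch (my own illustration, not taken from the draft). It
assumes the nameprep() helper from Python's standard encodings.idna module,
which implements RFC 3491; both sample strings are made up for the example:

```python
import unicodedata
from encodings.idna import nameprep

# Made-up sample 1: "DE" written with fullwidth capital letters.
fullwidth = "\uff24\uff25"
# Made-up sample 2: "denic" with a soft hyphen (U+00AD) in the middle.
soft_hyphen = "de\u00adnic"

# Plain NFKC applies the compatibility mapping but keeps the case.
nfkc_fullwidth = unicodedata.normalize("NFKC", fullwidth)
# Nameprep additionally case-folds the label.
prep_fullwidth = nameprep(fullwidth)

# Plain NFKC leaves the soft hyphen in place.
nfkc_soft = unicodedata.normalize("NFKC", soft_hyphen)
# Nameprep maps the soft hyphen to nothing.
prep_soft = nameprep(soft_hyphen)

print(repr(nfkc_fullwidth), repr(prep_fullwidth))
print(repr(nfkc_soft), repr(prep_soft))
```

Nameprep also rejects prohibited characters and checks bidi constraints,
which this sketch does not exercise.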
Section 2.1, 4th paragraph: ", while the IDNA column shows the IDNA format
used to represent the string internally in International Domain Names".
First, the term IDNA is here introduced for the first time without further
explanation. Second, the column is actually called "IDN Internal", which
is an unfortunate name; I was expecting the term ACE ("ASCII Compatible
Encoding") to appear somewhere here. The term "International Domain Names"
is somewhat unfortunate as well (all domain names are an international good
;-); the correct term is "Internationalized". My proposal for this whole
sentence is thus: ", while the ACE ("ASCII Compatible Encoding") column
shows the result of applying the ToASCII() operation (cf RFC 3490) to the
original IDN, which is the way this IDN is stored and queried in the DNS".
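
For readers who have never seen an ACE form, a small Python sketch may help.
It is only an illustration under assumptions: it uses Python's built-in
"idna" codec (an IDNA 2003 implementation) as a stand-in for ToASCII() and
ToUnicode(), and the domain name is an example of mine, not one from the
draft's table:

```python
# Example IDN chosen for illustration only: "bücher.de".
idn = "b\u00fccher.de"

# The codec applies ToASCII() to each label; the result is what would be
# stored and queried in the DNS.
ace = idn.encode("idna")

# The reverse direction (ToUnicode) recovers the displayable form.
displayed = ace.decode("idna")

print(ace)
print(displayed)
```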
Section 2.1, 7th paragraph: "The IDN processing also removes case
distinctions by performing a case folding to reduce characters to a
lowercase form. [...] That means that we can focus on just the lowercase
characters". While I don't know whether it will be relevant for the
conclusion "we can focus on just lowercase", two remarks need to be made:
* First, the IDNA operation ToASCII() will map to lowercase iff the label
contains some non-ASCII character. Thus ToASCII("DENIC.DE") = "DENIC.DE",
because the name is all ASCII: IDN processing has left the string unchanged.
* Second, domain names are case insensitive, but RFC 1034 and 1035, as
clarified by
http://www.ietf.org/internet-drafts/draft-ietf-dnsext-insensitive-05.txt,
introduce the concept of case preservation. To put it plainly: if I query
the DNS for "WWW.DENIC.DE", and the DNS contains information for
"www.denic.de", I will get exactly that information delivered, but the
answer will be titled "WWW.DENIC.DE".
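
Both remarks can be checked with a few lines of Python. This is a sketch
under assumptions: Python's built-in "idna" codec stands in for ToASCII(),
the non-ASCII name is a made-up example, and dns_names_equal() is a
hypothetical helper that only mimics the case-insensitive comparison a DNS
server performs:

```python
def to_ascii(name: str) -> bytes:
    """Apply ToASCII() label by label via Python's built-in idna codec."""
    return name.encode("idna")

def dns_names_equal(a: bytes, b: bytes) -> bool:
    """Hypothetical helper: DNS compares ASCII names case-insensitively."""
    return a.lower() == b.lower()

# All-ASCII name: left untouched, upper case included.
ascii_only = to_ascii("DENIC.DE")

# Name with a non-ASCII label (made-up example): that label is lowercased
# by Nameprep before it is encoded, so both spellings give the same result.
mixed_upper = to_ascii("B\u00dcCHER.DE")
mixed_lower = to_ascii("b\u00fccher.DE")

print(ascii_only)
print(mixed_upper == mixed_lower)
print(dns_names_equal(b"WWW.DENIC.DE", b"www.denic.de"))
```

Case preservation in the answer itself can only be observed against a real
name server, so it is not shown here.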
Section 2.1, 9th paragraph: "two domain names would need to be
registered". It's a little bit unclear what is meant: Why would that be
needed? By whom should they be registered? Since this is not a technical
issue, I think this note is best left to the recommendations for the user
(where it can already be found: 2.10.1.B).
Section 2.1, 9th paragraph: The word "registry" appears for the first time
without further introduction. For somebody unfamiliar with domain names
and the ICANN terminology, it may be unclear. I'd drop the sentence
anyway, because the statement "a registry may want to pay attention
to this" is more confusing than clarifying.
Section 2.1, 10th paragraph: s/international domain
names/internationalized domain names/
Section 2.1, 10th paragraph: "the registry can easily determine if a
proposed registration conflicts". I'd gently drop the qualifier "easily":
given an input label of 63 characters (maximal length of a domain name
label), each of which could be the source of an entry in the "confusables"
table, and assuming that there is only ever a single target for the same
input (is that always the case?), the potential 2^63 lookups in the
registration database, to be done in real time in order to work out a
possible conflict, require more computing power than most of the world's
domain registries can afford today.
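
A back-of-the-envelope sketch in Python of where that number comes from.
Everything in it is an assumption for illustration: the two-entry
CONFUSABLES table is made up (the real table is far larger), each character
is assumed to have exactly one confusable target, and the lookup rate of one
million per second is an arbitrary guess:

```python
from itertools import product

# Made-up confusables table: each character has exactly one target.
CONFUSABLES = {"1": "l", "l": "1"}

def variants(label, table=CONFUSABLES):
    """Return every spelling obtained by swapping confusable characters."""
    choices = [(ch, table[ch]) if ch in table else (ch,) for ch in label]
    return ["".join(combo) for combo in product(*choices)]

# A 3-character label where every character is confusable: 2^3 spellings.
small = variants("111")
print(len(small), small)

# Worst case for a 63-character label: 2^63 spellings to look up.
worst_case = 2 ** 63
LOOKUPS_PER_SECOND = 1_000_000  # assumed database throughput
years = worst_case / LOOKUPS_PER_SECOND / (365 * 24 * 3600)
print(worst_case, round(years))
```

A smarter implementation could normalize every label to a single "skeleton"
and do one lookup instead, but that only works if the single-target
assumption above actually holds.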
Section 2.1, 11th paragraph: I'd add a fourth bullet "Due to the
decentralized nature of DNS, registries do not control subdomains being
established beyond the domain name registered". This fact is relevant.
Together with problems like the one described in RFC 1535 (and God knows
how many more to come), this issue could open the door to new kinds of scams.
Section 2.5, 1st example: "to pretend to be a subdomain in" is not
correct. Better: "to pretend to be a URL under the domain"
Section 2.5, 1st paragraph after the example: "are disallowed by
StringPrep". Stringprep (no capital P) is introduced for the first time
without explaining in which way it is relevant to the IDNA standard. I'd
actually like to stick to a reference to Nameprep (as introduced before),
which -although just a profile of Stringprep- is directly relevant to
domain names.
Section 2.5, last but one paragraph: "to always visually distinguish the
second-level domain". That's a common gotcha: some registries actually
register at the third-level (greetings to my nominet.org.uk colleagues
from here :-), and there's no rule that forbids a TLD to register at the
fourth, fifth... level: you just can't carve the second level in stone.
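
To show what a user agent would need instead of a fixed "second level"
rule, here is a minimal Python sketch. It is hypothetical throughout: the
REGISTRATION_SUFFIXES set is a tiny hand-written sample, not an
authoritative list, and registrable_domain() is a name I made up:

```python
# Hand-written sample of suffixes under which registrations take place.
# Purely illustrative; a real list would have to come from the registries.
REGISTRATION_SUFFIXES = {"de", "com", "org.uk", "co.uk"}

def registrable_domain(name, suffixes=REGISTRATION_SUFFIXES):
    """Return the registered part of a name, or None if it is unknown."""
    labels = name.lower().rstrip(".").split(".")
    # Try the longest candidate suffix first, then shorter ones.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                return None  # the name is itself a suffix, nothing registered
            return ".".join(labels[i - 1:])
    return None

print(registrable_domain("www.denic.de"))        # registered at second level
print(registrable_domain("www.nominet.org.uk"))  # registered at third level
```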
Section 2.8: This section is itself very difficult to understand for a
non-native speaker. But since I didn't get it, I can't make any suggestion
for improvement. There are a lot of pronouns ("this", "both", ...) for
which I can't unambiguously find the referent.
Section 2.9: The security levels are a good idea, the names are
problematic though. I wouldn't like to claim that my registry assigns
domain names at Unicode's "security level minimal", though it's supposed
to be the second highest in the rank :-). Further: what is the "minimal +"
or "moderate +" supposed to mean? Please clarify.
Section 2.9, 1st paragraph after the security levels: "characters outside
of XID_Continue". This can't be understood by non-insiders. Please
clarify.
Section 2.9, 2nd paragraph after the security levels: This is probably
well meant, but I wonder whether these suggestions wouldn't be best left to
usability experts.
Section 2.10: The recommendations are too domain-centric, I would have
expected to see recommendations for identifiers here.
Section 2.10.1, point A: s/browsers/browsers, mail clients and software in
general/
Section 2.10.1, point B: "Use the same IP address for both". This
recommendation is based on the belief that a registered domain name always
has an IP address (and promulgates the idea that the Internet is the web),
but that's not always the case: it could be a domain with only MX records
(for mail exchange), it could even be a domain which is blocked at the
registry (and thus can't be found in DNS). But even if all domains had an IP
address and a webserver running, I find this a bad recommendation: maybe
I'd like my, let's call them, whole-script confusable domains to point to
another website with a different message from the original one.
Section labelled "General Programmer Recommendations": incorrectly
numbered as 2.10.1. Correct following sections, too.
Section 2.10.2, point B.3: "display the domain name with a visually
highlighted domain name". Unintelligible.
Section 2.10.2, point C.1: "excluding the TLD". Please, don't carve in
stone that TLDs won't contain characters beyond ASCII in the future.
Section 2.10.2, point D.2: "If the domain has a whole-script confusable,
verify that both point to the same IP address". While displacing this
requirement from the registry to the user agent would be an improvement
towards leveraging the end-to-end design principle of the Internet, how
should that be practically performed? The client calculating 2^63 label
permutations and afterwards issuing that amount of DNS queries? Not
practicable. Also consider the previous comments on 2.10.1.B. Please drop
this.
Section 2.10.3: Strange. The "User recommendations" in section 2.10.1 give
the impression that this document is encouraging the user (here: domain
name registrant) to take responsibility for the protection of their
trademark rights/IPR/security of their domains/etc. I would embrace that.
And so did the previous version 2 of UTR#36. But suddenly this new draft
takes a turn that is inconsistent with itself and includes these new points
B.2 and B.3. Frankly: I don't think it's the task of a domain registry to
check whether certain domain names belong to the same registrants. Rules
which recommend that the domains "111.com" and "lll.com" (and "11l.com",
and "1l1.com", etc.) should belong to the same person weren't
followed in the ASCII times and are not destined to succeed with the
advent of IDN. More input from the TLD registry community would be needed
here.
My 0.02 Euros.
Marcos Sanz
DENIC eG