Re: Rough cut on "Domain Name" part of IRI trilogy from Martin J. Dürst on 2009-11-05 (public-iri@w3.org from November 2009)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Thu, 05 Nov 2009 19:52:30 +0900
To: Larry Masinter <masinter@adobe.com>
CC: "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Message-ID: <4AF2AE6E.6010307@it.aoyama.ac.jp>
Hello Larry,

On 2009/11/05 8:19, Larry Masinter wrote:
> I'd wanted to get this out before the ID cut-off, but I'd
> gone around a few too many times, and want to start over.

Many thanks for all your work. It is very useful to see how this would 
work out. I'll simply go through the outline and try to cover as many 
issues as I can come up with.


> I've picked out the parts of the IRI document that talk
> about ireg-name and handling of host names, and made
> an outline of what I think could be moved into a separate
> document.

I think the idea of having a document that could be reviewed easily by 
the IDNA community is very valuable. But having looked over the outline 
below, I also have several doubts.

> ========================================================
> The document parts would say:
>
> * Applicability
> * Syntax (what's allowed)
> * Processing (what to do with what you have)
> * Translation (convert to a reg-name for use in a URI)
> * Reconversion (convert a URL reg-name into an ireg-name)?
> * Comparison (how to compare the ireg-names)
>
> These parts would be used by the "main" document which
> would have the same components.

Having two (or three?) documents with almost identical structure seems 
to make things easy for certain reviewers, but needlessly difficult for 
actual implementers and users. Also, I think the stuff you extracted 
contains both protocol-like parts and simple recommendations, so I'm not 
sure BCP (as currently suggested in the Charter) would actually be 
appropriate.

I'm also worried that the wording for LEIRIs and HTML5 references will 
get quite complicated, because they will have to change some grammar 
productions that are spread across multiple documents.

> ======================================================
> Introduction
>
>    This document describes syntax, processing, and
>    comparison of the "ireg-name" component of IRIs.
>    It is a separate document to focus discussion
>    and coordination.

Do we know other examples in the IETF where specs were split up along 
similar lines?

> =============
> Applicability
>
>    These methods only apply to ireg-name parts of IRIs.
>    Domain Names may appear in parts of an IRI other
>    than the ireg-name part.  It is the responsibility
>    of scheme-specific implementations to apply the
>    necessary conversion if needed otherwise.

Nit: remove "otherwise".

>    For example if the Internationalized Domain Name
>    is part of 'iquery' component of a HTTP URI, the
>    interpretation of the domain name is up to the
>    server, e.g., trying to validate the Web page at
>    http://r&#xE9;sum&#xE9;.example.org
>    would lead to an IRI of
>     http://validator.w3.org/check?uri=http%3A%2F%2Fr&#xE9;sum&#xE9;.
>     example.org, which would convert to a URI of
>     http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.
>     example.org.  In this case, the server-side
>     implementation is responsible for making the
>     necessary conversions to be able to retrieve the Web
>     page.
>
> =======
> Syntax:
>
>    Currently, the IRI draft contains the following definition of
>    ireg-name:
>
>     ireg-name      = *( iunreserved / sub-delims )
>     sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
>                    / "*" / "+" / "," / ";" / "="
>     iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
>     ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
>                    / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
>                    / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
>                    / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
>                    / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
>                    / %xD0000-DFFFD / %xE1000-EFFFD
>
>    This doesn't seem right to me -- why would we allow
>    sub-delims in domain names?

This is taken over from reg-name in RFC 3986. I think the main reason 
for this is to not be overly restrictive. The DNS itself allows any 
bytes, and there may be (currently or in the future) special URI/IRI 
schemes that want or need to make use of e.g. a $ character in a domain 
name.
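
To make the consequence of the grammar concrete, here is a minimal 
sketch of a checker for the ireg-name production exactly as quoted 
above. Python is used only for illustration, and the names is_ucschar 
and is_ireg_name are made up for this example. As quoted, the production 
accepts sub-delims such as "$" but rejects "%", because pct-encoded is 
not part of it.

```python
import string

# Character sets taken from the ABNF quoted above.
SUB_DELIMS = set("!$&'()*+,;=")
ASCII_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# ucschar: three BMP ranges, planes 1 to 13 up to xFFFD, and E1000-EFFFD.
UCSCHAR_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    + [(plane << 16, (plane << 16) + 0xFFFD) for plane in range(1, 14)]
    + [(0xE1000, 0xEFFFD)]
)


def is_ucschar(ch):
    """True if the character falls in one of the ucschar ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)


def is_ireg_name(text):
    """True if every character is iunreserved or sub-delims (hypothetical helper)."""
    return all(
        ch in ASCII_UNRESERVED or ch in SUB_DELIMS or is_ucschar(ch)
        for ch in text
    )
```

With this sketch, "a$b.example" passes while "r%C3%A9sum%C3%A9" does not, 
which is the gap the "What about %xx" question below points at.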

>    What about %xx

Personally, I very much hope that this will go back in, for many good 
reasons others have already given (I can summarize if needed).

> ==========
> Processing:
>     IRI processors may process Unicode strings directly.
>
>     As an example, the restrictions of [RFC3490] on bidirectional
>     domain names correspond to treating each label of a domain name as
>     a component for schemes with ireg-name as a domain name.

Having to split the bidi stuff across different documents doesn't look 
attractive, because the bidi community definitely should look at all of 
it, and implementers and users benefit from having it all in one place. 
But this is not a dealbreaker; we can simply point to bidi 
considerations elsewhere from this document.

>     (give advice about how to invoke gethostname?)

For this, please see 
http://tools.ietf.org/html/draft-iab-idn-encoding-00 for some details on 
issues.

>     (handling of percent-encoded things that aren't allowed)

Does that refer to Unicode character strings that are not valid IDNs 
(because one or more labels are not valid U-Labels)? Or does it refer to 
something else?

Please note that if anywhere, this is the place where we have to deal 
with the issues of IDNA 2003 vs. IDNA 2008 and with mapping (case, 
NF(K)C,...). One question here is how and to what extent we can or 
should use Unicode TR 46.

> =====
> Converting an ireg-name of IRI to a host name:
>
>     Schemes that allow non-ASCII based characters in the reg-name (ireg-
>     name) position MUST convert the ireg-name component of an IRI as
>     follows:
>
>     Replace the ireg-name part of the IRI by the part converted using the
>     ToASCII operation specified in Section 4.1 of [RFC3490] on each dot-
>     separated label, and by using U+002E (FULL STOP) as a label
>     separator, with the flag UseSTD3ASCIIRules set to FALSE, and with the
>     flag AllowUnassigned set to FALSE.  The ToASCII operation may fail,
>     but this would mean that the IRI cannot be resolved.  In such cases,
>     if the domain name conversion fails, then the entire IRI conversion
>     fails.  Processors that have no mechanism for signalling a failure
>     MAY instead substitute an otherwise invalid host name, although such
>     processing SHOULD be avoided.

This is very IDNA 2003-specific. We need to think about how to update 
this for IDNA 2008.
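
For reference, the IDNA 2003 behaviour quoted above can be approximated 
with Python's built-in "idna" codec, which implements the RFC 3490 
operations. This is only a sketch under assumptions: the function name 
ireg_name_to_reg_name is invented here, the codec does not take 
UseSTD3ASCIIRules or AllowUnassigned as parameters, and it also accepts 
label separators other than U+002E, so it is not an exact rendering of 
the quoted rule. It knows nothing about IDNA 2008.

```python
def ireg_name_to_reg_name(ireg_name):
    """Return the ASCII form of a host name, or None if ToASCII fails.

    Uses Python's built-in IDNA 2003 codec. Returning None stands in for
    "the entire IRI conversion fails" in the quoted text.
    """
    try:
        return ireg_name.encode("idna").decode("ascii")
    except UnicodeError:
        return None
```

For example, ireg_name_to_reg_name("résumé.example.org") gives 
"xn--rsum-bpad.example.org", and a name with a label longer than 63 
octets gives None.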

>        ((DESIGN QUESTION: What about e.g.
>        http://r%C3%A9sum%C3%A9.example.org in an IRI?  Will that get
>        converted to punycode, or not?))

We have to distinguish conversion from IRI to URI and resolution. IRI 
resolution in RFC 3987 was *defined* via conversion to URIs. That's why 
these two issues easily get mixed up. But there was no need to 
*implement* IRI resolution via conversion to URIs, and so actual 
implementations in browsers may do things a bit differently. [I'm not 
speaking about differences that lead to accepting stuff that's not a 
valid URI/IRI, or that lead to a different resolution, just about a 
different, more direct implementation.]

Simple conversion may also often happen in environments where few 
resources are available. That's why uniform conversion (just use 
UTF-8 and %HH) should be an option, also because it's better in the long 
term. On the other hand, resolution often happens with scheme-specific 
logic and knowledge about naming systems etc. available. So it may be 
helpful for implementers to distinguish conversion and resolution more 
clearly.
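
A rough sketch of that distinction, in Python and under stated 
assumptions: convert_uniform and resolve_host are hypothetical names, 
and the resolution side uses Python's IDNA 2003 codec simply because it 
is built in, not because it is the right answer for IDNA 2008. 
Conversion needs only UTF-8 and %HH, while the step to punycode is 
deferred to resolution.

```python
from urllib.parse import unquote


def convert_uniform(ireg_name):
    """Conversion: percent-encode the UTF-8 bytes of each non-ASCII character."""
    return "".join(
        ch if ord(ch) < 128
        else "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        for ch in ireg_name
    )


def resolve_host(reg_name):
    """Resolution: undo %HH, then apply ToASCII (IDNA 2003 via Python's codec)."""
    return unquote(reg_name).encode("idna").decode("ascii")
```

Under these assumptions, "résumé.example.org" converts to 
"r%C3%A9sum%C3%A9.example.org", and resolving either form yields 
"xn--rsum-bpad.example.org". That is one possible answer to the design 
question above, not the only one.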

>     Various IRI schemes may allow the usage of Internationalized Domain
>     Names (IDN) [RFC3490] either in the ireg-name part or elsewhere.
>     Character Normalization also applies to IDNs, as discussed in
>     Section 5.3.3.

This again is something where we have to look at the interaction of IDNA 
2003 and IDNA 2008.

> ========
> Converting host names in URIs to I18N host names:
>      punycode to Unicode

The punycode to Unicode part is easier with respect to IDNA 2003 vs. 
IDNA 2008 because we don't have to take mapping into account, and 
special characters such as sharp-s or final sigma seem to pose no 
difficulties (if they are in punycode, just convert them back; we need a 
warning that in certain contexts, the conversion may not roundtrip).
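
As an illustration of both the easy direction and the roundtrip caveat, 
here is a small Python sketch. It assumes Python's built-in IDNA 2003 
codec, and reg_name_to_unicode is a name made up for this example. Under 
IDNA 2003 mapping, sharp-s is mapped to "ss" before encoding, so the 
original spelling cannot be recovered from the ASCII form.

```python
def reg_name_to_unicode(reg_name):
    """Convert an ASCII host name (possibly with xn-- labels) to Unicode."""
    return reg_name.encode("ascii").decode("idna")


# The easy direction: punycode labels are simply converted back.
unicode_host = reg_name_to_unicode("xn--rsum-bpad.example.org")

# The roundtrip caveat: IDNA 2003 maps sharp-s to "ss" on the way in.
ascii_host = "stra\u00dfe.example".encode("idna").decode("ascii")
roundtripped = reg_name_to_unicode(ascii_host)
```

Here unicode_host is "résumé.example.org", while ascii_host is 
"strasse.example" and roundtripped stays "strasse.example" rather than 
the original "straße.example".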

> =======
> Comparing host names:
>      case insensitivity for ascii
>      dealing with variant forms?

The comparison ladder also needs some text to deal with IDNA 2003 vs. 
IDNA 2008, but because the comparison ladder doesn't need a single way 
of doing things, such text is comparatively easy to write.
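
One possible shape for such a ladder, sketched in Python. Everything 
here is an assumption for illustration: the function name compare_hosts, 
the rung labels, and in particular the choice of Python's IDNA 2003 
codec for the last rung. An IDNA 2008 based rung could be added or 
substituted without changing the structure.

```python
import string

# Lowercase only A-Z, leaving non-ASCII characters untouched.
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def compare_hosts(a, b):
    """Return the first rung at which two host names match, else None."""
    if a == b:
        return "identical"
    if a.translate(ASCII_LOWER) == b.translate(ASCII_LOWER):
        return "ascii-case"
    try:
        # Last rung: compare the ToASCII forms (IDNA 2003 via Python's codec).
        if a.encode("idna").lower() == b.encode("idna").lower():
            return "idna-2003"
    except UnicodeError:
        pass  # A name that cannot be converted matches nothing on this rung.
    return None
```

With this sketch, "EXAMPLE.org" and "example.ORG" match at the 
"ascii-case" rung, and "résumé.example.org" matches 
"xn--rsum-bpad.example.org" only at the "idna-2003" rung.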

> ========
> processing of "host" header?

This definitely has to go into the HTTP spec. If the HTTPbis WG doesn't 
have an issue for this yet, please let's ask Mark to open one.


We would also need some advice on creating/generating the ireg-name part 
of an IRI.


Regards,   Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
Received on Thursday, 5 November 2009 10:53:25 UTC