- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Thu, 05 Nov 2009 19:52:30 +0900
- To: Larry Masinter <masinter@adobe.com>
- CC: "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Hello Larry,

On 2009/11/05 8:19, Larry Masinter wrote:
> I'd wanted to get this out before the ID cut-off, but I'd
> gone around a few too many times, and want to start over.

Many thanks for all your work. This is very useful to see how this would
work out. I'm going to just shoot, to try to cover as many issues as I
can come up with.

> I've picked out the parts of the IRI document that talk
> about ireg-name and holding of host names, and made
> an outline of what I think could be moved into a separate
> document.

I think the idea of having a document that could be reviewed easily by
the IDNA community is very valuable. But having looked over the outline
below, I also have several doubts.

> ========================================================
> The document parts would say:
>
> * Applicability
> * Syntax (what's allowed)
> * Processing (what to do with what you have)
> * Translation (convert to a reg-name for use in a URI)
> * Reconversion (convert a URL reg-name into an ireg-name)?
> * Comparison (how to compare the ireg-names)
>
> These parts would be used by the "main" document which
> would have the same components.

Having two (or three?) documents with almost identical structure seems
to make things easy for certain reviewers, but needlessly difficult for
actual implementers and users.

Also, I think the stuff you extracted contains both protocol-like parts
and simple recommendations, so I'm not sure BCP (as currently suggested
in the Charter) would actually be appropriate.

I'm also worried that the wording for LEIRIs and HTML5 references will
get quite complicated, because they will have to change some grammar
productions that are spread across multiple documents.

> ======================================================
> Introduction
>
> This document describes syntax, processing, and
> comparison of the "ireg-name" component of IRIs.
> It is a separate document to focus discussion
> and coordination.

Do we know other examples in the IETF where specs were split up along
similar lines?

> =============
> Applicability
>
> These methods only apply to ireg-name parts of IRIs.
> Domain Names may appear in parts of an IRI other
> than the ireg-name part. It is the responsibility
> of scheme-specific implementations to apply the
> necessary conversion if needed otherwise.

Nit: remove "otherwise".

> For example if the Internationalized Domain Name
> is part of 'iquery' component of a HTTP URI, the
> interpretation of the domain name is up to the
> server, e.g., trying to validate the Web page at
> http://résumé.example.org
> would lead to an IRI of
> http://validator.w3.org/check?uri=http%3A%2F%2Frésumé.
> example.org, which would convert to a URI of
> http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.
> example.org. In this case, the server-side
> implementation is responsible for making the
> necessary conversions to be able to retrieve the Web
> page.
>
> =======
> Syntax:
>
> Currently, the IRI draft contains the following definition of
> ireg-name:
>
>    ireg-name      = *( iunreserved / sub-delims )
>    sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
>                   / "*" / "+" / "," / ";" / "="
>    iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
>    ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
>                   / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
>                   / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
>                   / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
>                   / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
>                   / %xD0000-DFFFD / %xE1000-EFFFD
>
> This doesn't seem right to me -- why would we allow
> sub-delims in domain names?

This is taken over from reg-name in RFC 3986. I think the main reason
for this is to not be overly restrictive. The DNS itself allows any
bytes, and there may be (currently or in the future) special URI/IRI
schemes that want or need to make use of e.g. a $ character in a domain
name.
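
To make the quoted grammar concrete, here is a minimal Python sketch of a
checker for the ireg-name production exactly as quoted above. It is only an
illustration under stated assumptions: the names UCSCHAR_RANGES, is_ucschar
and is_ireg_name are hypothetical and come from no spec or library, and the
check is purely syntactic (it says nothing about whether a string is a valid
IDN).

```python
import string

# Sketch only: all names here are hypothetical, chosen for illustration.

# sub-delims and the ASCII part of iunreserved, as in the quoted grammar.
SUB_DELIMS = set("!$&'()*+,;=")
UNRESERVED_ASCII = set(string.ascii_letters + string.digits + "-._~")

# ucschar ranges copied from the quoted grammar.
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 to 13: %x10000-1FFFD through %xD0000-DFFFD.
UCSCHAR_RANGES += [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 14)]
# Plane 14 starts at E1000 in the grammar.
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))


def is_ucschar(ch: str) -> bool:
    """True if the single character ch falls in one of the ucschar ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)


def is_ireg_name(name: str) -> bool:
    """True if every character matches iunreserved or sub-delims.

    The production is *( ... ), so the empty string is accepted.
    """
    return all(
        ch in UNRESERVED_ASCII or ch in SUB_DELIMS or is_ucschar(ch)
        for ch in name
    )


if __name__ == "__main__":
    for candidate in ["résumé.example.org", "a$b.example",
                      "r%C3%A9sum%C3%A9.example.org"]:
        print(candidate, is_ireg_name(candidate))
```

With the production as quoted, a name containing "$" is accepted, while a
percent-encoded name is rejected, because "%" appears in none of the
alternatives.
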

> What about %xx

Personally, I very much hope that this will go back in, for many good
reasons others have already given (I can summarize if needed).

> ==========
> Processing:
> IRI processors may process Unicode strings directly.
>
> As an example, the restrictions of [RFC3490] on bidirectional
> domain names correspond to treating each label of a domain name as
> a component for schemes with ireg-name as a domain name.

Having to split the bidi stuff across different documents doesn't look
attractive, because the bidi community definitely should look at all of
it, and implementers and users benefit from having it all in one place.
But this is not a dealbreaker; we can simply point to bidi
considerations elsewhere from this document.

> (give advice about how to invoke gethostname?)

For this, please see
http://tools.ietf.org/html/draft-iab-idn-encoding-00
for some details on issues.

> (handling of percent-encoded things that aren' allowed)

Does that refer to Unicode character strings that are not valid IDNs
(because one or more labels are not valid U-Labels)? Or does it refer to
something else?

Please note that if anywhere, this is the place where we have to deal
with the issues of IDNA 2003 vs. IDNA 2008 and with mapping (case,
NF(K)C,...). One question here is how and to what extent we can or
should use Unicode TR 46.

> =====
> Converting an ireg-name of IRI to a host name:
>
> Schemes that allow non-ASCII based characters in the reg-name (ireg-
> name) position MUST convert the ireg-name component of an IRI as
> follows:
>
> Replace the ireg-name part of the IRI by the part converted using the
> ToASCII operation specified in Section 4.1 of [RFC3490] on each dot-
> separated label, and by using U+002E (FULL STOP) as a label
> separator, with the flag UseSTD3ASCIIRules set to FALSE, and with the
> flag AllowUnassigned set to FALSE. The ToASCII operation may fail,
> but this would mean that the IRI cannot be resolved.
> In such cases,
> if the domain name conversion fails, then the entire IRI conversion
> fails. Processors that have no mechanism for signalling a failure
> MAY instead substitute an otherwise invalid host name, although such
> processing SHOULD be avoided.

This is very IDNA 2003-specific. We need to think about how to update
this for IDNA 2008.

> ((DESIGN QUESTION: What about e.g.
> http://r%C3%A9sum%C3%A9.example.org in an IRI? Will that get
> converted to punycode, or not?))

We have to distinguish conversion from IRI to URI and resolution. IRI
resolution in RFC 3987 was *defined* via conversion to URIs. That's why
these two issues easily get mixed up. But there was no need to
*implement* IRI resolution via conversion to URIs, and so actual
implementations in browsers may do things a bit differently. [I'm not
speaking about the differences that lead to accepting stuff that's not a
valid URI/IRI, or that leads to a different resolution, just different,
more direct, implementation.]

Simple conversion also often may happen in environments where not too
many resources are available. That's why uniform conversion (just use
UTF-8 and %HH) should be an option, also because it's better in the long
term. On the other hand, resolution often happens with scheme-specific
logic and knowledge about naming systems,... available.

So it may be helpful for implementers to more clearly distinguish
conversion and resolution.

> Various IRI schemes may allow the usage of Internationalized Domain
> Names (IDN) [RFC3490] either in the ireg-name part or elsewhere.
> Character Normalization also applies to IDNs, as discussed in
> Section 5.3.3.

This again is something where we have to look at the interaction of IDNA
2003 and IDNA 2008.

> ========
> Converting host names in URIs to I18N host names:
> punicode to Unicode

The punycode to Unicode part is easier with respect to IDNA 2003 vs.
IDNA 2008 because we don't have to take mapping into account, and
special characters such as sharp-s or final sigma seem to pose no
difficulties (if they are in punycode, just convert them back; need a
warning that in certain contexts, the conversion may not roundtrip).

> =======
> Comparing host names:
> case insensitivitiy for ascii
> dealing with variant forms?

The comparison ladder also needs some text to deal with IDNA 2003 vs.
IDNA 2008, but because the comparison ladder doesn't need a single way
of doing things, such text is comparatively easy to write.

> ========
> processing of "host" header?

This definitely has to go into the HTTP spec. If the HTTPbis WG doesn't
have an issue for this yet, please let's ask Mark to open one.

We would also need some advice on creating/generating the ireg-name part
of an IRI.

Regards,    Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
Received on Thursday, 5 November 2009 10:53:25 UTC