IRI issues (in quite some detail)

This is a laundry list of issues that have come up on the IRI spec 
update. Related issues are grouped together where possible. I hope 
this is a fairly complete initial pass, but I'm sure a few things are 
still missing.

In your replies, please keep additions of new issues separate from 
discussion of specific issues.

- %encoding vs. punycode when converting from IRI to URI
   (see mail by Roy:
    and I-D by Dave Thaler:

- Update of Bidi section:
   - allow combining marks at end of component
   - adapt component restrictions to those in [IDNA-Bidi]
   - check about other syntactic characters (not only dot)
     and payload characters (e.g. %)
   [- rework examples]

- IDNA 2003 vs. IDNA 2008:
   - to map or not to map for IRI->URI and on resolution in general
     - what mapping to use (see
       for a potential direction)
     - what to do about ß (sharp s) and ς (final sigma)
       - short term
       - long term
   - advice for authors:
     - Always use prepped (in IDNA 2003 terminology) or
       legal U-Label (in IDNA 2008 terminology)
     - Avoid separators other than '.'
     - Avoid IDNs that are not legal in either IDNA 2003 or 2008 ?
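
To make the first and third items above concrete, here is a minimal
Python sketch of the two candidate conversions for a host name, and of
the IDNA 2003 mapping of ß and ς. The function names are made up for
illustration. The sketch assumes Python's built-in "idna" codec, which
implements IDNA 2003 (Nameprep followed by Punycode); it says nothing
about which conversion the spec should choose, and it does not show
IDNA 2008 behaviour.

```python
from urllib.parse import quote


def host_percent_encoded(host: str) -> str:
    """Percent-encode the UTF-8 bytes of every non-ASCII character."""
    # '.' is kept literal so the label structure stays visible.
    return quote(host, safe=".")


def host_punycode_2003(host: str) -> str:
    """Convert each label with Python's built-in IDNA 2003 codec."""
    # Nameprep runs first, so mapped characters (e.g. sharp s) are
    # replaced before Punycode is applied.
    return host.encode("idna").decode("ascii")


if __name__ == "__main__":
    for host in ("bücher.example", "straße.example", "βόλος.example"):
        print(host)
        print("  %-encoded:", host_percent_encoded(host))
        print("  IDNA 2003:", host_punycode_2003(host))
```

Under IDNA 2003 the sharp s is mapped to "ss" and the final sigma to a
regular sigma, so the original spelling cannot be recovered from the
resulting URI. That loss is what the short term / long term question
above is about.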

LEIRIs and HTML5 references

- Are there other "main areas" (like XML and HTML) that warrant similar
   'preferential treatment' [let's really hope not] (see also
   (way incomplete))

- Naming these explicitly (or not)
   - What's the best name for HTML5 references

- Using syntax or procedure for definition
   (syntax seems to work better for the requirements of XML and LEIRIs,
    procedure may work better for HTML5)

- Place in spec: Appendix? Separate section (for each, or for both
   together?)? As part of a section 5 (Normalization and Comparison;
   probably not, seems confusing to many people)

- Mix with main IRI->URI procedure or not (ideally separate, but may
   not be easy for some aspects)

- What to keep in 'host' specs (e.g. definition of whitespace?)

HTML5 reference specific issues

- '\' as path separator

- '#' in fragment identifiers

- '[' and ']' other than for IPv6 literals

- Processing of other characters not allowed

- treatment of lonely '%' (not followed by 2 hex digits)

- special behavior for encoding in http: and https: query parts
   (use document encoding if available instead of UTF-8)

- some more (to be completed, including pointers to relevant documents 
from Anne)

- How to advise authors,... against using 'bugwards-compatible' features
   (completed for LEIRIs, needs to be discussed and done for HTML5)
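
Two of the items above (the lonely '%' and the document encoding for
query parts) can be sketched in Python as follows. This is only one
possible treatment, written to show what the questions are about. The
function names, and the choice to escape a lonely '%' as "%25", are
assumptions and not something the list above has settled.

```python
import re
from typing import Optional
from urllib.parse import quote

# A '%' that is NOT followed by two hex digits.
_LONE_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def escape_lone_percent(text: str) -> str:
    """Escape each lonely '%' as '%25'; valid escapes are left alone."""
    return _LONE_PERCENT.sub("%25", text)


def encode_query(query: str, document_encoding: Optional[str] = None) -> str:
    """Percent-encode a query part using the document encoding if given.

    Falls back to UTF-8. Characters that the document encoding cannot
    represent raise UnicodeEncodeError in this sketch.
    """
    # '=', '&' and '%' are kept literal so that key/value structure and
    # existing escapes survive.
    return quote(query, safe="=&%", encoding=document_encoding or "utf-8")


if __name__ == "__main__":
    print(escape_lone_percent("a%20b%zz"))
    print(encode_query("q=café"))
    print(encode_query("q=café", "iso-8859-1"))
```

The same query yields different URIs depending on the document
encoding, which is why this behavior is listed as special to http: and
https: query parts.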

IRI issues (not already mentioned above)

Registration issues

- Allow definition of URI schemes simply in terms of IRIs?

- What other adjustments needed resulting from issues above?

Issues for individual schemes

- Piggybacking mailto:
   - Allowing UTF-8 officially where current email infrastructure
     does allow it
   - Fixing other issues in mailto:

- Updating mailto: for EAI (or creating a new scheme)

- Others?

URI issues (potentially)?
- do '[' and ']' need to be forbidden in URIs
- does '#' need to be forbidden in URI fragment parts

Regards,   Martin.

#-# Martin J. Dürst, Professor, Aoyama Gakuin University

Received on Monday, 12 October 2009 05:49:48 UTC