Some issues with the IRI document from Paul Hoffman / IMC on 2003-04-08 (public-iri@w3.org from April 2003)

From: Paul Hoffman / IMC <phoffman@imc.org>
Date: Tue, 8 Apr 2003 08:14:57 -0700
To: public-iri@w3.org
Message-Id: <p05210649bab89730a167@[63.202.92.152]>

[[ I sent this to Martin a few weeks ago, but we agreed that it was 
best to bring it up on this new list. And, just to be clear, I don't 
think that IRIs should be debated to death, but the document needs to 
be clear. ]]

I think the document is fairly good, although I'm not much of a URI 
person, so I could be way off. After studying it this morning, I see 
where I got confused. I also disagree with some technical choices you 
make later in the spec.

Clarifications:

Section 1.1 needs to be a bit longer, and possibly split into two 
parts. You need more emphasis here that you are describing something 
that will go into protocol elements. In addition, you need an 
explicit discussion here about the difference between characters and 
encoded characters. You have this in section 2, but it is so 
important to understanding the applicability that it needs to be in 
1.1.

Subsection (c) in 1.2 is a mess and is probably where I really lost 
it. The first sentence has too many subordinate clauses in it. But 
worse, you introduce UTF-8. That's where I got confused about 
characters vs. encoding. I still don't know why UTF-8 is brought up 
here. I propose that you start over on this subsection.

The last paragraph of 1.2 is confusing in the middle where you talk 
about UTF-8. The single octet 0xE9 is not the UTF-8 representation of 
any character. Even though the example is wrong, it got me stuck in 
UTF-8 mode, which helped keep me thinking that you were sometimes 
talking about the encoding.
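
The point about 0xE9 can be checked with a short Python sketch. This 
assumes the draft's example concerns U+00E9 (e with acute accent), 
which is a guess from the octet value; the variable names are 
illustrative only.

```python
# A lone 0xE9 octet, as in the example under discussion.
lone = bytes([0xE9])

# In UTF-8, 0xE9 is the lead octet of a three-octet sequence, so on its
# own it does not decode to any character.
try:
    lone.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# The same octet is a complete character in ISO-8859-1 (Latin-1).
as_latin1 = lone.decode("latin-1")

# The UTF-8 encoding of U+00E9 is two octets, neither of which is 0xE9.
utf8_for_e_acute = "\u00e9".encode("utf-8")

print(valid_utf8, repr(as_latin1), utf8_for_e_acute.hex())
```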

Technical issues:

You use NFC in Section 3.1. This goes against the theme of the 
guidelines in section 6. NFKC will cause less surprise if an IRI 
contains compatibility characters, so you should use NFKC instead, 
regardless of the history of NFC in the W3C.
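
To make the NFC/NFKC difference concrete, here is a minimal Python 
sketch using the standard unicodedata module. The sample strings are 
assumptions chosen for illustration and do not come from the IRI 
draft.

```python
import unicodedata

# "file" spelled with U+FB01 LATIN SMALL LIGATURE FI, a compatibility
# character (illustrative input, not from the draft).
with_ligature = "\ufb01le"

nfc = unicodedata.normalize("NFC", with_ligature)
nfkc = unicodedata.normalize("NFKC", with_ligature)

# NFC leaves the compatibility character alone; NFKC folds it to "fi".
print(nfc == with_ligature, nfkc == "file")

# For a purely canonical difference (e + combining acute), both forms
# produce the same precomposed character.
decomposed = "e\u0301"
same = (unicodedata.normalize("NFC", decomposed)
        == unicodedata.normalize("NFKC", decomposed)
        == "\u00e9")
print(same)
```

Under NFC, two IRIs that differ only by such a compatibility 
character remain distinct strings; under NFKC they compare equal.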

I do not understand the logic of having Variants (B) and (C) in step 
1 in section 3.1. One is normalized, the other one isn't. Doesn't 
this sound like a recipe for disaster? Why did you differentiate 
between these two cases?

Does the bidi processing in section 4 match what is specified in 
Nameprep? If not, are there cases where a stand-alone IDN name will 
be displayed differently than the same name in an IRI? That would be 
a complete show-stopper, if true.

I think that section 5.1 (b) is a bad mistake. The four reasons you 
give are not strong enough to justify something that seems likely to 
cause huge conversion problems. I can also see this causing security 
problems.

--Paul Hoffman, Director
--Internet Mail Consortium

Received on Tuesday, 8 April 2003 12:06:08 UTC