RE: IRI-Everywhere from Martin Duerst on 2002-11-18 (www-tag@w3.org from November 2002)

From: Martin Duerst <duerst@w3.org>
Date: Tue, 19 Nov 2002 07:50:25 +0900
To: "Julian Reschke" <julian.reschke@gmx.de>, <www-tag@w3.org>
Cc: www-international@w3.org
Message-Id: <4.2.0.58.J.20021116045050.079785d0@localhost>

Hello Julian,

Many thanks for your comments. I have copied www-international,
where comments about the IRI draft should be sent/copied.

At 22:48 02/11/13 +0100, Julian Reschke wrote:

>I'd like to add a few facts that seem to get overlooked again and again :-)
>
>1) Allowing the space character in IRIs makes it impossible to use the 
>space character as delimiter between IRIs. Specs that as of now use 
>white-space separated lists of URIs (such as XML Schema for 
>namespaceLocation) *will* break if an IRI contains a space character.

Please note that the IRI spec doesn't advocate using spaces; it clearly
warns against them. If you see a good way to make this warning clearer,
please tell me.
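
To make the delimiter problem concrete, here is a minimal Python
sketch. The attribute value and the example.org addresses are made up
for illustration; they are not taken from any spec.

```python
# Hypothetical whitespace-separated list that is meant to hold two
# identifiers. The second one contains a literal space.
value = "http://example.org/ns http://example.org/my schema.xsd"

# A consumer that splits on whitespace cannot tell the separator from
# the space inside the second identifier.
tokens = value.split()

print(len(tokens))  # three tokens, although two identifiers were intended
print(tokens)
```

A consumer that expects pairs would now misread the list, which is
why the warning against spaces matters for such formats.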


>2) IRIs are not URIs (well, many of them). "Silently" replacing URI (refs) 
>by IRI (refs) in spec revisions (such as XML namespaces) potentially 
>breaks applications that assume URI-ness (such as that only ASCII 
>characters are used).

The discussion currently is not about a silent replacement, but
about a new version of the namespaces spec.


>3) QName vs URI: if XML namespaces allow IRI refs as namespace names, the 
>issue of mapping QNames to URIs will get even messier than it *already* is.

Because the conversion from IRIs to URIs is well-defined, I think
saying it gets messier is not at all appropriate. There may be an
additional step, but this step is well-defined.
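
As a rough illustration of that additional step, here is a simplified
Python sketch of the core idea: encode each non-ASCII character as
UTF-8 and %-escape the resulting bytes. The function name iri_to_uri
is my own, and the sketch deliberately ignores details such as special
handling of host names, so it should be read as an approximation
rather than as the procedure in the draft.

```python
def iri_to_uri(iri: str) -> str:
    """Simplified IRI-to-URI mapping (illustrative assumption only).

    ASCII characters are copied unchanged; every other character is
    encoded as UTF-8 and each resulting byte is written as %XX.
    """
    out = []
    for ch in iri:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


print(iri_to_uri("http://example.org/r\u00e9sum\u00e9"))
```

Because the mapping is deterministic, two parties that start from the
same IRI arrive at the same URI.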


>For the record: I'm in favor of IRIs

Thanks!


>- if they stay focused on the issue of I18N -- allowing whitespace in the 
>identifier does not really fit into this requirement (IMHO),

Well, yes. But then one could argue that there are a lot of other
spaces in Unicode. So somebody who would like to use some spacing
in an IRI could always use a non-breaking space, or something similar.
Of course, this would be bad because the machine would notice the
difference between the usual space and the non-breaking space,
but users may not.
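
A small Python sketch of that mismatch (the example.org strings are
made up): the two identifiers below are likely to look alike on
screen, but they compare as different and have different UTF-8 bytes.

```python
import unicodedata

with_space = "http://example.org/my doc"
with_nbsp = "http://example.org/my\u00a0doc"  # U+00A0 instead of U+0020

# The machine sees two different identifiers.
print(with_space == with_nbsp)

# The differing characters, by Unicode name and by UTF-8 bytes.
print(unicodedata.name(" "), " ".encode("utf-8"))
print(unicodedata.name("\u00a0"), "\u00a0".encode("utf-8"))
```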


>- if it's always clear whether you're looking at an IRI or a URI (when 
>specs "upgrade" from URI to IRI, this may require out-of-band information),

Well, that's always fairly easy to check, isn't it?
The real issue is knowing whether you may expect an IRI or not.


>- if they require full normalization

Can you please explain what you mean by that?
If you mean normalization as in getting rid of alternative
forms of codepoint sequences that the Unicode Standard defines
as canonically equivalent, e.g. by using NFC, then the current
design for IRIs is based on the following assumptions:

- There may be some specific needs to use non-NFC data in an IRI;
   the clearest example is that of a form that allows non-normalized
   input and sends back normalized output, which leads to an IRI
   with some non-normalized data after the '?'.

- Whenever there are no specific needs, use NFC. This in particular
   applies when converting from a legacy encoding (such as iso-8859-1)
   to a Unicode-based encoding.

Because of the extremely broad scope of URIs/IRIs, I'm not sure
we can be more strict than that.
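
For readers less familiar with NFC, a short Python sketch using the
standard unicodedata module. The sample strings are my own; the second
half shows the legacy-conversion case under the assumption that the
input bytes are iso-8859-1.

```python
import unicodedata

composed = "\u00e9"      # e with acute accent as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# Canonically equivalent, but different code point sequences.
print(composed == decomposed)

# NFC maps both to the composed form.
print(unicodedata.normalize("NFC", decomposed) == composed)

# Converting from a legacy encoding: decode first, then apply NFC.
legacy_bytes = b"caf\xe9"  # assumed to be iso-8859-1
text = unicodedata.normalize("NFC", legacy_bytes.decode("iso-8859-1"))
print(text.encode("utf-8"))
```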


>Proposal/Question:
>
>- would it make sense to deprecate URIs that can not be transformed to 
>IRIs (that is, those URIs that do contain %-escapes that do not map to 
>UTF-8 representations of fully normalized Unicode strings)?

We can, and hopefully will, deprecate the creation of new URIs that
contain such escapes. We cannot deprecate the use of such URIs where
they already exist, because we don't want existing URIs to go away.
Also, there may be some cases (somewhat similar to the data: URI
scheme) that encode completely binary information, although I haven't
seen any actual examples, and I don't think many will turn up (the
data: scheme encodes binary data as base64).
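
A rough Python sketch of the test Julian's question implies: does a
URI's %-escaped content decode as UTF-8, and is the result already in
NFC? The function name is hypothetical, and for simplicity the sketch
unescapes the whole URI at once instead of component by component.

```python
import unicodedata
from urllib.parse import unquote_to_bytes


def escapes_map_to_nfc_utf8(uri: str) -> bool:
    """Return True if the unescaped URI is valid UTF-8 in NFC form.

    Illustrative assumption: the whole URI string is unescaped at once.
    """
    raw = unquote_to_bytes(uri)
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False  # escapes are not UTF-8 (e.g. a legacy encoding)
    return unicodedata.normalize("NFC", text) == text


print(escapes_map_to_nfc_utf8("http://example.org/caf%C3%A9"))   # UTF-8, NFC
print(escapes_map_to_nfc_utf8("http://example.org/caf%E9"))      # iso-8859-1 byte
print(escapes_map_to_nfc_utf8("http://example.org/cafe%CC%81"))  # UTF-8, not NFC
```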

All this is of course up to the revision of RFC 2396,
rather than the IRI spec.


Regards,    Martin.
Received on Monday, 18 November 2002 17:51:07 UTC