RE: IRI-Everywhere from Julian Reschke on 2002-11-20 (www-tag@w3.org from November 2002)

From: Julian Reschke <julian.reschke@gmx.de>
Date: Wed, 20 Nov 2002 22:10:15 +0100
To: "Martin Duerst" <duerst@w3.org>, <www-tag@w3.org>
Cc: <www-international@w3.org>
Message-ID: <JIEGINCHMLABHJBIGKBCMEHBFNAA.julian.reschke@gmx.de>
> From: www-tag-request@w3.org [mailto:www-tag-request@w3.org]On Behalf Of
> Martin Duerst
> Sent: Monday, November 18, 2002 11:50 PM
> To: Julian Reschke; www-tag@w3.org
> Cc: www-international@w3.org
> Subject: RE: IRI-Everywhere
>
>
>
> Hello Julian,

Martin,

thanks for the feedback....

> Many thanks for your comments. I have copied www-international,
> where comments about the IRI draft should be sent/copied.
>
> At 22:48 02/11/13 +0100, Julian Reschke wrote:
>
> >I'd like to add a few facts that seem to get overlooked again and again :-)
> >
> >1) Allowing the space character in IRIs makes it impossible to use the
> >space character as delimiter between IRIs. Specs that as of now use
> >white-space separated lists of URIs (such as XML Schema for
> >namespaceLocation) *will* break if an IRI contains a space character.
>
> Please note that the IRI spec doesn't advocate using spaces; it clearly
> warns against them. If you see a good way to make this warning clearer,
> please tell me.

Well. Does it really make sense to *allow* them, and then later to warn
against them? What use case do space characters in IRIs solve? IMHO they
create unnecessary problems without visible benefits, and they don't seem to
fit into the internationalization agenda.
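
To make the delimiter problem concrete, here is a minimal Python sketch of what a consumer of a white-space separated list does. The function name and the example.org identifiers are made up for illustration; the attribute I have in mind is something like XML Schema's schemaLocation, which holds pairs of identifiers.

```python
def split_location_list(value: str) -> list:
    """Split a white-space separated list of identifiers, the way a
    processor of such an attribute would (hypothetical helper)."""
    return value.split()


# Hypothetical identifiers, for illustration only.
plain = "http://example.org/ns http://example.org/ns.xsd"
spaced = "http://example.org/my ns http://example.org/ns.xsd"

print(split_location_list(plain))   # two items, as intended
print(split_location_list(spaced))  # three items: the pairing is lost
```

Once an identifier may contain a space, the consumer has no way to tell which spaces are delimiters.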

> >2) IRIs are not URIs (well, many of them). "Silently" replacing URI (refs)
> >by IRI (refs) in spec revisions (such as XML namespaces) potentially
> >breaks applications that assume URI-ness (such as that only ASCII
> >characters are used).
>
> The discussion currently is not about a silent replacement, but
> about a new version of the namespaces spec.

I chose the term "silently" because this change suddenly appeared in the
latest namespaces draft without having been discussed before (at least it
was only the last call announcement that caught my attention).

> >3) QName vs URI: if XML namespaces allow IRI refs as namespace names, the
> >issue of mapping QNames to URIs will get even messier than it *already* is.
>
> Because the conversion from IRIs to URIs is well-defined, I think
> saying it gets messier is not at all appropriate. There may be an
> additional step, but this step is well-defined.

How does that help? Several IRIs can map to the same URI (*), so this
doesn't help at all when trying to define a bijective mapping. Furthermore,
unless IRI finally decides about %hh vs %HH, the mapping to URIs is
ambiguous as well.

(*) for instance, the IRIs "foo:€" and "foo:%e2%82%ac" map to the same URI.
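
A rough Python sketch of both points, assuming the first IRI contains the euro sign (U+20AC, UTF-8 bytes E2 82 AC). The helpers iri_to_uri and upper_hex are my own names, and the conversion is deliberately simplified: it escapes non-ASCII characters as UTF-8 and leaves everything else, including existing escapes, untouched.

```python
import re
from urllib.parse import quote


def iri_to_uri(iri: str) -> str:
    """Simplified IRI-to-URI step: percent-encode non-ASCII characters
    as UTF-8, keep ASCII reserved characters and existing escapes."""
    return quote(iri, safe="%:/?#[]@!$&'()*+,;=")


def upper_hex(uri: str) -> str:
    """Uppercase the hex digits of every %hh escape (one possible convention)."""
    return re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), uri)


a = iri_to_uri("foo:\u20ac")     # literal euro sign
b = iri_to_uri("foo:%e2%82%ac")  # already escaped, lowercase hex

print(a)                              # foo:%E2%82%AC
print(b)                              # foo:%e2%82%ac
print(a == b)                         # False: differs only in hex case
print(upper_hex(a) == upper_hex(b))   # True: two IRIs, one URI
```

So the result depends on which hex case the converter happens to emit, and after fixing the case two distinct IRIs collapse into one URI, which rules out a bijection.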


> >For the record: I'm in favor of IRIs
>
> Thanks!
>
>
> >- if they stay focused on the issue of I18N -- allowing whitespace in the
> >identifier does not really fit into this requirement (IMHO),
>
> Well, yes. But then one could argue that there are a lot of other
> spaces in Unicode. So somebody who would like to use some spacing
> in an IRI could always use a non-breaking space, or something similar.
> Of course, this would be bad because the machine would notice the
> difference between the usual space and the non-breaking space,
> but users may not.

Yet another reason to forbid either the space character or, better, all
characters that render as space.
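
A small Python sketch of the look-alike problem. It uses the Unicode general category Zs (space separators) as an approximation of "renders as space"; that is an assumption on my side, since other characters (tabs, zero-width characters) are not in Zs. The identifiers are made up.

```python
import unicodedata


def is_space_like(ch: str) -> bool:
    """True for characters in Unicode general category Zs."""
    return unicodedata.category(ch) == "Zs"


ordinary = "foo:a b"        # U+0020 SPACE
nbsp = "foo:a\u00a0b"       # U+00A0 NO-BREAK SPACE

print(ordinary == nbsp)                                  # False
print([hex(ord(c)) for c in nbsp if is_space_like(c)])   # ['0xa0']
print([hex(ord(c)) for c in ordinary if is_space_like(c)])  # ['0x20']
```

The two strings look the same to a user but compare as different identifiers.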

> >- if it's always clear whether you're looking at an IRI or a URI (when
> >specs "upgrade" from URI to IRI, this may require out-of-band information),
>
> Well, that's always fairly easy to check, or isn't it.
> The issue is knowing whether you may expect an IRI, or not.

It's necessary to know when comparing them. URIs and IRIs are different
things, so you can't just compare them character-by-character.

> >- if they require full normalization
>
> Can you please explain what you mean by that?
> If you mean normalization as in getting rid of alternative
> forms of codepoint sequences that the Unicode Standard defines

Yes.

> as canonically equivalent, e.g. by using NFC, then the current
> design for IRIs is based on the following assumptions:
>
> - There may be some specific needs to use non-NFC data in an IRI,
>    the clearest example is that of a form that allows non-normalized
>    input and sends back normalized output, which leads to an IRI
>    with some non-normalized data after the '?'.

Why would you want to allow that?

> - Whenever there are no specific needs, use NFC. This in particular
>    applies when converting from a legacy encoding (such as iso-8859-1)
>    to a Unicode-based encoding.
>
> Because of the extremely broad scope of URIs/IRIs, I'm not sure
> we can be more strict than that.
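
For readers who have not run into canonical equivalence: a minimal Python sketch of what NFC does to two spellings of the same text. The sample word is arbitrary.

```python
import unicodedata

composed = "caf\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(len(composed), len(decomposed))                        # 4 5
```

Without normalization, identifiers containing these two strings look identical and compare as different.
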
>
>
> >Proposal/Question:
> >
> >- would it make sense to deprecate URIs that can not be transformed to
> >IRIs (that is, those URIs that do contain %-escapes that do not map to
> >UTF-8 representations of fully normalized Unicode strings)?
>
> We can, and hopefully will, deprecate the creation of new URIs that
> contain such escapes. We cannot deprecate the use of such URIs where
> they already exist, because we don't want to have existing URIs to go
> away. Also, there may be some cases (kind of similar to the data: URI)
> that encode completely binary information. Although I haven't seen any
> actual examples, and I don't think many will turn up (the data: scheme
> encodes binary data as base64).

(puzzled) I don't understand this requirement. If you really need this you
can do this now with URIs and occasional percent encoding.
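
For reference, the class of URIs my question was about can be tested mechanically. A Python sketch, with maps_to_iri as a made-up name; it is simplified in that it decodes the whole string at once instead of looking at each URI component separately.

```python
import unicodedata
from urllib.parse import unquote_to_bytes


def maps_to_iri(uri: str) -> bool:
    """True if the %-escapes in `uri` decode to valid UTF-8 text that is
    already in NFC (simplified check over the whole string)."""
    try:
        text = unquote_to_bytes(uri).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return unicodedata.normalize("NFC", text) == text


print(maps_to_iri("foo:%e2%82%ac"))  # True: UTF-8 for U+20AC
print(maps_to_iri("foo:%e4"))        # False: a lone 0xE4 byte is not UTF-8
print(maps_to_iri("foo:e%cc%81"))    # False: valid UTF-8, but not NFC
```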

> All this is of course up to the revision of RFC 2396,
> rather than the IRI spec.

Yes.

To summarize: please keep the differences between URIs and IRIs as small as
possible to achieve I18N, in particular do *not* add ASCII characters that
are forbidden in URIs (as per RFC2396bis).

--
<green/>bytes GmbH -- http://www.greenbytes.de -- tel:+492512807760
Received on Wednesday, 20 November 2002 16:10:50 UTC