Re: Comments on draft-hansen-iri-4395bis-irireg-00.txt from Martin J. Dürst on 2010-10-05 (public-iri@w3.org from October 2010)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Tue, 05 Oct 2010 18:14:45 +0900
To: Bjoern Hoehrmann <derhoermi@gmx.net>
CC: public-iri@w3.org
Message-ID: <4CAAEC85.8040109@it.aoyama.ac.jp>

Hello Björn,

I'm trying to understand the main point of your mail.

On 2010/10/05 13:43, Bjoern Hoehrmann wrote:

>    http://tools.ietf.org/html/draft-hansen-iri-4395bis-irireg-00 notes
> "Previously, those who wish to describe resource identifiers that are
> useful as IRIs were encouraged to define the corresponding URI syntax,
> and note that the IRI usage follows the rules and transformations
> defined in [6]. This document changes that advice to encourage explicit
> definition of the scheme and allowable syntax elements within the larger
> character repertoire of IRIs, as defined by [7]."

> I am concerned that this would further draw a distinction between the
> characters that occur literally in an identifier and characters that
> are percent-encoded. I am not entirely sure in fact how to read RFC
> 3987 on this (it starts out saying it's just like URIs, except that
> there are more unreserved characters,

Yes.

> but then excludes private use
> code points from the set of unreserved characters).

Well, yes. I don't understand what point you are trying to make here.
Even with the private use codepoints excluded, far more characters are
available than with US-ASCII alone.


> Let's say I make a scheme where the scheme-specific part can only be
> "ö". Since "ö" is an unreserved character, I might be inclined to say
>
>    def = "example:" %x00F6;
>
> but that would not work as "example:%c3%b6" is essentially defined as
> equivalent to "example:ö". The definition would have to account for a
> level of indirection at some point to remove percent-encoding, so I'd
> think you cannot quite distinguish between defining an URI scheme and
> an IRI scheme,

Are you saying that any (IRI) scheme definition has to make sure that
its syntax includes (UTF-8-based) percent-encoding fallbacks for all the
non-ASCII characters in the syntax? That is definitely important,
because otherwise conversion of your "example:ö" IRI to the URI
"example:%C3%B6" (upper case is preferred for hex digits in URIs, so I'm
using that) may not be allowed, and "example:%C3%B6" may not be allowed
as an IRI either (e.g. in a Web page in Shift_JIS, where "ö" cannot be
expressed directly). Following that reasoning, your scheme would have to
be defined as:

def = "example:" (%x00F6 / "%C3%B6")

In that simple case, that wouldn't be too much trouble. But we can 
imagine some more realistic schemes where the grammar might blow up quickly.
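
To make the enumerated grammar concrete, here is a minimal Python sketch. It assumes the hypothetical "example:" scheme from this thread; the helper names `iri_to_uri` and `matches_enumerated` are invented for illustration. The regular expression mirrors the ABNF above, including the fact that ABNF quoted strings are case-insensitive while `%x00F6` names exactly one code point.

```python
import re
from urllib.parse import quote

# Characters quote() must leave alone: URI reserved characters plus "%",
# so that existing escapes are not double-encoded (a simplification).
_KEEP = ":/?#[]@!$&'()*+,;=%"

def iri_to_uri(iri: str) -> str:
    """Percent-encode non-ASCII characters as UTF-8 octets (upper-case hex)."""
    return quote(iri, safe=_KEEP, encoding="utf-8")

# Mirrors: def = "example:" (%x00F6 / "%C3%B6")
# Quoted ABNF strings match case-insensitively; %x00F6 does not.
_ENUMERATED = re.compile(r"(?i:example:)(?:\u00F6|(?i:%C3%B6))")

def matches_enumerated(identifier: str) -> bool:
    """True if the identifier matches the enumerated grammar exactly."""
    return _ENUMERATED.fullmatch(identifier) is not None

if __name__ == "__main__":
    print(iri_to_uri("example:ö"))
    print(matches_enumerated("example:%c3%b6"))
```

Every additional non-ASCII character in a scheme would need its own alternative of this kind, which is how the grammar blows up.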

So I think we should also consider other solutions. One solution would
be to define the syntax only in terms of UCS characters (i.e. IRI), and
specify that any percent-escaping of the allowed UCS characters is also
allowed. This could be done on a per-scheme basis, or could be declared
a general rule (currently, it's pretty much something that follows from
RFC 3987, but I don't think it's explicit anywhere).
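
One way to picture this alternative is a checker that first undoes UTF-8 percent-escapes and then applies the grammar at the UCS character level. This is only a sketch under assumptions: the function name `matches_ucs_level` is invented, the scheme prefix is compared case-sensitively, and decoding every escape is safe here only because the hypothetical scheme allows no reserved characters in its scheme-specific part.

```python
from urllib.parse import unquote

PREFIX = "example:"
ALLOWED_SSP = "\u00F6"  # the only scheme-specific part the grammar allows: "ö"

def matches_ucs_level(identifier: str) -> bool:
    """Decode UTF-8 percent-escapes, then match the UCS-level grammar."""
    if not identifier.startswith(PREFIX):
        return False
    try:
        # errors="strict" rejects escapes that are not valid UTF-8,
        # e.g. "%F6" (the Latin-1 octet for "ö").
        decoded = unquote(identifier[len(PREFIX):],
                          encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return decoded == ALLOWED_SSP

if __name__ == "__main__":
    for candidate in ("example:ö", "example:%C3%B6", "example:%F6"):
        print(candidate, matches_ucs_level(candidate))
```

With this approach the grammar itself stays a single line, and the percent-encoding fallback is handled once, outside the grammar.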


> so far the only difference could be in percent-encoded
> private use characters.

Are you saying that when you explicitly allow <pct-encoded>, and you
also have <iunreserved>, then the only thing you add are private use
codepoints? That's actually not completely true; you also add C0 and C1
controls and <reserved>.
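
The point can be checked mechanically with a small classifier. The sketch below is illustrative only: the function names are invented, and the code point ranges are transcribed from the `ucschar` and `iprivate` productions of RFC 3987 and should be verified against the RFC before being relied on.

```python
from urllib.parse import unquote

# ucschar ranges per RFC 3987: BMP ranges, planes 1-13, then plane 14.
UCSCHAR = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
UCSCHAR += [(p << 16, (p << 16) + 0xFFFD) for p in range(1, 14)]
UCSCHAR.append((0xE1000, 0xEFFFD))

IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

RESERVED = set(":/?#[]@!$&'()*+,;=")  # gen-delims and sub-delims

def _in_ranges(cp: int, ranges) -> bool:
    return any(lo <= cp <= hi for lo, hi in ranges)

def classify(ch: str) -> str:
    """Name the character class a single character falls into."""
    cp = ord(ch)
    if cp <= 0x1F:
        return "C0 control"
    if 0x80 <= cp <= 0x9F:
        return "C1 control"
    if ch in RESERVED:
        return "reserved"
    if ch.isascii() and (ch.isalnum() or ch in "-._~"):
        return "iunreserved"
    if _in_ranges(cp, UCSCHAR):
        return "iunreserved"
    if _in_ranges(cp, IPRIVATE):
        return "iprivate"
    return "other"

def classify_escape(escape: str) -> str:
    """Classify the single character a UTF-8 percent-escape decodes to."""
    ch = unquote(escape, encoding="utf-8", errors="strict")
    if len(ch) != 1:
        raise ValueError("expected an escape for exactly one character")
    return classify(ch)

if __name__ == "__main__":
    for esc in ("%C3%B6", "%EE%80%80", "%C2%85", "%0A", "%2F"):
        print(esc, classify_escape(esc))
```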


> I'd rather remove that difference, and am not
> sure what the actual change there would be.

Do you mean you want to allow private use codepoints when you define a 
scheme such as:

def = "example:" %x00F6;

Or under some other circumstances?

Sorry for so many questions (some of which might look silly to you);
I'm just trying to make sure we understand each other.


Regards,   Martin.


-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
Received on Tuesday, 5 October 2010 09:16:15 UTC