- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Tue, 05 Oct 2010 18:14:45 +0900
- To: Bjoern Hoehrmann <derhoermi@gmx.net>
- CC: public-iri@w3.org
Hello Björn,

I'm trying to understand the main point of your mail.

On 2010/10/05 13:43, Bjoern Hoehrmann wrote:

> http://tools.ietf.org/html/draft-hansen-iri-4395bis-irireg-00 notes
> "Previously, those who wish to describe resource identifiers that are
> useful as IRIs were encouraged to define the corresponding URI syntax,
> and note that the IRI usage follows the rules and transformations
> defined in [6]. This document changes that advice to encourage explicit
> definition of the scheme and allowable syntax elements within the larger
> character repertoire of IRIs, as defined by [7]."

> I am concerned that this would further draw a distinction between the
> characters that occur literally in an identifier and characters that
> are percent-encoded. I am not entirely sure in fact how to read RFC
> 3987 on this (it starts out saying it's just like URIs, except that
> there are more unreserved characters,

Yes.

> but then excludes private use
> code points from the set of unreserved characters).

Well, yes. I don't understand what point you are trying to make here. Even if the private use codepoints are excluded, there are far more characters that you can use than with US-ASCII.

> Let's say I make a scheme where the scheme-specific part can only be
> "ö". Since "ö" is an unreserved character, I might be inclined to say
>
>   def = "example:" %x00F6;
>
> but that would not work as "example:%c3%b6" is essentially defined as
> equivalent to "example:ö". The definition would have to account for a
> level of indirection at some point to remove percent-encoding, so I'd
> think you cannot quite distinguish between defining an URI scheme and
> an IRI scheme,

Is what you want to say here that any (IRI) scheme definition has to make sure that the syntax includes (UTF-8-based) percent-encoding fallbacks for all the non-ASCII characters that are in the syntax?

That is definitely important, because otherwise conversion of your "example:ö" IRI to the URI "example:%C3%B6" (upper-case for hex is preferred in URIs, so I'm using that) may not be allowed, and "example:%C3%B6" may also not be allowed as an IRI (e.g. in a Web page in Shift_JIS, where "ö" cannot be expressed directly).

Given that theory, your scheme would have to be defined as:

  def = "example:" (%x00F6 / "%C3%B6")

In that simple case, that wouldn't be too much trouble. But we can imagine some more realistic schemes where the grammar might blow up quickly. So I think we should also consider other solutions.

One solution would be to define the syntax only in terms of UCS characters (i.e. IRI), and specify that any percent-escaping of the allowed UCS characters is also allowed. This could be done on a per-scheme basis, or could be declared a general rule (currently, it's pretty much something that follows from RFC 3987, but I don't think it's explicit anywhere).

> so far the only difference could be in percent-encoded
> private use characters.

Are you saying that when you explicitly allow <pct-encoded>, and you also have <iunreserved>, then the only thing you add are private use codepoints? That's actually not completely true; you also add C0 and C1 controls and <reserved>.

> I'd rather remove that difference, and am not
> sure what the actual change there would be.

Do you mean you want to allow private use codepoints when you define a scheme such as:

  def = "example:" %x00F6;

Or under some other circumstances?

Sorry for so many questions (some of which might look silly to you); I'm just trying to make sure we understand each other.

Regards, Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
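
As an illustration of the "define in UCS characters, allow percent-escaping as a general rule" idea from the message above, here is a minimal Python sketch. The function names `iri_to_uri` and `matches_example_scheme` are hypothetical, the "example:" scheme is the toy scheme from the thread, and the conversion is deliberately simplified: it only percent-encodes non-ASCII characters as UTF-8 and skips the other steps RFC 3987 describes (such as normalization and host handling).

```python
from urllib.parse import quote, unquote


def iri_to_uri(iri: str) -> str:
    """Simplified IRI-to-URI mapping (assumption: non-ASCII only).

    Each non-ASCII character is replaced by the percent-encoding of its
    UTF-8 bytes; quote() emits upper-case hex digits. ASCII is untouched.
    """
    return "".join(ch if ord(ch) < 128 else quote(ch, safe="") for ch in iri)


def matches_example_scheme(identifier: str) -> bool:
    """Hypothetical check for the toy scheme whose only valid
    scheme-specific part is the single character U+00F6.

    The scheme is split off first, then percent-encoding is removed from
    the rest (the "level of indirection"), so the literal and escaped
    spellings are treated alike, regardless of hex digit case.
    """
    scheme, sep, rest = identifier.partition(":")
    if sep != ":" or scheme.lower() != "example":
        return False
    try:
        # errors="strict" rejects escapes that are not valid UTF-8.
        return unquote(rest, errors="strict") == "\u00f6"
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(iri_to_uri("example:\u00f6"))
    for candidate in ("example:\u00f6", "example:%C3%B6", "example:%c3%b6"):
        print(candidate, matches_example_scheme(candidate))
```

With this approach the grammar only needs to name U+00F6 once; the escaped forms are accepted by the decoding step rather than by an extra alternative in the ABNF.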
Received on Tuesday, 5 October 2010 09:16:15 UTC