RE: URI schemes and IRI deployment (issue schemes-iri-38) from Martin Duerst on 2004-09-16 (public-iri@w3.org from September 2004)

From: Martin Duerst <duerst@w3.org>
Date: Thu, 16 Sep 2004 16:01:07 +0900
To: "Williams, Stuart" <skw@hp.com>
Cc: public-iri@w3.org, Ted Hardie <hardie@qualcomm.com>
Message-Id: <4.2.0.58.J.20040916153442.05490c20@localhost>

Hello Stuart,

Sorry for the delay in responding to your mail.

At 10:49 04/08/19 +0100, Williams, Stuart wrote:

>Hello Martin,
>
> > -----Original Message-----
> > From: public-iri-request@w3.org
> > [mailto:public-iri-request@w3.org] On Behalf Of Martin Duerst
> > Sent: 18 August 2004 08:03
> > To: Williams, Stuart
> > Cc: public-iri@w3.org; Ted Hardie
> > Subject: RE: URI schemes and IRI deployment (issue schemes-iri-38)
> >
> >
>
><snip/>

> > Okay, it looks like I wasn't precise enough. Let me try a
> > proposal for rewording, for the middle sentence in the
> > paragraph above:
> >
> > "The main case where upgrading a scheme definition makes
> > sense is when a scheme definition is strictly limited to the
> > use of US-ASCII characters with no provision to include
> > non-ASCII characters/octets via percent-encoding, or if a
> > scheme definition currently uses highly scheme-specific
> > provisions for the encoding of non-ASCII characters."
> >
> > Would that be better? Would the changes below still be necessary?
> > I wouldn't want to replace the above with your text below,
> > because your text below says nothing about schemes that may
> > or may not have to be upgraded.
>
>Hmmm... if you were to include the reworded para below (I've agreed to the
>rewording - fewer  'generally's) I think you could simply delete this
>paragraph.

I have thought about that. I think the current paragraph,
talking about upgrades, is valuable in its own right, although
it is not the issue you have raised. So I'll leave that in.


>On the surface it is ok, but if I were to say, think of upgrading
>an existing scheme and saying that going forward, %-encoded characters
>should be interpreted as UTF-8 I find myself wondering about backward
>compatibility issues, where %-encoding may have been used in identifiers
>without that intended interpretation. I'm not at all sure how possible it is
>to 'upgrade' any URI scheme.

If %-encoding was used with fixed character semantics different
from UTF-8 (let's take iso-8859-1 as an example), then you are right
(I don't know of such a scheme, but that doesn't mean that one doesn't
exist). In practice, it may still be possible to add UTF-8 semantics
for newly created URIs, because there are very good heuristics for
detecting UTF-8.

Also, if %-encoding was used without any defined character semantics
(typical example: HTTP), then it would be impossible to force UTF-8
character semantics on %-encoding. Again, in practice, a scheme definition
may be updated to say something like 'if it looks like UTF-8, assume
it's UTF-8'.
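
To make the 'if it looks like UTF-8, assume it's UTF-8' idea concrete,
here is a minimal Python sketch. The function name and the iso-8859-1
fallback are assumptions chosen for illustration; no scheme definition
prescribes them.

```python
from typing import Tuple
from urllib.parse import unquote_to_bytes


def guess_percent_decoded_text(component: str,
                               fallback: str = "iso-8859-1") -> Tuple[str, str]:
    """Decode a %-encoded URI component, preferring UTF-8.

    Hypothetical helper: returns (text, encoding actually used).
    """
    # Turn %XX escapes back into raw octets.
    raw = unquote_to_bytes(component)
    try:
        # Strict decoding: octets that are not well-formed UTF-8 raise an error.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumed legacy interpretation; a real scheme would have to name its own.
        return raw.decode(fallback), fallback
```

The heuristic works because octet sequences produced by legacy encodings
are rarely well-formed UTF-8, so a successful strict decode is strong
evidence. It can still misfire on short strings, which is why a scheme
update along these lines would need to be judged carefully.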

Anyway, that's why the text in the draft is very careful to limit this
to the case where a scheme (or a part thereof) does not allow %-encoding,
or uses other conventions for encoding non-ASCII characters.
In these cases, %-encoding is essentially added as new syntax to the
scheme. The benefits of extending the syntax of a scheme have to
be judged carefully, but it's not something that is a priori
impossible.



> > "URI schemes can impose restrictions on the syntax of
> > scheme-specific URIs, i.e. URIs that are admissible under the
> > generic URI syntax [RFCYYYY] may not be admissible due to
> > narrower syntactic constraints imposed by a URI scheme
> > specification. URI scheme definitions cannot broaden the
> > syntactic restrictions of the generic URI syntax, otherwise
> > it would be possible to generate URIs that satisfied the
> > scheme specific syntactic constraints without satisfying the
> > syntactic constraints of the generic URI syntax. However,
> > additional syntactic constraints imposed by URI scheme
> > specifications are *indirectly* applicable to IRI since the
> > corresponding URI resulting from the mapping defined in
> > Section 3.1 MUST be a valid URI under the syntactic
> > restrictions of generic URI syntax and any narrower
> > restrictions imposed by the corresponding URI scheme
> > specification."
>
>Inclusion of this paragraph, as reworded above, would address my concern.

I have included this paragraph. I think this is material
that should end up in the 'guidelines for new URI schemes'
or whatever it will be called, and once it ends up there,
we may be able to remove it from here, but for the moment,
it doesn't hurt.
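
As a rough sketch of what "indirectly applicable" means in that
paragraph: map the IRI to a URI first, then check the result against
the scheme's own rules. The Python below simplifies the Section 3.1
mapping (it only percent-encodes the UTF-8 octets of non-ASCII
characters and ignores any special handling of host names). The
"example:" scheme and its syntax rule are hypothetical, invented
purely for illustration.

```python
import re


def iri_to_uri(iri: str) -> str:
    """Simplified IRI-to-URI mapping: %-encode the UTF-8 octets of each
    non-ASCII character and leave ASCII characters untouched."""
    return "".join(
        ch if ord(ch) < 0x80
        else "".join(f"%{octet:02X}" for octet in ch.encode("utf-8"))
        for ch in iri
    )


# Hypothetical scheme-specific restriction: after "example:", only
# lowercase ASCII letters, digits, and %XX escapes are allowed.
HYPOTHETICAL_SCHEME_RULE = re.compile(r"example:(?:[a-z0-9]|%[0-9A-F]{2})+")


def iri_allowed_by_scheme(iri: str) -> bool:
    """An IRI is acceptable only if its mapped URI satisfies the scheme rule."""
    return HYPOTHETICAL_SCHEME_RULE.fullmatch(iri_to_uri(iri)) is not None
```

Under these assumptions, "example:r\u00e9sum\u00e9" maps to
"example:r%C3%A9sum%C3%A9" and is accepted, while "example:R\u00e9sum\u00e9"
is rejected because the scheme's narrower rule forbids the uppercase "R",
even though the generic syntax allows it.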


>Well... I think it needs to be clear to readers of the IRI spec that no
>magic happens that automatically enables them to create schemes that allow
>the *direct* inclusion of a wider range of characters in scheme definitions.
>I made my initial comment after a discussion with Tim Kindberg wrt to the
>tag: URI scheme in draft. He was confused about what he could/could not do
>wrt to internationalisation on defining that scheme. For his purposes he
>would (I believe) like to be able to allow the direct use of
>internationalized characters, and the %encoding. Passed around as IRI Tim
>would get what he wants (provided he makes appropriate statements/references
>about %encoding and UTF-8).

I agree. Your pointer to Tim's draft helped me a lot in understanding
what you were looking for.

I have tentatively closed this issue. Please see
http://www.w3.org/International/iri-edit/diff-duerst-iri-last-draft.html
for the overall changes, and tell me as soon as possible whether you
are okay with them.

Regards,     Martin.
Received on Thursday, 16 September 2004 07:01:23 UTC