Re: STD 66 questions (problems ?) from Frank Ellermann on 2005-07-26 (uri@w3.org from July 2005)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Tue, 26 Jul 2005 11:36:15 +0200
To: uri@w3.org
Message-ID: <42E6040F.559D@xyzzy.claranet.de>
Roy T. Fielding wrote:

> The "idea" is that, if you have an old specification that depends
> on those no-longer-used terms within their own grammar, then a reader
> should be able to use the terms in D.2 in combination with the
> others in the standard part of the specification to figure out
> what the grammar in the old spec means within the constraints of
> the current URI syntax.

Okay, that's what I thought.  So when I have an old spec. or an old
draft like 2368bis (mailto:), then I'd expect <uric> to be what it
always was, the complete set of all ASCII characters found in a URI,
including <reserved>:

>> 2396 URIC : ALNUM !   $ % & ' ( ) * + , - . / : ; = ? @     _ ~
>> 3986 URIC3: ALNUM ! # $ % & ' ( ) * + , - . / : ; = ? @ [ ] _ ~

> That would make prior grammars incorrect -- we would need to exclude
> "#[]" to make it work.  The characters "!'()*" could be added to D.2.

Here I'm lost again.  "[" and "]" are now valid URI characters; they're
used for <IP-literal> and IPv6.

It's less clear for "#".  The old RfCs had another strategy: fragments
were considered local business, not really belonging to the URI.

IMO the STD 66 approach is much better, but retrofitting "#" into old
grammars could be a problem.  But if I want to know what an old spec.
means under the new rules, where does it break for "[" or "]" used to
delimit an <IPv6address> ?

> Note that this only effects the interpretation of old grammars, not
> the current syntax.

Yes, some old 1738 schemes still wait to be updated.  No direct
problem with <uric> there, but its opposite, "unsafe", is important.

RfC 2396 mentions the ftp:, gopher:, mailto:, news:, and telnet:
schemes, which are indirectly affected by whatever <uric> is today:

 3986 D.2: ALNUM     $ % &         + , - . / : ; = ? @     _ ~
your idea: ALNUM !   $ % & ' ( ) * + , - . / : ; = ? @     _ ~
  my idea: ALNUM ! # $ % & ' ( ) * + , - . / : ; = ? @ [ ] _ ~
excl. "#": ALNUM !   $ % & ' ( ) * + , - . / : ; = ? @ [ ] _ ~
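
The four rows above can be compared mechanically.  The following is only
a sketch: the row strings are copied from this message, while the names
ROWS, charset() and added() are made up for the illustration and appear
in no RfC.

```python
import string

# ALNUM as used in the tables of this message: ASCII letters and digits.
ALNUM = set(string.ascii_letters + string.digits)

def charset(row: str) -> set:
    """Expand a row such as 'ALNUM ! $ %' into a set of characters."""
    chars = set()
    for token in row.split():
        chars |= ALNUM if token == "ALNUM" else {token}
    return chars

# Rows copied from the comparison table above (labels are hypothetical keys).
ROWS = {
    "3986 D.2":  charset("ALNUM     $ % &         + , - . / : ; = ? @     _ ~"),
    "your idea": charset("ALNUM !   $ % & ' ( ) * + , - . / : ; = ? @     _ ~"),
    "my idea":   charset("ALNUM ! # $ % & ' ( ) * + , - . / : ; = ? @ [ ] _ ~"),
    "excl. #":   charset("ALNUM !   $ % & ' ( ) * + , - . / : ; = ? @ [ ] _ ~"),
}

def added(base: str, extended: str) -> set:
    """Characters present in row `extended` but missing from row `base`."""
    return ROWS[extended] - ROWS[base]

if __name__ == "__main__":
    print(sorted(added("3986 D.2", "your idea")))
    print(sorted(added("your idea", "my idea")))
    print(sorted(added("excl. #", "my idea")))
```

Read this way, the proposals differ only in "!'()*" on one side and in
"#", "[", "]" on the other; the last row is "my idea" without "#".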

  [updated <mark>]
>> 1738 UNRESERVED1: ALNUM ! $ ' ( ) * + , - . _
>> 2396 UNRESERVED2: ALNUM !   ' ( ) *     - . _ ~
>> 1738 SAFE_EXTRA: ! $ ' ( ) * + , - . _
>> 2396 MARK      : !   ' ( ) *     - . _ ~

>> In other words <mark> is the same as <unreserved> excluding
>> <alphanum>.

>> 2396 UNRESERVED2: ALNUM ! ' ( ) * - . _ ~
>> 3986 UNRESERVED3: ALNUM           - . _ ~
>> 2396 MARK : ! ' ( ) * - . _ ~
>> 3986 MARK3: ! ' ( ) * - . _ ~

>> In 3986 D.2 it's the same old <mark>, no proper subset of
>> <unreserved>.  IMHO it should be only "-", ".", "_", "~".

> I would have to see the effect on grammars within specs dependent
> on 2396 and 1738.

Yes, it's not obvious.  OTOH 3986 moved "!", "'", "(", ")", and "*"
from <unreserved> to <reserved>.  The concept of <mark> or the older
<safe> plus <extra> was apparently: "No problem if you use <mark>
within a URI where needed.  It's not <reserved>, let alone <unwise>
or <delims> (formerly known as 'unsafe' together with CTL and SP)."

That results in two possible cases for old specs:

I:   Old spec. reserved one or more of "!'()*" for its own purposes.
     That would explain why 3986 added it to <reserved>, no problem.
II:  Old spec. didn't use one or more of <mark> for its purposes,
     also no problem, "!'()*" are only <sub-delims>.  They can be
     used freely outside of contexts where they would be reserved.

Is there any theoretical case left, where this interpretation could
break something in an old spec. ?
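
The move of "!'()*" can be checked with plain set arithmetic.  This is a
sketch under assumptions: the two <unreserved> rows are the ones quoted
in this message, <sub-delims> is taken from RfC 3986 section 2.2, and
the names moved and status_3986() are invented for the example.

```python
import string

ALNUM = set(string.ascii_letters + string.digits)

# <unreserved> rows as quoted in this message.
UNRESERVED_2396 = ALNUM | set("!'()*-._~")
UNRESERVED_3986 = ALNUM | set("-._~")

# <sub-delims> as defined in RfC 3986, section 2.2.
SUB_DELIMS_3986 = set("!$&'()*+,;=")

# Characters that were <unreserved> in 2396 but are not in 3986.
moved = UNRESERVED_2396 - UNRESERVED_3986

def status_3986(ch: str) -> str:
    """Classify one character under the 3986 rules (hypothetical helper)."""
    if ch in UNRESERVED_3986:
        return "unreserved"
    if ch in SUB_DELIMS_3986:
        return "sub-delims"
    return "other"

if __name__ == "__main__":
    for ch in sorted(moved):
        print(ch, status_3986(ch))
```

All five moved characters end up in <sub-delims>, which matches case II:
they are reserved only where a scheme actually uses them as delimiters.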

>> 2396 DELIM_UNWISE: " # % < > [ \ ] ^ ` { | }
>> 3986 NOURIC3     : "     < >   \   ^ ` { | }
[...]
> Those terms are not used by dependent specifications and therefore
> do not need equivalents in D.2.

If we can agree on an updated <uric> it would be obvious: anything
else is not <uric>.  For the purpose of D.2 that could be modulo "#".

Otherwise it's difficult to interpret this statement in chapter 2:

| A URI is composed from a limited set of characters consisting of
| digits, letters, and a few graphic symbols.

This can't be a very convoluted version of VCHAR, there must be at
least one VCHAR that's definitely not allowed.  My best guess is
"<", ">", "\", "^", "`", "{", "|", "}", and DQUOTE.

Unfortunately "\" and DQUOTE are not only a theoretical problem,
they can bite in local parts of mail addresses incl. Message-IDs.
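
The guess about excluded VCHARs can be tested by subtracting the 3986
character classes from VCHAR.  A rough sketch, assuming <unreserved>,
<gen-delims> and <sub-delims> from RfC 3986 sections 2.2 and 2.3 plus
"%" for <pct-encoded>; encode_local_part() is a hypothetical helper, and
encoding everything outside <unreserved> is a choice made for this
example only, not a rule taken from any mailto: spec.

```python
import string
from urllib.parse import quote

# Character classes from RfC 3986, sections 2.2 and 2.3.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")

# Every character that may appear literally somewhere in a URI;
# "%" is included because it introduces <pct-encoded>.
URI_CHARS = UNRESERVED | GEN_DELIMS | SUB_DELIMS | {"%"}

# VCHAR: the visible ASCII characters %x21-7E.
VCHAR = {chr(code) for code in range(0x21, 0x7F)}

# VCHARs that never appear literally in a 3986 URI.
never_literal = VCHAR - URI_CHARS

def encode_local_part(local_part: str) -> str:
    """Percent-encode everything outside <unreserved> (hypothetical helper)."""
    return quote(local_part, safe="")

if __name__ == "__main__":
    print("".join(sorted(never_literal)))
    print(encode_local_part('"a\\b"'))
```

Under these assumptions exactly nine VCHARs are left over, the same nine
as in the guess above, and "\" and DQUOTE have to travel as %5C and %22.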

                             Bye, Frank
Received on Tuesday, 26 July 2005 09:44:44 UTC