- From: Frank Ellermann <nobody@xyzzy.claranet.de>
- Date: Tue, 26 Jul 2005 11:36:15 +0200
- To: uri@w3.org
Roy T. Fielding wrote:

> The "idea" is that, if you have an old specification that depends
> on those no-longer-used terms within their own grammar, then a reader
> should be able to use the terms in D.2 in combination with the
> others in the standard part of the specification to figure out
> what the grammar in the old spec means within the constraints of
> the current URI syntax.

Okay, that's what I thought.  So when I have an old spec. or an
old draft like 2368bis (mailto:), then I'd expect <uric> to be
what it always was, the complete set of all ASCII characters
found in a URI, including <reserved>:

>> 2396 URIC : ALNUM ! $ % & ' ( ) * + , - . / : ; = ? @ _ ~
>> 3986 URIC3: ALNUM ! # $ % & ' ( ) * + , - . / : ; = ? @ [ ] _ ~

> That would make prior grammars incorrect -- we would need to exclude
> "#[]" to make it work.  The characters "!'()*" could be added to D.2.

Here I'm lost again, "[" and "]" are now valid URI characters,
they're used for <IP-literal> and IPv6.  It's less clear for "#",
the old RfCs had another strategy, fragments were considered as
local business not really belonging to the URI.  IMO the STD 66
approach is much better, but retrofitting "#" into old grammars
could be a problem.

But if I want to know what an old spec. means under the new
rules, where does it break for "[" or "]" used to delimit an
<IPv6address>?

> Note that this only effects the interpretation of old grammars, not
> the current syntax.

Yes, some old 1738 schemes still wait to be updated.  No direct
problem with <uric> there, but the opposite "unsafe" is important.
RfC 2396 mentions ftp:, gopher:, mailto:, news:, and telnet:,
these schemes are indirectly affected by whatever <uric> is today:

 3986 D.2:  ALNUM     $ % &         + , - . / : ; = ? @     _ ~
 your idea: ALNUM !   $ % & ' ( ) * + , - . / : ; = ? @     _ ~
 my idea:   ALNUM ! # $ % & ' ( ) * + , - . / : ; = ? @ [ ] _ ~
 excl. "#": ALNUM !   $ % & ' ( ) * + , - . / : ; = ? @ [ ] _ ~

[updated <mark>]

>> 1738 UNRESERVED1: ALNUM ! $ ' ( ) * + , - . _
>> 2396 UNRESERVED2: ALNUM !   ' ( ) *     - . _ ~
>> 1738 SAFE_EXTRA:        ! $ ' ( ) * + , - . _
>> 2396 MARK       :       !   ' ( ) *     - . _ ~

>> In other words <mark> is the same as <unreserved> excluding
>> <alphanum>.

>> 2396 UNRESERVED2: ALNUM ! ' ( ) * - . _ ~
>> 3986 UNRESERVED3: ALNUM           - . _ ~
>> 2396 MARK       :       ! ' ( ) * - . _ ~
>> 3986 MARK3:             ! ' ( ) * - . _ ~

>> In 3986 D.2 it's the same old <mark>, no proper subset of
>> <unreserved>.  IMHO it should be only "-", ".", "_", "~".

> I would have to see the effect on grammars within specs dependent
> on 2396 and 1738.

Yes, it's not obvious.  OTOH 3986 moved "!", "'", "(", ")", and
"*" from <unreserved> to <reserved>.  The concept of <mark> or
the older <safe> plus <extra> was apparently:

  "No problem if you use <mark> within a URI where needed.  It's
   not <reserved>, let alone <unwise> or <delims> (formerly known
   as 'unsafe' together with CTL and SP)."

That results in two possible cases for old spec.s:

 I:  Old spec. reserved one or more of "!'()*" for its own
     purposes.  That would explain why 3986 added it to
     <reserved>, no problem.

 II: Old spec. didn't use one or more of <mark> for its purposes,
     also no problem, "!'()*" are only <sub-delims>.  They can be
     used freely outside of contexts where they would be reserved.

Is there any theoretical case left, where this interpretation
could break something in an old spec.?

>> 2396 DELIM_UNWISE: " # % < > [ \ ] ^ ` { | }
>> 3986 NOURIC3     : "     < >   \   ^ ` { | }
[...]
> Those terms are not used by dependent specifications and therefore
> do not need equivalents in D.2.

If we can agree on an updated <uric> it would be obvious, anything
else is no <uric>.  For the purpose of D.2 that could be modulo
"#".  Otherwise it's difficult to interpret this statement in
chapter 2:

| A URI is composed from a limited set of characters consisting of
| digits, letters, and a few graphic symbols.

This can't be a very convoluted version of VCHAR, there must be
at least one VCHAR that's definitely not allowed.  My best guesses
are "<", ">", "\", "^", "`", "{", "|", "}", and DQUOTE.

Unfortunately "\" and DQUOTE are not only a theoretical problem,
they can bite in local parts of mail addresses incl. Message-IDs.

                       Bye, Frank
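
The set comparisons in this message can be checked mechanically. The following is only a minimal Python sketch, assuming the character lists quoted in the message are transcribed correctly and that VCHAR means the printable ASCII range %x21-7E. The names `charset`, `show`, `URIC_2396`, `URIC_D2`, `URIC_3986` and `VCHAR` are illustrative and do not come from any RFC.

```python
import string

# ALNUM as used in the tables: ASCII letters and digits.
ALNUM = set(string.ascii_letters + string.digits)


def charset(symbols):
    """Return ALNUM plus the space-separated symbols from a table row."""
    return ALNUM | set(symbols.split())


def show(chars):
    """Render a set of characters as a string in ASCII order."""
    return "".join(sorted(chars))


# Rows copied from the message (assumed to be transcribed correctly).
URIC_2396 = charset("! $ % & ' ( ) * + , - . / : ; = ? @ _ ~")
URIC_D2 = charset("$ % & + , - . / : ; = ? @ _ ~")
URIC_3986 = charset("! # $ % & ' ( ) * + , - . / : ; = ? @ [ ] _ ~")

# Assumption: VCHAR is the visible ASCII range 0x21..0x7E.
VCHAR = {chr(c) for c in range(0x21, 0x7F)}

# Characters a 3986-style <uric> has that the 2396 <uric> lacks.
added_since_2396 = show(URIC_3986 - URIC_2396)

# Characters the 2396 <uric> has that the D.2 definition drops.
missing_from_d2 = show(URIC_2396 - URIC_D2)

# Visible characters that are in no <uric> variant at all.
never_allowed = show(VCHAR - URIC_3986)

print("added since 2396:", added_since_2396)
print("missing from D.2:", missing_from_d2)
print("not in any uric: ", never_allowed)
```

Under these assumptions the first difference is the "#[]" that Roy wants to exclude, the second is the "!'()*" he proposes adding to D.2, and the third is the nine characters guessed at the end of the message.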
Received on Tuesday, 26 July 2005 09:44:44 UTC