
Re: draft-fielding-uri-rfc2396bis-04.txt

From: Graham Klyne <GK@ninebynine.org>
Date: Tue, 17 Feb 2004 12:09:08 +0000
To: "Roy T. Fielding" <fielding@gbiv.com>, uri@w3.org

At 03:48 16/02/04 -0800, Roy T. Fielding wrote:
>I have just submitted draft 04, which can also be obtained via
>the issues list at
>    http://gbiv.com/protocols/uri/rev-2002/issues.html

Modified: 16 February 2004 11:36:15
Size: 167.42 KB (171437 bytes)

Mostly editorial/nits, but maybe more serious are:
   "Section 2, general (inconsistency/editorial?):"
   "Section 6 (concern):"


Section 1.2.2, para 4 (nit):
When an author creates a reference to such a resource, they do so with the 
intention that the reference be used in the future; what is being 
identified is not some specific result that was obtained in the past, but 
rather some characteristic that is expected to be true for future results.

It seems odd to make claims about an author's intent in this way.  Suggestion:
... do so with the effect that the reference can be used in the future ...


Section 2.1, para 2 (suggestion):

Would it be helpful to include here a forward reference to section 2.4?


Section 2.2, para 2 (editorial):

I'm afraid I found this paragraph (immediately following the sub-delims 
production) really hard to understand.

As far as I can tell after reading the document through, the following 
points are being made:
- only characters in the reserved set may be used with special meaning in 
- characters in the gen-delims have special meaning in all URI schemes.
- characters in sub-delims may have special meaning in some URI schemes.
- characters in sub-delims that do not have special meaning in a URI scheme 
may be used unescaped in component values in that URI scheme
- a generic URI parser must treat all characters in the reserved set as 
distinct from the pct-encoded version of the same.

I think you're trying to say some more, but I can't figure what it is.
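
As a rough illustration of the last point as I read it, here is a minimal Python sketch. The URIs are made-up examples, and urllib.parse is just Python's standard library, not anything the draft specifies. Splitting the path on "/" before percent-decoding keeps a literal "/" delimiter distinct from "%2F" carried as data:

```python
from urllib.parse import urlsplit, unquote

def path_segments(uri):
    """Split on the reserved '/' first, then percent-decode each segment."""
    raw_path = urlsplit(uri).path  # urlsplit leaves percent-escapes untouched
    return [unquote(segment) for segment in raw_path.split("/")]

# Hypothetical URIs: one uses '/' as a delimiter, the other carries '%2F' as data.
delimited = path_segments("http://example.org/a/b")
encoded = path_segments("http://example.org/a%2Fb")
```

Decoding before splitting would make the two URIs indistinguishable, which is the behaviour the last bullet rules out.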


Section 2, general (inconsistency/editorial?):

I'm not sure if there is an inconsistency between:
[[
This specification does not mandate the use of any particular character 
encoding scheme for mapping between URI characters and the octets used to 
store or transmit those characters.
]] -- (section 2 intro)

[[
For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and 
%61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or 
tilde (%7E) should not be created by URI producers and, when found in a 
URI, should be decoded to their corresponding unreserved character by URI 
]] -- (section 2.3)
... and other references to specific escape sequences through the document.
Particularly, section 3.2.2 says "such octets must represent characters 
encoded in the UTF-8 ...".
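
To make the section 2.3 rule concrete, here is a minimal Python sketch of the decoding it describes. The function name and the example URI are my own, and the sketch assumes the octet values are read as US-ASCII, which is exactly the assumption being questioned here:

```python
import re

# Unreserved characters as listed in the section 2.3 text quoted above.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def decode_unreserved(uri):
    """Decode %XX only when the octet, read as US-ASCII, is unreserved."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        # Leave reserved and non-ASCII escapes exactly as they were.
        return char if char in UNRESERVED else match.group(0)
    return re.sub(r"%([0-9A-Fa-f]{2})", replace, uri)
```

For example, decode_unreserved("http://example.org/%7Euser/a%2Fb%41") leaves "%2F" alone but turns "%7E" into "~" and "%41" into "A".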

Also in the intro to section 2, I think this refers just to interpretation 
of ABNF productions:
Therefore, the integer values must be mapped back to their corresponding 
characters via US-ASCII in order to complete the syntax rules.

I am concerned that on one hand the specification seems to say that it is 
agnostic with respect to the character encoding used, yet on the other hand 
it requires certain octet values expressed with %-encoding to be treated as 
specific characters.
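
A small Python sketch of the concern, assuming Python's "cp037" codec as a representative EBCDIC code page (the helper name is hypothetical). The same character yields a different percent-encoded octet depending on the character encoding in use:

```python
def percent_encode_all(text, encoding):
    """Percent-encode every octet of text under the given character encoding."""
    return "".join(f"%{octet:02X}" for octet in text.encode(encoding))

# The letter 'A' is octet 0x41 in US-ASCII but 0xC1 in this EBCDIC code page.
ascii_form = percent_encode_all("A", "ascii")
ebcdic_form = percent_encode_all("A", "cp037")
```

So a rule stated as "%41 is equivalent to A" only holds when octets are interpreted via US-ASCII.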

Suppose I have:
in a URI that is presented using EBCDIC encoding?

On my third pass through the text, I now *think* the document is saying 
that characters corresponding to the US-ASCII codepoints mentioned should 
not be presented in %-encoded form.

I think this area would usefully be clarified, maybe by adding something 
like this to section 2.1:
Percent encoding of characters is performed with respect to the character 
encoding in which the URI is presented.  In this specification, some 
special characters and percent encoding equivalences are described with 
respect to US-ASCII codepoints.  When a URI is presented in an encoding 
that is not a superset of US-ASCII, the percent encoding equivalences must 
be changed accordingly.

Further down, at the end of section 2.4, the document says that characters 
from the Unicode character set that are not in the unreserved set should be 
%-encoded via UTF-8 encoding.  But EBCDIC has characters in the Unicode 
character set (or indistinguishable from):  e.g. code 4A(hex) in EBCDIC is 
the same character as Unicode 00A2(hex) -- I think it's fair to expect 
these are not only the same glyph, but also convey the same semantics, 
though I defer to I18N experts.
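
The following Python sketch shows the two representations side by side, again assuming the "cp037" codec stands in for EBCDIC. The octet 4A(hex) decodes to the cent sign, whose UTF-8 percent-encoding is a two-octet sequence:

```python
from urllib.parse import quote

# Octet 0x4A in the cp037 EBCDIC code page is the cent sign, U+00A2.
cent_from_ebcdic = bytes([0x4A]).decode("cp037")

# quote() percent-encodes via UTF-8 by default; safe="" escapes everything
# outside the unreserved set.
utf8_escaped = quote(cent_from_ebcdic, safe="")
```

Here utf8_escaped is "%C2%A2", whereas percent-encoding the EBCDIC octet directly would give "%4A".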

I'm not quite sure what to suggest to avoid this confusion.  Maybe 
something like this in the section 2 introduction:
This specification does not mandate the use of any particular character 
encoding scheme for mapping between URI characters and the octets used to 
store or transmit those characters.  When a URI appears in a protocol 
element, the character encoding is defined by that protocol; absent such a 
definition, a URI is assumed to use the same character encoding as the 
surrounding text.  There is an expectation that Unicode-based formats will 
be common, and some special attention is paid to these.


Section 2.4, para 5 (editorial):

I'm not sure that the discussion of additional encoding, such as base64, 
really adds any value to the generic URI specification.  I'd be tempted to 
drop this paragraph, or maybe place it somewhere other than in this section 
discussing characters (e.g. in section 1.2.1)?


Section 3, para 2 (editorial):

I think it would be helpful (i.e. avoid requiring the reader to look 
forward for the "path" production) if the last sentence read something like 
In other words, if authority is present then the first segment of the path 
must be empty, and the path must start with a '/' character.


Section 3.2.1 and Appendix B (suggestion):

In section 3.2.1 the specification suggests:
Applications that render a URI for the sake of user feedback, such as in 
graphical hypertext browsing, should render userinfo in a way that is 
distinguished from the rest of a URI, when feasible.

Yet the regular expression in appendix B makes no attempt to separate the 
userinfo from the rest of the authority component.  Would it not be wise to 
encourage (by example) generic parsers to separate the userinfo in order 
that the above exhortation is more easily followed?

The current Haskell library URI module is implemented using an exact copy 
of the regular expression from RFC 2396.  Examples like this one influence 
how widely-used software components are actually implemented.
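
As a sketch of what such an example might look like, here is the RFC 2396 Appendix B regular expression in Python, followed by one extra step that separates userinfo from the rest of the authority. The function names and the returned dictionary are my own invention, not taken from the draft or from the Haskell library:

```python
import re

# Regular expression from RFC 2396, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_authority(authority):
    """Return (userinfo, hostport); userinfo is None when there is no '@'."""
    userinfo, at_sign, hostport = authority.rpartition("@")
    return (userinfo if at_sign else None, hostport)

def parse(uri):
    """Split a URI reference into components, with userinfo kept separate."""
    match = URI_RE.match(uri)  # every group is optional, so this always matches
    authority = match.group(4)
    if authority is None:
        userinfo, hostport = None, None
    else:
        userinfo, hostport = split_authority(authority)
    return {
        "scheme": match.group(2),
        "userinfo": userinfo,
        "hostport": hostport,
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }
```

With this, parse("ftp://user@example.org:21/pub?x#y") reports "user" as userinfo and "example.org:21" as the remainder, so a renderer has the pieces it needs to display them differently.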


Section 3.2.2, IPvFuture production (suggestion):

Would it not be more futureproof if this were expressed as:
IPvFuture  = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
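
For anyone wanting to experiment, the suggested production translates fairly directly into a regular expression. This is a Python sketch under my own naming; it accepts "V" as well as "v" because ABNF string literals are case-insensitive:

```python
import re

UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="

# "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
IPVFUTURE_RE = re.compile(rf"[vV][0-9A-Fa-f]+\.[{UNRESERVED}{SUB_DELIMS}:]+")

def is_ipvfuture(text):
    """True if the whole string matches the suggested IPvFuture production."""
    return IPVFUTURE_RE.fullmatch(text) is not None
```

Strings such as "v1.fe80" and "v1F.a:b" match; "v.abc" (no version digits) and "v1." (nothing after the dot) do not.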


Section 5.1, para 1 (editorial):

I think there's a subtlety here that isn't immediately obvious.  I think 
the text is correct, and maybe best left as is, but I'll mention the issue 

The document says:
Aside from fragment-only references (section 4.4), relative references are 
only usable when a base URI is known.

This raised in my mind the question "usable for what?".  On reflection, I see that when 
used for selecting a document "view", given the discussion of same-document 
references, the meaning of a bare fragment does not depend on the base URI, 
since that would be the base URI of the encapsulating entity, whatever that 
may be.

Is it worth trying to make this point more obvious, something like:
In the absence of a base URI embedded in content, interpretation of a 
fragment-only URI is implicitly with respect to the base URI of the 
encapsulating entity.  Whatever that may be, such a URI is a "same document 
reference" (section 4.4), and may be used as-is for retrieval purposes.
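
A quick way to see this is to resolve the same fragment-only reference against two unrelated base URIs. The sketch below uses Python's urllib.parse; the helper name and the base URIs are hypothetical:

```python
from urllib.parse import urljoin, urldefrag

def is_same_document(reference, base):
    """True if the resolved reference differs from the base only by fragment."""
    target = urldefrag(urljoin(base, reference)).url
    return target == urldefrag(base).url

# Hypothetical base URIs; the fragment-only reference behaves the same for both.
first = is_same_document("#sec", "http://example.org/doc")
second = is_same_document("#sec", "http://example.net/other")
not_same = is_same_document("other#sec", "http://example.org/doc")
```

Whatever the base turns out to be, "#sec" stays within the same document, while "other#sec" does not.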


Section 5.1.1, para 2 (nit):

I'm not aware that xml:base is part of or referenced by the application/xml 
media type specification.

Maybe, say:
The appropriate syntax, when available, is described by the data format 
specification associated with a media type.


Section 5.2

(skipped for now, but I hope to provide feedback based on implementation)


Section 6 (concern):

The document states "...comparison methods are designed to minimize false 
negatives while strictly avoiding false positives".

I'm concerned that some of the normalizations suggested might result in 
false equivalence.

The document suggests case-normalization of the authority component.  But is there 
anything to prevent introduction of a new form of authority that *is* case 
sensitive?  I don't see any such implication in the discussion of reg-name 
(section 3.2.2).
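
To show where a false positive could arise, here is a deliberately naive Python sketch that lowercases the whole authority before comparison. The function name and URIs are made up, and this is not the draft's algorithm, only an illustration of the kind of normalization in question:

```python
from urllib.parse import urlsplit, urlunsplit

def lowercase_authority(uri):
    """Lowercase the entire authority component, leaving the path untouched."""
    parts = urlsplit(uri)
    return urlunsplit(parts._replace(netloc=parts.netloc.lower()))

# Two hypothetical URIs that differ only in the case of the authority.
upper = lowercase_authority("http://EXAMPLE.org/Path")
lower = lowercase_authority("http://example.org/Path")
```

After this step the two compare equal, which is only safe if no form of authority is ever case-sensitive.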

I'm concerned about empty component normalization:
In RDF usage, the URIs:
would result in quite distinct resource identifiers, e.g. in:

Triples of the Data Model in N-Triples Format (Sub, Pred, Obj)

<http://www.w3.org/RDF/Validator/run/foo> <http://example.org/prop> "value" .

The original RDF/XML document

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://example.org/">
  <rdf:Description rdf:about="foo">
    <prop>value</prop>
  </rdf:Description>
</rdf:RDF>


Triples of the Data Model in N-Triples Format (Sub, Pred, Obj)

<http://www.w3.org/RDF/Validator/run/foo> <http://example.org#prop> "value" .

The original RDF/XML document

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://example.org#">
  <rdf:Description rdf:about="foo">
    <prop>value</prop>
  </rdf:Description>
</rdf:RDF>

(RDF triples generated by http://www.w3.org/RDF/Validator/)

These are distinct RDF graphs, even though the URIs are equivalent under the 
normalization rules given.
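
The reason is that RDF/XML forms a property URI by simple concatenation of the namespace URI and the local name. A minimal Python sketch, with a hypothetical helper name and the namespace values from the two documents above:

```python
def expand_qname(namespace, local_name):
    """Form a property URI by concatenating namespace URI and local name."""
    return namespace + local_name

# Namespace declarations taken from the two RDF/XML documents above.
slash_form = expand_qname("http://example.org/", "prop")
hash_form = expand_qname("http://example.org#", "prop")
```

The results, "http://example.org/prop" and "http://example.org#prop", match the predicates in the two N-Triples outputs and are clearly different identifiers.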


That's all.


Graham Klyne
Received on Tuesday, 17 February 2004 07:10:18 UTC
