- From: Graham Klyne <GK@ninebynine.org>
- Date: Tue, 17 Feb 2004 12:09:08 +0000
- To: "Roy T. Fielding" <fielding@gbiv.com>, uri@w3.org
At 03:48 16/02/04 -0800, Roy T. Fielding wrote:
>I have just submitted draft 04, which can also be obtained via
>the issues list at
>
>   http://gbiv.com/protocols/uri/rev-2002/issues.html

Reviewing:
   http://gbiv.com/protocols/uri/rev-2002/rfc2396bis.html
   Modified: 16 February 2004 11:36:15
   Size: 167.42 KB (171437 bytes)

Mostly editorial/nits, but maybe more serious are:
   "Section 2, general (inconsistency/editorial?):"
   "Section 6 (concern):"

...

Section 1.2.2, para 4 (nit):

[[
When an author creates a reference to such a resource, they do so with the intention that the reference be used in the future; what is being identified is not some specific result that was obtained in the past, but rather some characteristic that is expected to be true for future results.
]]

It seems odd to make claims about an author's intent in this way. Suggestion:

[[
... do so with the effect that the reference can be used in the future ...
]]

...

Section 2.1, para 2 (suggestion):

Would it be helpful to include here a forward reference to section 2.4?

...

Section 2.2, para 2 (editorial):

I'm afraid I found this paragraph (immediately following the sub-delims production) really hard to understand. As far as I can tell after reading the document through, the following points are being made:
- only characters in the reserved set may be used with special meaning in URIs.
- characters in the gen-delims have special meaning in all URI schemes.
- characters in sub-delims may have special meaning in some URI schemes.
- characters in sub-delims that do not have special meaning in a URI scheme may be used unescaped in component values in that URI scheme
- a generic URI parser must treat all characters in the reserved set as distinct from the pct-encoded version of the same.

I think you're trying to say some more, but I can't figure what it is.

...

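As an illustration of the Section 2.2 points listed above, here is a minimal Python sketch of one possible reading. The gen-delims and sub-delims sets are those of the later RFC 3986 and are only assumed to match draft 04; the names `classify` and `split_path_segments` are hypothetical and do not come from the draft.

```python
from urllib.parse import unquote

# Assumption: sets as in the later RFC 3986; draft 04 may differ.
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")


def classify(ch):
    """Classify a single character as a gen-delim, a sub-delim, or other."""
    if ch in GEN_DELIMS:
        return "gen-delim"
    if ch in SUB_DELIMS:
        return "sub-delim"
    return "other"


def split_path_segments(path):
    """Split on the literal "/" delimiter first, then decode each segment.

    Decoding before splitting would turn %2F into a delimiter, which is
    the false equivalence that the last point above rules out.
    """
    return [unquote(segment) for segment in path.split("/")]


if __name__ == "__main__":
    print(classify("/"), classify("="), classify("a"))
    print(split_path_segments("a%2Fb/c"))
```

Splitting before decoding keeps the percent-encoded "/" as data inside the first segment, so "a%2Fb/c" has two segments rather than three.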
Section 2, general (inconsistency/editorial?):

I'm not sure if there is an inconsistency between:

[[
This specification does not mandate the use of any particular character encoding scheme for mapping between URI characters and the octets used to store or transmit those characters.
]]
-- (section 2 intro)

and:

[[
For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved character by URI normalizers.
]]
-- (section 2.3)

... and other references to specific escape sequences through the document. Particularly, section 3.2.2 says "such octets must represent characters encoded in the UTF-8 ...".

Also in the intro to section 2, I think this refers just to interpretation of ABNF productions:

[[
Therefore, the integer values must be mapped back to their corresponding characters via US-ASCII in order to complete the syntax rules.
]]

I am concerned that on one hand the specification seems to say that it is agnostic with respect to the character encoding used, yet on the other hand it requires certain octet values expressed with %-encoding to be treated as specific characters. Suppose I have:

   foo://some.example.com/%7Epath

in a URI that is presented using EBCDIC encoding?

On my third pass through the text, I now *think* the document is saying that characters corresponding to the ASCII codepoints mentioned should not be presented in %-encoded form. I think this area would usefully be clarified, maybe by adding something like this to section 2.1:

[[
Percent encoding of characters is performed with respect to the character encoding in which the URI is presented. In this specification, some special characters and percent encoding equivalences are described with respect to US-ASCII codepoints. When a URI is presented in an encoding that is not a superset of US-ASCII, the percent encoding equivalences must be changed accordingly.
]]

Further down, at the end of section 2.4, the document says that characters from the Unicode character set that are not in the unreserved set should be %-encoded via UTF-8 encoding. But EBCDIC has characters that are in the Unicode character set (or indistinguishable from characters in it): e.g. code 4A(hex) in EBCDIC is the same character as Unicode 00A2(hex) -- I think it's fair to expect these are not only the same glyph, but also convey the same semantics, though I defer to I18N experts.

I'm not quite sure what to suggest to avoid this confusion. Maybe something like this in the section 2 introduction:

[[
This specification does not mandate the use of any particular character encoding scheme for mapping between URI characters and the octets used to store or transmit those characters. When a URI appears in a protocol element, the character encoding is defined by that protocol; absent such a definition, a URI is assumed to use the same character encoding as the surrounding text. There is an expectation that Unicode-based formats will be common, and some special attention is paid to these.
]]

...

Section 2.4, para 5 (editorial):

I'm not sure that the discussion of additional encoding, such as base64, really adds any value to the generic URI specification. I'd be tempted to drop this paragraph, or maybe place it somewhere other than in this section discussing characters (e.g. in section 1.2.1)?

...

Section 3, para 2 (editorial):

I think it would be helpful (i.e. avoid requiring the reader to look forward for the "path" production) if the last sentence read something like this:

[[
In other words, if authority is present then the first segment of the path must be empty, and the path must start with a '/' character.
]]

...

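Returning to the Section 2 concern about foo://some.example.com/%7Epath, the following Python sketch contrasts the two readings. It is an illustration under stated assumptions, not the draft's algorithm: `decode_unreserved` and its `charset` parameter are hypothetical, and Python's `cp037` codec stands in for "EBCDIC" (one EBCDIC code page among several).

```python
import re
import string

# Unreserved set as quoted from section 2.3: ALPHA, DIGIT, "-", ".", "_", "~".
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

_PCT_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri, charset="ascii"):
    """Decode %XX triplets whose octet is an unreserved character.

    `charset` selects how the octet is mapped to a character. The default
    follows the US-ASCII reading; other values model the alternative
    reading in which equivalences follow the URI's own character encoding.
    """
    def replace(match):
        octet = bytes([int(match.group(1), 16)])
        try:
            ch = octet.decode(charset)
        except UnicodeDecodeError:
            return match.group(0)  # octet has no character in this charset
        return ch if ch in UNRESERVED else match.group(0)

    return _PCT_TRIPLET.sub(replace, uri)


if __name__ == "__main__":
    uri = "foo://some.example.com/%7Epath"
    print(decode_unreserved(uri))            # US-ASCII reading
    print(decode_unreserved(uri, "cp037"))   # EBCDIC (cp037) reading
```

Under the US-ASCII reading, %7E decodes to a tilde. Under the cp037 reading, octet 7E(hex) is "=", which is not unreserved, so the triplet is left alone, and it is %A1 that would decode to a tilde instead. The two readings therefore disagree about which URIs are equivalent.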
Section 3.2.1 and Appendix B (suggestion):

In section 3.2.1 the specification suggests:

[[
Applications that render a URI for the sake of user feedback, such as in graphical hypertext browsing, should render userinfo in a way that is distinguished from the rest of a URI, when feasible.
]]

Yet the regular expression in appendix B makes no attempt to separate the userinfo from the rest of the authority component. Would it not be wise to encourage (by example) generic parsers to separate the userinfo in order that the above exhortation is more easily followed?

<background>
The current Haskell library URI module is implemented using an exact copy of the Regexp from RFC2396. This example has impact on the ways that widely-used software components are actually implemented.
</background>

...

Section 3.2.2, IPvFuture production (suggestion):

Would it not be more futureproof if this were expressed as:

[[
IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
]]

...

Section 5.1, para 1 (editorial):

I think there's a subtlety here that isn't immediately obvious. I think the text is correct, and maybe best left as is, but I'll mention the issue anyway.

The document says:

[[
Aside from fragment-only references (section 4.4), relative references are only usable when a base URI is known.
]]

Which raised in my mind "useful for what?". On reflection, I see that when used for selecting a document "view", given the discussion of same-document references, the meaning of a bare fragment does not depend on the base URI, since that would be the base URI of the encapsulating entity, whatever that may be.

Is it worth trying to make this point more obvious, something like:

[[
In the absence of a base URI embedded in content, interpretation of a fragment-only URI is implicitly with respect to the base URI of the encapsulating entity. Whatever that may be, such a URI is a "same document reference" (section 4.4), and may be used as-is for retrieval purposes.
]]

...

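To make the Section 3.2.1 and Appendix B suggestion concrete, here is a rough Python sketch of a generic parser that separates userinfo. `URI_RE` is the regular expression from RFC 2396 Appendix B; `AUTHORITY_RE` and `split_uri` are hypothetical additions for illustration and are not proposed text for the draft.

```python
import re

# Regular expression from RFC 2396, Appendix B.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

# Hypothetical second step: split authority into userinfo, host and port.
# The host alternative in brackets keeps an IP literal such as [::1] whole.
AUTHORITY_RE = re.compile(r"^(?:([^@]*)@)?(\[[^\]]*\]|[^:]*)(?::(.*))?$")


def split_uri(uri):
    """Split a URI into components, with userinfo separated from the host."""
    match = URI_RE.match(uri)
    scheme, authority, path, query, fragment = match.group(2, 4, 5, 7, 9)
    userinfo = host = port = None
    if authority is not None:
        userinfo, host, port = AUTHORITY_RE.match(authority).groups()
    return {
        "scheme": scheme,
        "userinfo": userinfo,
        "host": host,
        "port": port,
        "path": path,
        "query": query,
        "fragment": fragment,
    }


if __name__ == "__main__":
    print(split_uri("ftp://user:secret@ftp.example.org:21/pub?x#y"))
```

With userinfo available as its own field, a renderer can display it distinctly from the host, as section 3.2.1 asks.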
Section 5.1.1, para 2 (nit):

I'm not aware that xml:base is part of or referenced by the application/xml media type specification. Maybe, say:

[[
The appropriate syntax, when available, is described by the data format specification associated with a media type.
]]

...

Section 5.2

(skipped for now, but I hope to provide feedback based on implementation)

...

Section 6 (concern):

The document states "...comparison methods are designed to minimize false negatives while strictly avoiding false positives". I'm concerned that some of the normalizations suggested might result in false equivalence.

Section 6.2.2.1: Suggests case-normalization of the authority component. But is there anything to prevent introduction of a new form of authority that *is* case sensitive? I don't see any such implication in the discussion of reg-name (section 3.2.2).

Section 6.2.2.3: I'm concerned about empty component normalization: In RDF usage, the URIs:

   http://example.org/
and
   http://example.org#

would result in quite distinct resource identifiers, e.g. in:

[[
Triples of the Data Model in N-Triples Format (Sub, Pred, Obj)

   <http://www.w3.org/RDF/Validator/run/foo> <http://example.org/prop> "value" .

The original RDF/XML document

   <?xml version="1.0"?>
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns="http://example.org/">
     <rdf:Description rdf:about="foo">
       <prop>value</prop>
     </rdf:Description>
   </rdf:RDF>
]]

and

[[
Triples of the Data Model in N-Triples Format (Sub, Pred, Obj)

   <http://www.w3.org/RDF/Validator/run/foo> <http://example.org#prop> "value" .

The original RDF/XML document

   <?xml version="1.0"?>
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns="http://example.org#">
     <rdf:Description rdf:about="foo">
       <prop>value</prop>
     </rdf:Description>
   </rdf:RDF>
]]

(RDF triples generated by http://www.w3.org/RDF/Validator/)

These are distinct RDF graphs, even though the URIs are equivalent under the normalization rules given.

...

That's all.

#g

------------
Graham Klyne
For email:
http://www.ninebynine.org/#Contact

Received on Tuesday, 17 February 2004 07:10:18 UTC