- From: Thomas Roessler <tlr@w3.org>
- Date: Wed, 14 Nov 2007 12:43:24 +0100
- To: public-powderwg@w3.org, mdw@w3.org, parcher@icra.org
- Cc: pgrosso@ptc.com, ht@w3.org, duerst@it.aoyama.ac.jp
Skimming over [1], I notice that POWDER has its own set of URI/IRI canonicalization rules. These rules (and the remarks in 2.1.4 about representing URIs in XML content) relate to a number of work items elsewhere, including the LEIRI-related discussions going on in the XML Core Working Group, the material on canonicalization and normalization in RFCs 3986 and 3987, and ongoing work on RFC 3987bis.

As the canonicalization of URIs/IRIs (and the representation of URIs/IRIs in XML content) is a somewhat messy area, I would recommend that the POWDER WG solicit feedback from both XML Core and Martin Duerst, who is an editor of a forthcoming updated IRI specification.

The overall approach currently taken seems to aim at reconstructing some reasonably well-defined IRI representation from whatever [IU]RIs or [IU]RI references are found, minimizing the amount of %-encoding that is found in the process.

Some specific points will need more attention with this approach; there are probably more:

- The normative language in 2.1.3 speaks of "URIs"; however, these can apparently lack a scheme. That suggests that the author actually had (possibly relative) URI references in mind. If so, absolutization is a more general problem with a dependency on the base URI, and just prepending "http" is wrong; see section 5.2 of [2] for details. If you indeed consider URIs only, the part about adding a scheme is a no-op.

- 2.1.3 and 2.1.4 seem to be specifying a scheme-aware canonicalization for http (and maybe https) URIs; however, some aspects of the canonicalization are more generic. At the same time, the text doesn't actually say whether (or where) it is http-specific. It would probably be worth specifying (a) generic normalization rules (e.g., processing of %-encoded sequences), and (b) specific rules for certain schemes (e.g., processing of the authority for schemes where the authority includes a domain name and a port; appending an empty path, ...).

- 2.1.4 deals with processing of XML character entities. Since POWDER does not deal with XML documents as character streams, but rather with their information content, that seems misplaced. (In fact, if this kind of processing were applied outside an XML context, it might lead to erroneous changes.)

- It is not clear to me whether you want to consider a human-readable representation in the case of internationalized domain names occurring in the authority part of the URI, or whether you consider the ASCII representation in that case. This specifically affects the case-insensitive part of the host name comparison (only US-ASCII domain names are case-insensitive).

- "Percent encoded triples are converted into the characters they represent"; in 2.1.3, you give '/' as an example. Have a look at section 6.2.2.2 of RFC 3986. There are good reasons why that text suggests decoding of *unreserved* characters only.

1. http://www.w3.org/TR/2007/WD-powder-grouping-20071031/#canon
2. http://www.ietf.org/rfc/rfc3986.txt

Regards,
-- 
Thomas Roessler, W3C <tlr@w3.org>
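To make the points about scheme-less references and percent-decoding concrete, here is a minimal Python sketch. It is an illustration only and not part of the POWDER draft: the helper names decode_unreserved and resolve are hypothetical, the example.org URIs are made up, and urllib.parse.urljoin is used as an approximation of the reference resolution described in section 5.2 of RFC 3986.

```python
import re
from urllib.parse import urljoin

# Unreserved characters per RFC 3986, section 2.3:
# ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

_PCT_TRIPLE = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri: str) -> str:
    """Hypothetical helper: decode %XX only if it encodes an unreserved
    character; other triples are kept, with hex digits uppercased."""
    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        # Reserved (e.g. "/") or non-ASCII octet: leave it encoded.
        return "%" + match.group(1).upper()

    return _PCT_TRIPLE.sub(replace, uri)


def resolve(reference: str, base: str) -> str:
    """Hypothetical helper: resolve a (possibly relative) URI reference
    against a base URI instead of blindly prepending a scheme."""
    return urljoin(base, reference)


if __name__ == "__main__":
    # "%7e" is "~" (unreserved) and is decoded; "%2f" is "/" (reserved)
    # and stays encoded, so the path structure is not changed.
    print(decode_unreserved("http://example.org/a%2fb/%7euser"))

    # A reference without a scheme is resolved relative to the base.
    base = "http://example.org/dir/page"
    print(resolve("../other", base))
    print(resolve("//example.net/x", base))
    print(resolve("www.example.com/foo", base))
```

Note that "www.example.com/foo" is treated as a relative path rather than as an authority, which is why prepending "http" to a scheme-less string and resolving it as a URI reference can give different results.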
Received on Wednesday, 14 November 2007 11:43:48 UTC