[comment] URI/IRI canonicalization in powder-grouping from Thomas Roessler on 2007-11-14 (public-powderwg@w3.org from November 2007)

From: Thomas Roessler <tlr@w3.org>
Date: Wed, 14 Nov 2007 12:43:24 +0100
To: public-powderwg@w3.org, mdw@w3.org, parcher@icra.org
Cc: pgrosso@ptc.com, ht@w3.org, duerst@it.aoyama.ac.jp
Message-ID: <20071114114324.GA17789@raktajino.does-not-exist.org>

Skimming over [1], I notice that POWDER has its own set of URI/IRI
canonicalization rules.  These rules (and the remarks in 2.1.4 about
representing URIs in XML content) relate to a number of work items
elsewhere, including the LEIRI-related discussions going on in the
XML Core Working Group, the material on canonicalization and
normalization in RFCs 3986 and 3987, and ongoing work on RFC
3987bis.

As the canonicalization of URIs/IRIs (and representation of
URIs/IRIs in XML content) is a somewhat messy area, I would
recommend that the POWDER WG solicit feedback from both XML Core,
and from Martin Duerst, who is an editor of a forthcoming updated
IRI specification.

The overall approach currently taken seems to aim at reconstructing
some reasonably well-defined IRI representation from whatever
[IU]RIs or [IU]RI references are found, minimizing the amount of %
encoding that is found in the process.

Some specific points will need more attention with this approach;
there are probably more:

- The normative language in 2.1.3 speaks of "URIs"; however, these
  can apparently lack a scheme.

  That suggests that the author actually had (possibly relative) URI
  references in mind.  If so, absolutization is a more general
  problem with a dependency on the base URI, and just prepending
  "http" is wrong; see section 5.2 of [2] for details.

  If you indeed consider URIs only, the part about adding a scheme
  is a no-op.

- 2.1.3 and 2.1.4 seem to be specifying a scheme-aware
  canonicalization for http (and maybe https) URIs; however, some
  aspects of the canonicalization are more generic.  At the same
  time, the text doesn't actually say whether (or where) it is
  http-specific.

  It would probably be worth specifying (a) generic normalization
  rules (e.g., processing of %-encoded sequences), and (b) specific
  rules for certain schemes (e.g., processing of the authority for
  schemes where the authority includes a domain name and a port;
  appending an empty path, ...).

- 2.1.4 deals with processing of XML character entities.  Since
  POWDER does not deal with XML documents as character streams, but
  rather with their information content, that seems misplaced.

  (In fact, if this kind of processing were applied outside an
  XML context, it might lead to erroneous changes.)

- It is not clear to me whether you want to consider a
  human-readable representation in case of internationalized domain
  names occurring in the authority part of the URI, or whether you
  consider the ASCII representation in that case.

  This specifically affects the case-insensitive part of the host
  name comparison (only US-ASCII domain names are case-insensitive).

- "Percent encoded triples are converted into the characters they
  represent"; in 2.1.3, you give '/' as an example.  Have a look at
  section 6.2.2.2 of RFC 3986.  There are good reasons why that text
  suggests decoding of *unreserved* characters only.
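
To make two of the points above concrete, here is a minimal Python
sketch; it is an illustration only, not text proposed for the POWDER
draft.  The function names, the example.org/example.com URIs, and the
choice of Python's standard urllib.parse.urljoin as the resolver are
all assumptions of this example.  The first function resolves a
reference against a base URI instead of prepending "http"; the second
decodes only those %-encoded triples that represent unreserved
characters, and merely uppercases the hex digits of the rest.

```python
import re
from urllib.parse import urljoin

# Unreserved characters as listed in RFC 3986, section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

_PCT_TRIPLE = re.compile(r"%([0-9A-Fa-f]{2})")


def resolve_reference(base: str, reference: str) -> str:
    """Resolve a (possibly relative) URI reference against a base URI.

    Hypothetical helper: it simply delegates to urljoin, which follows
    the reference-resolution approach of RFC 3986, section 5.2.
    """
    return urljoin(base, reference)


def normalize_percent_encoding(uri: str) -> str:
    """Decode only %-triples that encode unreserved characters.

    All other triples are kept, with their hex digits uppercased, so
    that an encoded delimiter such as %2F is never turned into '/'.
    """
    def _replace(match: "re.Match[str]") -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()

    return _PCT_TRIPLE.sub(_replace, uri)


if __name__ == "__main__":
    base = "https://example.org/x/"
    # A scheme-less string is a relative path, not an http URI.
    print(resolve_reference(base, "www.example.com/foo"))
    print(resolve_reference("http://example.org/a/b/c", "../d"))
    print(normalize_percent_encoding("http://example.org/a%2fb/%7Euser"))
```

Note that the scheme-less string "www.example.com/foo" resolves to a
path below the base URI, which is a different result from the one
obtained by prepending "http://"; and "%2f" survives normalization as
"%2F", so the path keeps its original segment structure.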

1. http://www.w3.org/TR/2007/WD-powder-grouping-20071031/#canon
2. http://www.ietf.org/rfc/rfc3986.txt

Regards,
-- 
Thomas Roessler, W3C  <tlr@w3.org>

Received on Wednesday, 14 November 2007 11:43:48 UTC