Comments on the WD - A proposed alternative from Arjun Ray on 2000-02-21 (www-xml-canonicalization-comments@w3.org from February 2000)

From: Arjun Ray <aray@q2.net>
Date: Sun, 20 Feb 2000 19:37:18 -0500 (EST)
To: www-xml-canonicalization-comments@w3.org
Message-ID: <Pine.LNX.4.10.10002201907510.1475-100000@mail.q2.net>

The introduction says:

: This Canonical XML specification aims to introduce a notion of
: equivalence between XML documents which can be tested at the
: syntactic level and, in particular, by byte-for-byte comparison. In
: the syntax it describes, logically equivalent documents are
: byte-for-byte identical.

While the stated goal (determining byte-for-byte congruence) is met by
the syntax of the WD, I have two concerns:

(1) Line-feed (#xA) characters are introduced in regions that are
generally understood as data content - after the PIC of processing
instructions.  Does this not mean that the parse of the canonical form
could be different from that of the original, in terms of whitespace
reporting to the application?  

(2) An opportunity seems to have been forgone to make other kinds of
comparison techniques easy to exploit.  I have the UN*X 'diff' command
in mind, specifically.  It works with the present format, but not
necessarily at an easily used granularity - mainly because more than
one information item can occur on the same line.

I believe a line-oriented approach to canonicalizing the *markup* of
the document offers just as many advantages as the current proposal,
eliminates the factitious line-feeds after PIs, and offers "low-tech"
benefits to the eponymous DPH and his harried brethren.

The alternative retains from the current proposal all rules regarding

  1.  Whitespace normalization in "informative" data.
  2.  Character escaping.
  3.  Namespace renaming and propagation to subelements, etc.
  4.  Lexicographic ordering of attributes.

(and any others I missed:))

The difference is in how tags and PIs are represented.  Specifically

   1.  These are immediately followed by a newline:
         a.  The generic identifier of a start-tag.
         b.  The generic identifier of an end-tag.
         c.  The target of a PI.

   2.   Each attribute specification is on a separate line (i.e.
        ends with a #xA.)

   3.   These all start on a new line:
         a.   The '>' or '/>' of a start-tag (as a consequence of
              Rules 1 and 2).
         b.   The '>' of an end-tag (from Rule 1).
         c.   The '?>' terminating a PI, usually by the insertion
              of an immediately preceding #xA.
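
As a rough illustration of Rules 1 to 3 for elements (PIs are left
out), here is a minimal Python sketch. It is not part of the WD or of
any existing tool: the helper name line_oriented is hypothetical, the
attribute escaping is simplified, and empty elements are always
written as start-tag/end-tag pairs, which is an assumption on my part.

```python
import difflib
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def line_oriented(elem):
    """Serialize elem with one markup item per line (hypothetical helper)."""
    # Rule 1a: the generic identifier of a start-tag is followed by #xA.
    parts = ["<" + elem.tag + "\n"]
    # Rule 2: one attribute specification per line, in lexicographic order.
    for name in sorted(elem.attrib):
        value = escape(elem.attrib[name], {'"': "&quot;", "\n": "&#10;"})
        parts.append(name + '="' + value + '"\n')
    # Rule 3a: the '>' therefore starts on a new line; data follows it directly.
    parts.append(">")
    parts.append(escape(elem.text or ""))
    for child in elem:
        parts.append(line_oriented(child))
        parts.append(escape(child.tail or ""))
    # Rules 1b and 3b: newline after the end-tag's generic identifier.
    parts.append("</" + elem.tag + "\n>")
    return "".join(parts)


if __name__ == "__main__":
    # Two documents differing in one attribute value; the diff isolates it.
    old = line_oriented(ET.fromstring('<doc a="1" b="2">text</doc>'))
    new = line_oriented(ET.fromstring('<doc a="1" b="3">text</doc>'))
    for line in difflib.unified_diff(old.split("\n"), new.split("\n"), lineterm=""):
        print(line)
```

Because every attribute specification sits on its own line, a changed
attribute value shows up as a one-line change in the diff rather than
as a change to a whole tag.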

Because it eliminates the new-lines following PIs in the current
proposal, and significantly enhances the utility of line-oriented text
processing tools in dealing with canonicalized documents, I believe
this alternative is worth considering.

That is, if I haven't missed something crushingly obvious:)


Arjun

Received on Sunday, 20 February 2000 19:11:24 UTC