DChanges: Comparability of Higher-Level Documents

I think there are some use cases and requirements here.  

1. THE SITUATION

There are circumstances where it must be possible to assess the integrity and provenance of a change-tracked document.  The case I have in mind is when there is an earlier (possibly change-tracked) version, and a later change-tracked version.

The trick is to be able to assess whether or not the second version is indeed the derivative of the first that the change-tracking asserts.  It might not be, most likely because of discrepancies between the different software used, but possibly also as a result of mischief.

One way to accomplish this is for an assessment procedure to reverse, internally, all of the changes in the second document file that are subsequent to any changes already present in the first document file, and then see whether the two document files are equivalent.
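A minimal sketch of that reversal step, under heavy assumptions: a document is just a list of paragraph strings, each tracked change records the full text it inserted or deleted, and the later file's change list begins with the changes already present in the earlier file.  The names Change, reverse_changes and derives_from are invented for illustration and do not come from any format or tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Change:
    kind: str    # "insert" or "delete"
    index: int   # paragraph position the change applied at
    text: str    # the paragraph that was inserted or deleted

def reverse_changes(paragraphs, changes):
    """Undo the given changes, newest first, and return the earlier state."""
    result = list(paragraphs)
    for change in reversed(changes):
        if change.kind == "insert":
            # The recorded insertion must match what is actually there.
            if result[change.index] != change.text:
                raise ValueError("recorded insertion does not match content")
            del result[change.index]
        elif change.kind == "delete":
            result.insert(change.index, change.text)
        else:
            raise ValueError(f"unknown change kind: {change.kind}")
    return result

def derives_from(earlier, later, later_changes, shared_count):
    """True if undoing the changes made after the earlier version yields it.

    shared_count is the number of leading changes that the earlier
    file already carried; only the changes after those are reversed.
    """
    try:
        rolled_back = reverse_changes(later, later_changes[shared_count:])
    except (ValueError, IndexError):
        return False
    return rolled_back == earlier
```

For example, with earlier = ["Intro", "Body"] and the changes [Change("insert", 1, "New para"), Change("delete", 2, "Body")], the later content ["Intro", "New para"] rolls back to the earlier version, while ["Intro", "Sneaky"] is rejected because the recorded insertion no longer matches.  The plain == at the end stands in for whatever equivalence test is actually appropriate.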

In the DChanges discussion, this prospect was entertained, on the assumption that one can find some canonicalization that allows the equivalence to be assessed despite differences that are in fact irrelevant to the nature of the document but that keep the document files from being lexically comparable.  The lack of identifier persistence is a contributor to this situation.  If different processors bind the same namespaces but to different prefixes, comparability is definitely defeated.  (I suspect that XML canonicalization as used for XML Digital Signatures may deal with the namespace binding case.)
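As one data point on the namespace-binding case: Python's standard library offers xml.etree.ElementTree.canonicalize (Python 3.8 and later), an implementation of C14N 2.0, and its rewrite_prefixes option replaces whatever prefixes a producer chose with sequential ones.  The sketch below uses two made-up fragments and a made-up namespace URI; it only shows that prefix choice stops mattering, not that this is sufficient for any real document format.

```python
from xml.etree.ElementTree import canonicalize

# Same content and namespace URI, but each producer picked its own prefix.
file_a = '<t:p xmlns:t="urn:example:text">Hello <t:span>world</t:span></t:p>'
file_b = ('<text:p xmlns:text="urn:example:text">Hello '
          '<text:span>world</text:span></text:p>')

def same_after_c14n(a, b, rewrite):
    """Compare two XML strings after canonicalization."""
    return (canonicalize(a, rewrite_prefixes=rewrite)
            == canonicalize(b, rewrite_prefixes=rewrite))

print(same_after_c14n(file_a, file_b, rewrite=False))
print(same_after_c14n(file_a, file_b, rewrite=True))
```

Without prefix rewriting the canonical forms keep the original prefixes and so still differ; with it they coincide.  Prefix rewriting is optional in C14N 2.0, so whether a given signature profile applies it is a separate question.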

The basic idea is to remove identifier dependencies.  This is possible because one is not interested in preserving the structure by which editing is possible, but simply in assuring that one has the essence of what is reflected in the presentation (i.e., manifestation) of the document that the document file format instance is carrying.  One can also collapse many structural dependencies, such as those on styles, by flattening all such dependencies down to exactly what applies in the rendering of layout and text strings, in terms of the atomic format controls that apply at any point.
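To make the flattening idea concrete, here is a rough sketch over an invented data model: styles are a dict of name to {"parent", "props"}, and text runs are dicts with an "id", a "style", optional direct "props", and "text".  None of this corresponds to a real format; the point is only that style names and run identifiers drop out once everything is resolved to the properties actually in effect.

```python
def effective_props(style_name, styles):
    """Resolve a style's inheritance chain into one flat property dict."""
    chain = []
    seen = set()
    while style_name is not None:
        if style_name in seen:
            raise ValueError(f"style cycle at {style_name!r}")
        seen.add(style_name)
        style = styles[style_name]
        chain.append(style.get("props", {}))
        style_name = style.get("parent")
    props = {}
    for layer in reversed(chain):   # ancestors first, so children override
        props.update(layer)
    return props

def flatten(runs, styles):
    """Reduce runs to (properties, text) pairs, merging equal neighbours.

    Run ids and style names are deliberately discarded.
    """
    flat = []
    for run in runs:
        props = effective_props(run.get("style"), styles)
        props.update(run.get("props", {}))   # direct formatting wins
        key = tuple(sorted(props.items()))
        if flat and flat[-1][0] == key:
            flat[-1] = (key, flat[-1][1] + run["text"])
        else:
            flat.append((key, run["text"]))
    return flat
```

Two files that split the same text into different runs, or reach the same formatting through differently named styles, flatten to the same list.  How much of a real format can be treated this way is exactly the open question.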

I don't know how far one can go with this.  It is highly format-dependent.  It might be too hard.

2. CONFIRMING DERIVATION OF ONE DOCUMENT FROM ANOTHER

Here is how I see a comparison function, working as sketched above, reporting its results.  The comparison process is clearly heuristic and based on schema awareness.  The comparison procedure should return one of three results:

 1. success - the document files are for manifestly-equivalent documents (a technical term, but basically their essential attributes, in terms of what a user is presented, are the same to some degree of fidelity).

 2. failure - the files are definitely not for manifestly-equivalent documents.

 3. not determined - the procedure was unable to determine one way or the other.

In practice, as cases of (3) arise, the heuristics are upgraded as necessary to reduce the incidence of (3), so long as the resulting determinations (1-2) remain sound.  As a practical matter, this might be a cursory procedure for performance reasons, with (3) triggering the use of deeper forensic tools to go further.  (I flunked a blood donation screening once and they had to send a sample out for more-complicated testing to determine that my blood was actually OK.  I have continued to be a regular donor and the situation did not recur.  If it had recurred, I would no longer have been allowed to donate, because they don't want to take the risk of blood in state (3) being used by mistake and having unfortunate consequences.)
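The three results and the escalation on (3) can be sketched as follows.  Everything here is a stand-in: documents are plain strings, the cursory tier only commits itself on easy cases, and the deeper tier's rule (runs of whitespace are equivalent to a single space) is an arbitrary example of a more expensive heuristic, not a proposal.

```python
from enum import Enum

class Outcome(Enum):
    SUCCESS = 1          # manifestly equivalent
    FAILURE = 2          # definitely not manifestly equivalent
    NOT_DETERMINED = 3   # could not decide either way

def cursory_check(a, b):
    """Cheap tier: answers only when the case is clear-cut."""
    if a == b:
        return Outcome.SUCCESS
    if "".join(a.split()) != "".join(b.split()):
        return Outcome.FAILURE       # differ even ignoring all whitespace
    return Outcome.NOT_DETERMINED    # differ only in whitespace: unsure

def deeper_check(a, b):
    """Slower tier: treats any run of whitespace as a single space."""
    same = " ".join(a.split()) == " ".join(b.split())
    return Outcome.SUCCESS if same else Outcome.FAILURE

def assess(a, b, tiers=(cursory_check, deeper_check)):
    """Run the tiers in order and stop at the first definite answer."""
    for tier in tiers:
        outcome = tier(a, b)
        if outcome is not Outcome.NOT_DETERMINED:
            return outcome
    return Outcome.NOT_DETERMINED
```

The soundness condition shows up as a constraint between tiers: the cursory tier may only report success or failure where every deeper tier would agree, and otherwise must fall back to not determined.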

Notice that there is no suggestion that a DIFF would resolve this better.  We want to get to a place where a format-aware DIFF would come up with no differences to within manifest equivalence.

3. USE CASES AND REQUIREMENTS

 1. In change-tracking, even acceptances and rejections must be retained and be reversible for comparison purposes.  This is also a good requirement for provenance and the ability to extract a technical history of modifications.  (These are not likely to be at a sensible level for a non-technical account of changes at the user's level of abstraction, but annotation of the provenance information could help in that respect, as could linking changes made for a common purpose together in some manner so they don't have to be reported at the individual technical-change level.)

 2. In change-tracking, it is not sufficient to mark the beginning and end of inserted material.  It is necessary to capture elsewhere exactly what was inserted, just as deleted material is captured.  (This could be by saving a digital hash of a canonical form though, apart from the complication that embedded identifiers -- labelings of sources and destinations -- represent.)  This is merely a safeguard on document integrity if this is all that is done.  If the changed documents are digitally signed, this becomes a check on the comparison canonicalization and the process of reversing changes/dispositions.  I can't satisfy myself that this step is necessary, nor am I certain it can be eliminated.  Help!
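For the digital-hash variant in item 2, a minimal sketch might look like this.  The canonical form used here (Unicode NFC normalization plus collapsing of whitespace runs) is purely an assumption for illustration, as are the record layout and function names; it also sidesteps the embedded-identifier complication entirely by working on plain text.

```python
import hashlib
import unicodedata

def canonical_text(text):
    """Assumed canonical form: NFC-normalised, whitespace runs collapsed."""
    return " ".join(unicodedata.normalize("NFC", text).split())

def digest(text):
    """SHA-256 of the canonical form, as a hex string."""
    return hashlib.sha256(canonical_text(text).encode("utf-8")).hexdigest()

def record_insertion(text):
    """What a producer might store alongside the insertion markers."""
    return {"kind": "insert", "sha256": digest(text)}

def insertion_intact(record, text_between_markers):
    """Check the material now between the markers against the stored digest."""
    return record["sha256"] == digest(text_between_markers)
```

A digest of this kind can confirm that what sits between the markers is what was inserted, but unlike the captured text of a deletion it cannot restore anything, which is why it serves only as an integrity safeguard.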

Received on Saturday, 20 September 2014 20:55:46 UTC