DChanges: The Challenge of Identifier Immutability

A particular source of frustration at DChanges 2014 was that identifiers in document files are not preserved between consumption and production of an edited or changed version.

OOXML and ODF make numerous uses of identifiers (not merely xml:id and the ID and IDREF types, which are actually in the minority).  Neither format requires that identifiers be preserved from input to output.  It is the referential integrity that must be preserved (if even that), and not the actual identifiers used as part of the structural connections among and across the multi-part objects and the XML document forms used for some of the parts.
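To make the distinction concrete, here is a minimal sketch in Python.  The attribute names "id" and "ref" and the tiny documents are hypothetical, chosen only for illustration; real OOXML and ODF carry identifiers in many different attributes and across multiple parts.  The check passes for any output whose references still resolve, whether or not the identifier values survived.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute names, for illustration only.
ID_ATTR = "id"
REF_ATTR = "ref"

def dangling_references(xml_text):
    """Return the set of referenced identifiers that no element defines."""
    root = ET.fromstring(xml_text)
    defined = {el.get(ID_ATTR) for el in root.iter() if el.get(ID_ATTR) is not None}
    used = {el.get(REF_ATTR) for el in root.iter() if el.get(REF_ATTR) is not None}
    return used - defined

# The consumed document and a produced version whose identifier was rewritten.
consumed = '<doc><note id="n1">x</note><p><mark ref="n1"/></p></doc>'
produced = '<doc><note id="a7">x</note><p><mark ref="a7"/></p></doc>'
# A produced version that rewrote the definition but not the reference.
broken = '<doc><note id="a7">x</note><p><mark ref="n1"/></p></doc>'

print(dangling_references(consumed))
print(dangling_references(produced))
print(dangling_references(broken))
```

Both "consumed" and "produced" have referential integrity even though they share no identifier value; only "broken" leaves a reference unresolved.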

Some of the work on providing versioning information external to the existing structures has been frustrated because it is difficult to assure identifier persistence across processing.  And while there are solutions that work with a given implementation, there is no assurance that other implementations, some having no code base in common, will preserve the necessary identifiers.

For the work of some investigators to proceed, it was necessary to modify software enough to obtain some modicum of invariance from input to output.  This is complicated by the fact that there is no requirement that implementations preserve *injected* identifiers for which there is no provision in the format definition or the implementation.  It is also difficult to convince some implementations to preserve injected content that cohabits in the Zip packages that the multi-part document-file formats OOXML and ODF use.

There have been workarounds. 

Now, it happens that there is a need for comparability of document files that represent changes to other document files.  The rewriting of identifiers, along with other variations in format usage that are equivalent but textually different, confounds naïve assumptions about using DIFF at the XML level, even on canonical XML (which would be an improvement anywhere that the differences include prefixes that bind to namespaces).
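As an illustration of why a textual comparison fails, and of one thing a comparison tool might do about it, here is a minimal sketch in Python.  It is not a proposal from the workshop.  The attribute names "id" and "ref" are hypothetical, and the renaming scheme (number identifiers in order of first definition) assumes that element order is unchanged between the two documents, which real implementations do not guarantee either.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute names, for illustration only.
ID_ATTR = "id"
REF_ATTR = "ref"

def normalize_identifiers(xml_text):
    """Rename identifiers by order of first definition, then canonicalize."""
    root = ET.fromstring(xml_text)
    renamed = {}
    # First pass: give each defined identifier a position-based name.
    for el in root.iter():
        ident = el.get(ID_ATTR)
        if ident is not None and ident not in renamed:
            renamed[ident] = "id%d" % len(renamed)
    # Second pass: rewrite definitions and references consistently.
    for el in root.iter():
        for attr in (ID_ATTR, REF_ATTR):
            value = el.get(attr)
            if value in renamed:
                el.set(attr, renamed[value])
    # ET.canonicalize requires Python 3.8 or later.
    return ET.canonicalize(ET.tostring(root, encoding="unicode"))

doc_a = '<doc><note id="n1">x</note><mark ref="n1"/></doc>'
doc_b = '<doc><note id="a7">x</note><mark ref="a7"/></doc>'
doc_c = '<doc><note id="a7">y</note><mark ref="a7"/></doc>'

# Canonical XML alone still sees different identifier values.
print(ET.canonicalize(doc_a) == ET.canonicalize(doc_b))
# After renaming, only the genuine content change in doc_c remains.
print(normalize_identifiers(doc_a) == normalize_identifiers(doc_b))
print(normalize_identifiers(doc_a) == normalize_identifiers(doc_c))
```

Even this only removes one source of spurious difference; the other equivalent-but-different variations in format usage remain.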

I want to discuss the document integrity, provenance, and verifiability issues separately (for whatever use cases arise there).

However, I suggest that dependence on identifier immutability is problematic and not readily achievable for *interoperable* use of standard formats in interchange.  This is easy to demonstrate.

There are three major software implementations of ODF.  The obvious two are LibreOffice and Apache OpenOffice, both having common ancestry in OpenOffice.org (OO.o to its friends).  These projects are thriving, and I know from the logs that Apache OpenOffice has experienced in excess of 10,000 downloads per day since its first release in 2012.  AOO and LibO are not maintained in synchronization and, until something is done about it, interoperability is likely to decline.  More compelling is the situation with Microsoft Office, which has supported ODF at steadily improving levels since the introduction of a service pack for Office 2007.  This now makes Microsoft Office the most widely distributed implementation of ODF on the planet, although I suspect its ODF usage might be almost negligible in comparison with the multi-platform OO.o descendants.  The point is that there is no commonality of code base to appeal to in this mix.  In addition, Microsoft Office does not support change-tracking in ODF and might never do so unless the specification for it is corrected and it becomes something that can be mapped into and out of Microsoft Office.  This particular situation is acute enough that there are organizations willing to pay for a remedy.

There are other implementations of ODF (and specialized applications built on OOXML) that might or might not benefit from support for a common change-tracking solution, or that have to deal with not supporting change-tracking.  WordPerfect and Google Docs have some modicum of ODF support.  Even Microsoft WordPad (formerly Microsoft Write), delivered with every instance of Microsoft Windows for the desktop, supports ODF and OOXML with subset functionality.  I know there are others out there.  It is tough to assume that identifier immutability is going to be workable across this space within any short time frame, and there is always the legacy situation.  (Apache OpenOffice continues to receive reports from users working with versions of OO.o back as far as 3.4 and earlier.  There are some prepared to take their version of StarOffice to the grave, often because their particular computer is no longer supported by any OO.o descendants.)

Life is messy.

 - Dennis

Received on Saturday, 20 September 2014 20:06:49 UTC