Is abstraction to simple logical units a naive assumption? from Dennis E. Hamilton on 2014-09-21 (public-change@w3.org from September 2014)

From: Dennis E. Hamilton <dennis.hamilton@acm.org>
Date: Sat, 20 Sep 2014 20:44:58 -0700
To: <public-change@w3.org>
Message-ID: <009d01cfd54e$6b141f80$413c5e80$@acm.org>

I noticed in the DChanges discussion a recurring reference to abstracting documents down to a few logical units with which a general change-tracking approach must deal. I recall seeing that elsewhere, and it occurred to me that there may be either an unfounded assumption or a misunderstanding about the material differences among the ways one treats seemingly simple units such as paragraphs, tables, lists, formatted text-character sequences, etc.

I want to look at this from two angles: first, the reality of ODF and its abstractions; second, the work under way on reverse-abstracting from document-file formats for purposes of conversion, since that may inform change-tracking as well.

THE GROUNDING OF ODF LOGICAL UNITS

The RELAX NG schema for the structure and content of ODF documents consists of over 18,000 lines of XML. You can see it at <http://nfoworks.org/notes/2014/05/n140504f.htm>.

The ODF 1.2 specification is in three parts. The first defines the document structure in ways that matter for change-tracking. The second part defines OpenFormula, new in ODF 1.2. The third part defines the packaging technique, using a form of Zip, embedded XML documents, and other embedded content in the multi-part composition of single ODF 1.2 format document files. The PDF for Part 1 has 846 pages and its table of contents is 73 of those pages. I think that should signal concern that some of these matters are not so simple as we like to presume when starting out.
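
To make the Part 3 packaging idea concrete, here is a minimal Python sketch that builds a toy Zip package in memory and lists its parts. The part names (mimetype, content.xml, META-INF/manifest.xml) are ones found in ODF packages, but the XML bodies are placeholders and the helper names build_toy_package and list_parts are purely illustrative; this is not a valid ODF document.

```python
import io
import zipfile


def build_toy_package() -> bytes:
    """Build an in-memory Zip that imitates the outer shape of an ODF package."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The media type for ODF text documents, written first and uncompressed.
        zf.writestr(
            "mimetype",
            "application/vnd.oasis.opendocument.text",
            compress_type=zipfile.ZIP_STORED,
        )
        # Placeholder bodies: real parts are full XML documents.
        zf.writestr("content.xml", "<placeholder/>")
        zf.writestr("META-INF/manifest.xml", "<placeholder/>")
    return buf.getvalue()


def list_parts(package_bytes: bytes) -> list:
    """Return the names of the parts in the package, in archive order."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        return zf.namelist()


if __name__ == "__main__":
    for name in list_parts(build_toy_package()):
        print(name)
```

Even this toy shows that "the document" is a multi-part composition, so anything that tracks changes has to decide which parts it is tracking.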

Sticking with ODF Text documents, the content of the document is basically determined in a single <office:text> element. <office:text> constituents are some prologue elements followed by a sequence of zero or more elements of text-content type. There is no character-string content directly in the flow of the text-content elements. There are over 30 text-content elements including ones for tables, headings, lists, paragraphs, document sections, indexes that can be anywhere, including tables of contents, and various shapes that are to be interspersed in the layout, including text frames with their own text-content flow, images, drawings and captions and labels on all of those. In ODF, the three empty-element markers that reflect incidence of tracked changes are also text-content elements.
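
As a rough illustration of that top-level flow, the following Python sketch parses a hand-written toy <office:text> fragment and lists its immediate children. The fragment is an assumption made for illustration (it is neither complete nor schema-valid), and the helper names are hypothetical; only the namespace URIs and element names are taken from ODF.

```python
import xml.etree.ElementTree as ET

OFFICE = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
TABLE = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"

# Toy fragment for illustration only; not a complete or valid ODF document.
TOY = f"""
<office:text xmlns:office="{OFFICE}" xmlns:text="{TEXT}" xmlns:table="{TABLE}">
  <text:h text:outline-level="1">Heading</text:h>
  <text:p>First paragraph.</text:p>
  <text:change-start text:change-id="ct1"/>
  <text:list><text:list-item><text:p>Item</text:p></text:list-item></text:list>
  <text:change-end text:change-id="ct1"/>
  <table:table table:name="T1"/>
</office:text>
""".strip()


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def top_level_units(xml_text: str) -> list:
    """List the local names of the immediate children of the root element."""
    root = ET.fromstring(xml_text)
    return [local_name(child.tag) for child in root]


def has_direct_character_content(xml_text: str) -> bool:
    """True if non-whitespace text sits directly between the top-level elements."""
    root = ET.fromstring(xml_text)
    pieces = [root.text or ""] + [child.tail or "" for child in root]
    return any(piece.strip() for piece in pieces)


if __name__ == "__main__":
    print(top_level_units(TOY))
    print(has_direct_character_content(TOY))
```

Note that the change markers sit in the same sequence as the heading, paragraph, list, and table, which is one reason a tracked change need not line up with any single "unit".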

Several text-content elements (beyond just the text-content paragraph element) contain, in addition to attributes specific to their function, a sequence of one or more paragraph-content elements. Character-strings for text to be formatted and delivered in the presentation of the document are interspersed in the paragraph-content flow. There are other places where formatted character-strings appear, but paragraph-content flow is the main one. The three tracked changes markers are also of paragraph-content type. There are over 100 paragraph-content elements and some of them can also nest paragraph-content flow and even text-content flow. Which is to say, there is almost nothing that cannot appear in a paragraph-content flow, according to the schema.
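
A small Python sketch may help show how a paragraph can carry interspersed character strings, a change marker, and a nested text-content flow all at once. The toy paragraph below is hand-written for illustration and is not taken from any real document; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Toy paragraph for illustration only: a span, a change marker, and a footnote
# whose body contains another paragraph.
TOY_PARAGRAPH = (
    f'<text:p xmlns:text="{TEXT}">Plain, '
    '<text:span text:style-name="T1">bold</text:span>'
    '<text:change text:change-id="ct2"/>'
    ' and a note'
    '<text:note text:note-class="footnote">'
    '<text:note-citation>1</text:note-citation>'
    '<text:note-body><text:p>Nested paragraph.</text:p></text:note-body>'
    '</text:note>'
    '.</text:p>'
)


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def paragraph_paths(xml_text: str) -> list:
    """Return the element path to every text:p, in document order."""
    root = ET.fromstring(xml_text)
    paths = []

    def walk(elem, path):
        path = path + [local_name(elem.tag)]
        if elem.tag == f"{{{TEXT}}}p":
            paths.append("/".join(path))
        for child in elem:
            walk(child, path)

    walk(root, [])
    return paths


def direct_strings(xml_text: str) -> list:
    """Character strings that sit directly in the root paragraph's own flow."""
    root = ET.fromstring(xml_text)
    pieces = [root.text] + [child.tail for child in root]
    return [piece for piece in pieces if piece]


if __name__ == "__main__":
    print(paragraph_paths(TOY_PARAGRAPH))
    print(direct_strings(TOY_PARAGRAPH))
```

The outer paragraph's own text arrives in fragments between child elements, and a second paragraph lives three levels down inside it.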

This suggests that a paragraph is not a particularly primitive thing as far as ODF is concerned. That may be difficult to abstract to the notion of a simple abstract paragraph and other simple abstract entities. Deletions can cut through and capture amazing aggregations of elements and text, and sunder existing elements in so doing. Likewise, insertions can require repairs around the seams where an insertion begins and where it ends, compared to how those two points were originally joined (or where a deletion occurred as part of an act of substitution). Abstracting that will also be tricky.
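
To illustrate how a deletion can both capture and sunder elements, here is a minimal Python sketch. It assumes a deliberately simplified model: generic tag names rather than ODF ones, and a deletion expressed as a character range over the flattened text. The functions element_spans and classify_deletion are hypothetical and describe no existing change-tracking scheme.

```python
import xml.etree.ElementTree as ET

# Generic toy markup (not ODF). Flattened text: "Hello bold worldSecond para"
TOY = "<body><p>Hello <b>bold</b> world</p><p>Second <i>para</i></p></body>"


def element_spans(root):
    """Return (tag, start, end) character offsets for each element, post-order."""
    spans = []
    pos = 0

    def walk(elem):
        nonlocal pos
        start = pos
        pos += len(elem.text or "")
        for child in elem:
            walk(child)
            pos += len(child.tail or "")
        spans.append((elem.tag, start, pos))

    walk(root)
    return spans


def classify_deletion(spans, del_start, del_end):
    """Sort elements by how the deleted range [del_start, del_end) hits them."""
    result = {"captured": [], "sundered": []}
    for tag, start, end in spans:
        if del_start <= start and end <= del_end:
            # The whole element lies inside the deleted range.
            result["captured"].append(tag)
        elif start < del_end and del_start < end:
            if start <= del_start and del_end <= end:
                # The deletion is wholly interior; the element stays in one piece.
                continue
            # The deletion crosses exactly one of the element's boundaries.
            result["sundered"].append(tag)
    return result


if __name__ == "__main__":
    spans = element_spans(ET.fromstring(TOY))
    # Delete from inside "Hello" through to inside "para".
    print(classify_deletion(spans, 3, 25))
```

In this toy, one deletion swallows the bold run whole and leaves both paragraphs and the italic run cut open, so any abstraction of "delete" has to say what becomes of the remnants.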

I'm not saying this is insurmountable. I'm saying it is tricky for ODF and doubtless so for OOXML. Getting to common abstractions suitable for working across document-file formats might not be as simple as using the same names would suggest.

EFFORTS SO FAR

There are three efforts I know of that have attempted to abstract ODF and OOXML sufficient to provide high-fidelity conversion between the two document-file formats. While that is not exactly what CTMarkup is about, I think the challenge of common abstraction is relevant.

One effort was to find a characterization that would enable conversion. This is necessarily a kind of reverse-engineering to an abstraction level where it is clear what is essential about one document file for correct reflection in a conversion to the other format. The effort was carried out by the Fraunhofer Institute on behalf of an ISO/IEC working group created specifically to consider interoperability between, at the very least, ODF and OOXML. Here's what is currently available:
<http://www.fokus.fraunhofer.de/en/elan/projekte/international/dokumenteninteroperabilitaet/index.html>. There is a significant pay-wall on one technical report, <http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=45245>. However, it is also available as a free publicly available specification at <http://standards.iso.org/ittf/PubliclyAvailableStandards/c045245_ISO_IEC_TR_29166_2011.zip>.

Another effort comes from Beijing. That investigation is attempting to devise a model in which mapping up into abstractions and then down again is the approach. I have seen only intermediate work and I can't tell what is available to the public. I trust that this work will lead to another ISO/IEC Technical Report.

The third effort involved an attempt to identify safe templates for classes of documents where the version of the template for Microsoft Office and the version of the template for OpenOffice.org would effectively profile documents for which inter-conversion (or interchange in the same format, whether OOXML or ODF) would work cleanly across the major implementations. I know that there was participation from Microsoft Research Cambridge (UK), but I don't know what the results were and how useful they turned out to be. (I think of this as a very practical case of profiling.) This *might* be tied to the Fraunhofer FOKUS white paper on Document Interoperability: ODF-OOXML linked on the Fraunhofer Institute page. The download request form does not work for me.

My only point is that this is seriously non-trivial and it is important to avoid diving into a solution that works for simple cases and cannot be scaled out to handle all the details. Finding the simplest thing that can possibly work is important, but as Einstein did not recall saying, "Things should be kept as simple as possible, but no simpler."

-- Dennis E. Hamilton
dennis.hamilton@acm.org +1-206-779-9430
https://keybase.io/orcmid PGP F96E 89FF D456 628A
X.509 certs used and requested for signed e-mail

Received on Sunday, 21 September 2014 03:45:27 UTC