RE: Is abstraction to simple logical units a naive assumption? from Dennis E. Hamilton on 2014-09-23 (public-change@w3.org from September 2014)

From: Dennis E. Hamilton <dennis.hamilton@acm.org>
Date: Tue, 23 Sep 2014 10:16:42 -0700
To: <public-change@w3.org>
Message-ID: <008701cfd752$27c425d0$774c7170$@acm.org>
Robin,
 
Thanks. I agree.  I am pre-occupied with the problem of documents at a different level of abstraction, with XML as a medium for the document file format.  I don’t doubt the appeal of a general XML change-tracking approach at levels where the document file is XML in a pure way.
 
I mean to dissuade folks from assuming that what works at the more-or-less pure XML document level will work without regard for the structural conditions that must be honored in how XML is used as a carrier for rich, complex document file formats.  
 
I was delighted to see a list of “naïve assumptions” that some investigators at DChanges had faced in working with ODF.  The first one on their list is that “LibreOffice is an XML application” and that is about the use of ODF, of course.
 
-   Dennis
 
From: Robin LaFontaine [mailto:robin.lafontaine@deltaxml.com] 
Sent: Tuesday, September 23, 2014 09:12
To: public-change@w3.org
Subject: Re: Is abstraction to simple logical units a naive assumption?
 
You have described very well, Dennis, the complexities of trying to do this. I would just add the comment that your focus is 'document-centric' in that it discusses the problem for (XML) formats that represent 'documents'. That is not to say that your observations are incorrect in any way (I agree with them!), but we must not forget the plethora of XML formats such as StratML that are not documents per se but still need and want change-tracking.

The simple logical unit that covers all of these is the XML syntax itself, and going further up the semantic tree (if I can use that term to describe tables, paragraphs, lists etc) seems, as you correctly observe, very difficult.

-- 
Robin La Fontaine
Director
DeltaXML Ltd "Experts in information change"
 
T: +44 1684 592 144 
E: robin.lafontaine@deltaxml.com
W: http://www.deltaxml.com
Malvern Hills Science Park, Malvern, Worcs, WR14 3SZ, UK
Registered in England 02528681 Reg. Office: Monsell House, WR8 0QN, UK
 
On 21/09/2014 16:43, Dennis E. Hamilton wrote:
The ISO/IEC Technical Report from Fraunhofer FOKUS is available for free download at
<http://standards.iso.org/ittf/PubliclyAvailableStandards/c045245_ISO_IEC_TR_29166_2011.zip>. 
 
I just downloaded it without any logon or fee.  (I do not have an SC34 account.)  The title of the 168-page PDF is “Guidelines for translation between ISO/IEC 26300 and ISO/IEC 29500 document formats.”  This is useful for understanding the complexities of those formats for change-tracking purposes.  The extensive use cases may be useful to CTMarkup even though they were being considered with respect to translation between formats.
 
The distinct ODF-OOXML Interoperability paper is not reachable on the FOKUS web site – the request form is not working.  I believe this is because the site is being reconfigured.  This may also be redundant considering that TR 29166 is available.  I can’t be certain.
 
The work-in-progress by the Beijing participant on SC34 WG5 is circulated privately and is not available to the public.  I believe the intention is to produce a TR of it as experimental work.  I must check with the researcher.  The proposal I am aware of is to go beyond the current TR 29166 and arrive at technical measures of interoperability using a topic map hierarchy that deals with over 300 features (e.g., paragraph explodes into considerable richness along with consideration of styling such as for drop caps and other details).
 
I should add that, when styling is considered, the variations applicable to text-content elements and paragraph-content elements in ODF and OOXML counterparts are daunting.  These may collapse to small generic cases, but that remains to be determined.
 
-   Dennis
 
 
 
From: innovimax@gmail.com [mailto:innovimax@gmail.com] On Behalf Of Innovimax W3C
Sent: Sunday, September 21, 2014 02:48
To: Dennis Hamilton
Cc: public-change@w3.org
Subject: Re: Is abstraction to simple logical units a naive assumption?
 
Dennis,
I think that we cannot have access to the document you are referring to.

My understanding is that ISO SC 34/ WG4 is the group working on a formal representation of ODF <-> OOXML representation.
If they have already published something usable, we should definitely use it.
If not, I think we can stick to simple block-versus-inline.
Regards,
Mohamed
 
On Sun, Sep 21, 2014 at 5:44 AM, Dennis E. Hamilton <dennis.hamilton@acm.org> wrote:
I noticed in DChanges discussion that there is recurring reference to abstraction to a few logical units with which a general change-tracking approach must deal. I recall seeing that elsewhere, and it occurred to me that there may be either an unfounded assumption or a misunderstanding about the material differences in all the ways one considers simple units such as paragraphs, tables, lists, formatted text-character sequences, etc.

I want to look at this from two angles: first, the reality of ODF and its abstractions; secondly, what work is going on about reverse-abstracting from document-file formats for purposes of conversion, since that may inform change-tracking as well.

THE GROUNDING OF ODF LOGICAL UNITS

The Relax NG Schema for the structure and content of ODF documents consists of over 18,000 lines of XML.  You can see it at <http://nfoworks.org/notes/2014/05/n140504f.htm>.

The ODF 1.2 specification is in three parts.  The first defines the document structure in ways that matter for change-tracking. The second part defines OpenFormula, new in ODF 1.2.  The third part defines the packaging technique, using a form of Zip, embedded XML documents, and other embedded content in the multi-part composition of single ODF 1.2 format document files.  The PDF for Part 1 has 846 pages and its table of contents is 73 of those pages.  I think that should signal concern that some of these matters are not so simple as we like to presume when starting out.
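
To make the packaging point concrete, here is a minimal sketch of looking inside such a Zip-based package. It builds a toy package in memory rather than reading a real document; the helper names (build_toy_package, list_parts, read_mimetype) are hypothetical, and the placeholder XML parts are not valid ODF content. The only details taken as given are the conventional part names (mimetype, content.xml, META-INF/manifest.xml) and the text-document media type.

```python
import io
import zipfile

# Media type string used for ODF text documents.
ODT_MIMETYPE = "application/vnd.oasis.opendocument.text"


def build_toy_package():
    """Build a tiny ODF-like Zip package in memory (illustrative only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The mimetype part is written first and stored without compression.
        zf.writestr("mimetype", ODT_MIMETYPE, compress_type=zipfile.ZIP_STORED)
        # Placeholder XML parts; a real package carries full documents here.
        zf.writestr("content.xml", "<placeholder/>",
                    compress_type=zipfile.ZIP_DEFLATED)
        zf.writestr("META-INF/manifest.xml", "<placeholder/>",
                    compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


def list_parts(package):
    """Return the names of the parts in the package, in stored order."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return zf.namelist()


def read_mimetype(package):
    """Return the media type declared by the package's mimetype part."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return zf.read("mimetype").decode("ascii")


if __name__ == "__main__":
    pkg = build_toy_package()
    print(list_parts(pkg))
    print(read_mimetype(pkg))
```

Any change-tracking that works on the XML alone sees only one part of this multi-part composition, which is part of why the single-XML-document view is incomplete.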

Sticking with ODF Text documents, the content of the document is basically determined in a single <office:text> element.  The constituents of <office:text> are some prologue elements followed by a sequence of zero or more elements of text-content type.  There is no character-string content directly in the flow of the text-content elements.  There are over 30 text-content elements, including ones for tables, headings, lists, paragraphs, document sections, indexes that can be anywhere (including tables of contents), and various shapes that are to be interspersed in the layout, including text frames with their own text-content flow, images, drawings, and captions and labels on all of those.  In ODF, the three empty-element markers that reflect the incidence of tracked changes are also text-content elements.
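
As a rough illustration of that text-content flow, the following sketch tallies the direct children of <office:text> in a small hand-written fragment. The fragment is an assumption made for illustration: it is not schema-valid (for example, the table is empty and the prologue is omitted), and the function names are hypothetical. The namespace URIs and element names are the ones ODF uses.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
}

# Hand-written, simplified fragment (not a schema-valid document).
SAMPLE = """<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0">
  <office:body>
    <office:text>
      <text:h text:outline-level="1">Title</text:h>
      <text:p>First paragraph.</text:p>
      <text:change-start text:change-id="ct1"/>
      <text:p>Inserted paragraph.</text:p>
      <text:change-end text:change-id="ct1"/>
      <table:table table:name="T1"/>
    </office:text>
  </office:body>
</office:document-content>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def top_level_units(content_xml):
    """Count the direct children of <office:text> by local element name."""
    root = ET.fromstring(content_xml)
    body_text = root.find("office:body/office:text", NS)
    return Counter(local_name(child.tag) for child in body_text)


if __name__ == "__main__":
    for name, count in sorted(top_level_units(SAMPLE).items()):
        print(name, count)
```

Even in this toy fragment the tracked-change markers sit as siblings of the headings, paragraphs, and tables, so a tally of "logical units" already has to decide what to do with them.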

Several text-content elements (beyond just the text-content paragraph element) contain, in addition to attributes specific to their function, a sequence of one or more paragraph-content elements.  Character-strings for text to be formatted and delivered in the presentation of the document are interspersed in the paragraph-content flow.  There are other places where formatted character-strings appear, but paragraph-content flow is the main one.  The three tracked changes markers are also of paragraph-content type.  There are over 100 paragraph-content elements and some of them can also nest paragraph-content flow and even text-content flow. Which is to say, there is almost nothing that cannot appear in a paragraph-content flow, according to the schema.
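
The interspersing of character strings and elements can be sketched by walking one paragraph in document order. The sample paragraph below is hand-written for illustration and the flow function is a hypothetical helper; only the text namespace URI and the element names (text:p, text:span, text:change) come from ODF.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Hand-written paragraph: characters, a styled span, and a tracked-change
# marker sitting in the middle of the span's character content.
SAMPLE_P = (
    f'<text:p xmlns:text="{TEXT_NS}">'
    'Plain <text:span text:style-name="T1">bold '
    '<text:change text:change-id="ct2"/>and</text:span>'
    ' tail.</text:p>'
)


def flow(elem, depth=0):
    """Yield (depth, kind, value) items in document order for one flow."""
    if elem.text:
        yield (depth, "chars", elem.text)
    for child in elem:
        # Report the child element by its local name, then descend into it.
        yield (depth, "elem", child.tag.rsplit("}", 1)[-1])
        yield from flow(child, depth + 1)
        # Characters that follow the child belong to the parent's flow.
        if child.tail:
            yield (depth, "chars", child.tail)


if __name__ == "__main__":
    for item in flow(ET.fromstring(SAMPLE_P)):
        print(item)
```

The visible text "Plain bold and tail." is spread over four character runs at two nesting depths, with a change marker between two of them, which is the kind of mixed content a paragraph-level abstraction has to flatten or preserve.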

This suggests that paragraph is not a particularly primitive thing as far as ODF is concerned.  That may be difficult to abstract to the notion of a simple abstract paragraph and other simple abstract entities.  Deletions can cut through and capture amazing aggregations of elements and text, and sunder existing elements in so doing.  Likewise, insertions can require repairs around the seams that occur where an insertion begins and where it ends, compared to how those two were joined (or a deletion occurred as part of an act of substitution).  Abstracting that will also be tricky.
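
A deliberately oversimplified sketch shows the sundering effect even when paragraphs are modeled as plain strings, which is an assumption far below what ODF actually allows. The delete_range function and its (paragraph index, character offset) positions are hypothetical names for illustration, not part of any ODF tooling.

```python
def delete_range(paragraphs, start, end):
    """Delete from start up to (not including) end in a list of paragraphs.

    Toy model: each paragraph is a plain string and a position is a
    (paragraph_index, character_offset) pair. Returns the new paragraph
    list and the removed pieces, one piece per paragraph touched.
    """
    (sp, so), (ep, eo) = start, end
    if (sp, so) > (ep, eo):
        raise ValueError("start must not come after end")

    if sp == ep:
        removed = [paragraphs[sp][so:eo]]
    else:
        # Tail of the first paragraph, every paragraph fully inside the
        # range, and the head of the last paragraph.
        removed = ([paragraphs[sp][so:]]
                   + paragraphs[sp + 1:ep]
                   + [paragraphs[ep][:eo]])

    # What survives of the first and last paragraphs is joined into one.
    merged = paragraphs[sp][:so] + paragraphs[ep][eo:]
    remaining = paragraphs[:sp] + [merged] + paragraphs[ep + 1:]
    return remaining, removed


if __name__ == "__main__":
    paras = ["Alpha beta", "Gamma", "Delta epsilon"]
    remaining, removed = delete_range(paras, (0, 6), (2, 6))
    print(remaining)
    print(removed)
```

Here one deletion turns three paragraphs into one, and rejecting the change means splitting the merged paragraph again and restoring three pieces. With real paragraph-content flow, each of those pieces could itself contain cut-through spans, frames, or nested flows.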

I'm not saying this is insurmountable.  I'm saying it is tricky for ODF and doubtless so for OOXML.  Getting to common abstractions suitable for working across document-file formats might not be as simple as using the same names would suggest.

EFFORTS SO FAR

There are three efforts I know of that have attempted to abstract ODF and OOXML sufficient to provide high-fidelity conversion between the two document-file formats.  While that is not exactly what CTMarkup is about, I think the challenge of common abstraction is relevant.

One effort was to find a characterization that would enable conversion.  This is necessarily a kind of reverse-engineering to an abstraction level where it is clear what is essential about one document file for correct reflection in a conversion to the other format.  The effort was carried out by Fraunhofer Institute on behalf of an ISO/IEC working group created specifically to consider interoperability between ODF and OOXML at the very least.  Here's what is currently available:
<http://www.fokus.fraunhofer.de/en/elan/projekte/international/dokumenteninteroperabilitaet/index.html>.  There is a significant pay-wall on one technical report, <http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=45245>. However, it is also available as a free publicly available specification at <http://standards.iso.org/ittf/PubliclyAvailableStandards/c045245_ISO_IEC_TR_29166_2011.zip>.

Another effort comes from Beijing.  That investigation is attempting to devise a model in which mapping up into abstractions and then down again is the approach.  I have seen only intermediate work and I can't tell what is available to the public.  I trust that this work will lead to another ISO/IEC Technical Report.

The third effort involved an attempt to identify safe templates for classes of documents where the version of the template for Microsoft Office and the version of the template for OpenOffice.org would effectively profile documents for which inter-conversion (or interchange in the same format, whether OOXML or ODF) would work cleanly across the major implementations.  I know that there was participation from Microsoft Research Cambridge (UK), but I don't know what the results were and how useful they turned out to be.  (I think of this as a very practical case of profiling.)  This *might* be tied to the Fraunhofer FOKUS white paper on Document Interoperability: ODF-OOXML linked on the Fraunhofer Institute page.  The download request form does not work for me.

My only point is that this is seriously non-trivial and it is important to avoid diving into a solution that works for simple cases and cannot be scaled out to handle all the details.  Finding the simplest thing that can possibly work is important, but as Einstein did not recall saying, "Things should be kept as simple as possible, but no simpler."


 -- Dennis E. Hamilton
    dennis.hamilton@acm.org     +1-206-779-9430
    https://keybase.io/orcmid  PGP F96E 89FF D456 628A
    X.509 certs used and requested for signed e-mail

-- 
Innovimax SARL
Consulting, Training & XML Development
9, impasse des Orteaux
75020 Paris
Tel : +33 9 52 475787
Fax : +33 1 4356 1746
http://www.innovimax.fr
RCS Paris 488.018.631
SARL au capital de 10.000 €
Received on Tuesday, 23 September 2014 17:17:25 UTC