Re: Is abstraction to simple logical units a naive assumption? from Innovimax W3C on 2014-09-21 (public-change@w3.org from September 2014)

From: Innovimax W3C <innovimax+w3c@gmail.com>
Date: Sun, 21 Sep 2014 11:47:33 +0200
To: Dennis Hamilton <dennis.hamilton@acm.org>
Cc: "public-change@w3.org" <public-change@w3.org>
Message-ID: <CAAK2GfGt+cZQoG3j_bhkBmPjkbv5r6E2RdrtzRobhepq7YDpXg@mail.gmail.com>
Dennis,

I don't think we have access to the document you are referring to.

My understanding is that ISO SC 34/WG4 is the group working on a formal
ODF <-> OOXML mapping.

If they have already published something usable, we should definitely use it.

If not, I think we can stick to a simple block-versus-inline distinction.
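
To make the block-versus-inline idea concrete, here is a minimal sketch, assuming we only look at a handful of ODF element names. The BLOCK, INLINE and MARKER sets and the classify/survey helpers are hypothetical and deliberately incomplete; the real schema has far more text-content and paragraph-content elements than this, so anything not listed is reported as "unknown" rather than guessed.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"
DRAW_NS = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"


def q(ns: str, local: str) -> str:
    """Build the '{namespace}local' form that ElementTree uses for tags."""
    return f"{{{ns}}}{local}"


# Hypothetical, incomplete classification (an assumption, not taken from the spec).
BLOCK = frozenset({q(TEXT_NS, "p"), q(TEXT_NS, "h"), q(TEXT_NS, "list"),
                   q(TEXT_NS, "section"), q(TABLE_NS, "table")})
INLINE = frozenset({q(TEXT_NS, "span"), q(TEXT_NS, "a"),
                    q(TEXT_NS, "s"), q(TEXT_NS, "line-break")})
# The three empty-element tracked-change markers.
MARKER = frozenset({q(TEXT_NS, "change-start"), q(TEXT_NS, "change-end"),
                    q(TEXT_NS, "change")})


def classify(tag: str) -> str:
    """Return 'block', 'inline', 'marker', or 'unknown' for a qualified tag."""
    if tag in BLOCK:
        return "block"
    if tag in INLINE:
        return "inline"
    if tag in MARKER:
        return "marker"
    return "unknown"


def survey(xml_text: str) -> list[tuple[str, str]]:
    """List (local name, class) for every descendant of the root, in document order."""
    root = ET.fromstring(xml_text)
    return [(el.tag.split("}")[1], classify(el.tag))
            for el in root.iter() if el is not root]


# Made-up fragment for illustration only.
SAMPLE = f"""<office:text xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}"
    xmlns:table="{TABLE_NS}" xmlns:draw="{DRAW_NS}">
  <text:h>Title</text:h>
  <text:p>Hello <text:span>bold</text:span><text:change-start
      text:change-id="c1"/> world<draw:frame/></text:p>
  <table:table/>
</office:text>"""

if __name__ == "__main__":
    for name, kind in survey(SAMPLE):
        print(f"{name:14s} {kind}")
```

Even in this tiny fragment the draw:frame inside the paragraph comes back as "unknown", which is a fair picture of where a two-way split would need more categories or an explicit fallback.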

Regards,

Mohamed

On Sun, Sep 21, 2014 at 5:44 AM, Dennis E. Hamilton <dennis.hamilton@acm.org> wrote:

> I noticed in DChanges discussion that there is recurring reference to
> abstraction to a few logical units with which a general change-tracking
> approach must deal. I recall seeing that elsewhere, and it occurred to me
> that there may be either an unfounded assumption or a misunderstanding
> about the material differences in all the ways one considers simple units
> such as paragraphs, tables, lists, formatted text-character sequences, etc.
>
> I want to look at this from two angles: first, the reality of ODF and its
> abstractions; secondly, what work is going on about reverse-abstracting
> from document-file formats for purposes of conversion, since that may
> inform change-tracking as well.
>
> THE GROUNDING OF ODF LOGICAL UNITS
>
> The Relax NG Schema for the structure and content of ODF documents
> consists of over 18,000 lines of XML.  You can see it at
> <http://nfoworks.org/notes/2014/05/n140504f.htm>.
>
> The ODF 1.2 specification is in three parts.  The first defines the
> document structure in ways that matter for change-tracking. The second part
> defines OpenFormula, new in ODF 1.2.  The third part defines the packaging
> technique, using a form of Zip, embedded XML documents, and other embedded
> content in the multi-part composition of a single ODF 1.2 format document
> file.  The PDF for Part 1 has 846 pages and its table of contents is 73 of
> those pages.  I think that should signal concern that some of these matters
> are not so simple as we like to presume when starting out.
>
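
For readers who have not looked inside an ODF file, the packaging described above can be sketched with Python's standard zipfile module. This is only an illustration under stated assumptions: the package is built in memory, the XML bodies are empty placeholders rather than valid ODF content, and the helper names (build_sample_package, list_parts, read_mimetype) are hypothetical.

```python
import io
import zipfile

ODT_MIMETYPE = "application/vnd.oasis.opendocument.text"


def build_sample_package() -> bytes:
    """Build a tiny ODF-like Zip package in memory (placeholder content only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry comes first and is stored uncompressed.
        zf.writestr("mimetype", ODT_MIMETYPE, compress_type=zipfile.ZIP_STORED)
        # Placeholder bodies: a real package carries full XML documents here.
        zf.writestr("content.xml", "<placeholder/>")
        zf.writestr("META-INF/manifest.xml", "<placeholder/>")
    return buf.getvalue()


def list_parts(package: bytes) -> list[str]:
    """Return the names of the parts in the package, in archive order."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return zf.namelist()


def read_mimetype(package: bytes) -> str:
    """Read the media type declared by the package's mimetype entry."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return zf.read("mimetype").decode("ascii")


if __name__ == "__main__":
    pkg = build_sample_package()
    print(list_parts(pkg))
    print(read_mimetype(pkg))
```

The same two reader functions should work on the bytes of a real .odt file, where content.xml is the part that matters most for change-tracking.
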
> Sticking with ODF Text documents, the content of the document is basically
> determined in a single <office:text> element.  <office:text> constituents
> are some prologue elements followed by a sequence of zero or more elements
> of text-content type.  There is no character-string content directly in the
> flow of the text-content elements.  There are over 30 text-content elements
> including ones for tables, headings, lists, paragraphs, document sections,
> indexes that can be anywhere, including tables of contents, and various
> shapes that are to be interspersed in the layout, including text frames
> with their own text-content flow, images, drawings and captions and labels
> on all of those.  In ODF, the three empty-element markers that reflect
> incidence of tracked changes are also text-content elements.
>
> Several text-content elements (beyond just the text-content paragraph
> element) contain, in addition to attributes specific to their function, a
> sequence of one or more paragraph-content elements.  Character-strings for
> text to be formatted and delivered in the presentation of the document are
> interspersed in the paragraph-content flow.  There are other places where
> formatted character-strings appear, but paragraph-content flow is the main
> one.  The three tracked changes markers are also of paragraph-content
> type.  There are over 100 paragraph-content elements and some of them can
> also nest paragraph-content flow and even text-content flow. Which is to
> say, there is almost nothing that cannot appear in a paragraph-content
> flow, according to the schema.
>
> This suggests that a paragraph is not a particularly primitive thing as far
> as ODF is concerned.  It may be difficult to abstract that to the notion of
> a simple abstract paragraph and other simple abstract entities.  Deletions
> can cut through and capture amazing aggregations of elements and text, and
> sunder existing elements in so doing.  Likewise, insertions can require
> repairs around the seams where an insertion begins and where it ends,
> compared to how the two sides were joined before (or where a deletion
> occurred as part of an act of substitution).  Abstracting that will also
> be tricky.
>
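
The point about deletions cutting through elements shows up even in a toy model. The sketch below assumes a document is nothing more than a list of plain-text paragraphs, which is exactly the simplification being questioned; Position and delete_range are hypothetical names, and a real ODF deletion would also have to deal with nested spans, frames, lists and tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    para: int    # index of the paragraph
    offset: int  # character offset inside that paragraph


def delete_range(paras: list[str], start: Position, end: Position):
    """Delete text from start up to (not including) end.

    Returns (remaining, removed). When the range crosses paragraph
    boundaries, the first and last paragraphs are cut in two and their
    surviving halves are joined at the seam into one paragraph.
    """
    if (start.para, start.offset) > (end.para, end.offset):
        raise ValueError("start must not come after end")
    first, last = paras[start.para], paras[end.para]
    if start.para == end.para:
        removed = [first[start.offset:end.offset]]
    else:
        # Tail of the first paragraph, every paragraph strictly in between,
        # and the head of the last paragraph.
        removed = ([first[start.offset:]]
                   + paras[start.para + 1:end.para]
                   + [last[:end.offset]])
    merged = first[:start.offset] + last[end.offset:]
    remaining = paras[:start.para] + [merged] + paras[end.para + 1:]
    return remaining, removed


if __name__ == "__main__":
    doc = ["Intro", "Alpha beta", "Gamma", "Delta epsilon"]
    remaining, removed = delete_range(doc, Position(1, 6), Position(3, 6))
    print(remaining)
    print(removed)
```

Here one deletion leaves two paragraphs where there were four, and the removed material is two paragraph fragments plus one whole paragraph. A change-tracking format has to record enough to put the seam back when the change is rejected.
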
> I'm not saying this is insurmountable.  I'm saying it is tricky for ODF
> and doubtless so for OOXML.  Getting to common abstractions suitable for
> working across document-file formats might not be as simple as using the
> same names would suggest.
>
> EFFORTS SO FAR
>
> There are three efforts I know of that have attempted to abstract ODF and
> OOXML sufficiently to provide high-fidelity conversion between the two
> document-file formats.  While that is not exactly what CTMarkup is about, I
> think the challenge of common abstraction is relevant.
>
> One effort was to find a characterization that would enable conversion.
> This is necessarily a kind of reverse-engineering to an abstraction level
> where it is clear what is essential about one document file for correct
> reflection in a conversion to the other format.  The effort was carried out
> by Fraunhofer Institute on behalf of an ISO/IEC working group created
> specifically to consider interoperability between ODF and OOXML at the very
> least.  Here's what is currently available:
> <http://www.fokus.fraunhofer.de/en/elan/projekte/international/dokumenteninteroperabilitaet/index.html>.
> There is a significant pay-wall on one technical report,
> <http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=45245>.
> However, it is also available as a free publicly available specification at
> <http://standards.iso.org/ittf/PubliclyAvailableStandards/c045245_ISO_IEC_TR_29166_2011.zip>.
>
> Another effort comes from Beijing.  That investigation is attempting to
> devise a model in which mapping up into abstractions and then down again is
> the approach.  I have seen only intermediate work and I can't tell what is
> available to the public.  I trust that this work will lead to another
> ISO/IEC Technical Report.
>
> The third effort involved an attempt to identify safe templates for
> classes of documents where the version of the template for Microsoft Office
> and the version of the template for OpenOffice.org would effectively
> profile documents for which inter-conversion (or interchange in the same
> format, whether OOXML or ODF) would work cleanly across the major
> implementations.  I know that there was participation from Microsoft
> Research Cambridge (UK), but I don't know what the results were and how
> useful they turned out to be.  (I think of this as a very practical case of
> profiling.)  This *might* be tied to the Fraunhofer FOKUS white paper on
> Document Interoperability: ODF-OOXML linked on the Fraunhofer Institute
> page.  The download request form does not work for me.
>
> My only point is that this is seriously non-trivial and it is important to
> avoid diving into a solution that works for simple cases and cannot be
> scaled out to handle all the details.  Finding the simplest that can
> possibly work is important, but as Einstein did not recall saying, "Things
> should be kept as simple as possible, but no simpler."
>
>
>  -- Dennis E. Hamilton
>     dennis.hamilton@acm.org    +1-206-779-9430
>     https://keybase.io/orcmid  PGP F96E 89FF D456 628A
>     X.509 certs used and requested for signed e-mail


-- 
Innovimax SARL
Consulting, Training & XML Development
9, impasse des Orteaux
75020 Paris
Tel : +33 9 52 475787
Fax : +33 1 4356 1746
http://www.innovimax.fr
RCS Paris 488.018.631
SARL au capital de 10.000 €
Received on Sunday, 21 September 2014 09:48:03 UTC