- From: Innovimax W3C <innovimax+w3c@gmail.com>
- Date: Sun, 21 Sep 2014 11:47:33 +0200
- To: Dennis Hamilton <dennis.hamilton@acm.org>
- Cc: "public-change@w3.org" <public-change@w3.org>
- Message-ID: <CAAK2GfGt+cZQoG3j_bhkBmPjkbv5r6E2RdrtzRobhepq7YDpXg@mail.gmail.com>
Dennis,

I think that we cannot access the document you are referring to.

My understanding is that ISO SC 34/WG4 is the group working on a formal
ODF <-> OOXML representation. If they have already published something
usable, we should definitely use it; if not, I think we can stick to
simple block-versus-inline.

Regards,

Mohamed

On Sun, Sep 21, 2014 at 5:44 AM, Dennis E. Hamilton
<dennis.hamilton@acm.org> wrote:

> I noticed in DChanges discussion that there is recurring reference to
> abstraction to a few logical units with which a general change-tracking
> approach must deal. I recall seeing that elsewhere, and it occurred to me
> that there may be either an unfounded assumption or a misunderstanding
> about the material differences in all the ways one considers simple units
> such as paragraphs, tables, lists, formatted text-character sequences, etc.
>
> I want to look at this from two angles: first, the reality of ODF and its
> abstractions; secondly, what work is going on about reverse-abstracting
> from document-file formats for purposes of conversion, since that may
> inform change-tracking as well.
>
> THE GROUNDING OF ODF LOGICAL UNITS
>
> The Relax NG Schema for the structure and content of ODF documents
> consists of over 18,000 lines of XML. You can see it at
> <http://nfoworks.org/notes/2014/05/n140504f.htm>.
>
> The ODF 1.2 specification is in three parts. The first defines the
> document structure in ways that matter for change-tracking. The second part
> defines OpenFormula, new in ODF 1.2. The third part defines the packaging
> technique, using a form of Zip, embedded XML documents, and other embedded
> content in the multi-part composition of single ODF 1.2 format document
> files. The PDF for Part 1 has 846 pages and its table of contents is 73 of
> those pages. I think that should signal concern that some of these matters
> are not so simple as we like to presume when starting out.
>
> Sticking with ODF Text documents, the content of the document is basically
> determined in a single <office:text> element. <office:text> constituents
> are some prologue elements followed by a sequence of none or more elements
> of text-content type. There is no character-string content directly in the
> flow of the text-content elements. There are over 30 text-content elements
> including ones for tables, headings, lists, paragraphs, document sections,
> indexes that can be anywhere, including tables of content, and various
> shapes that are to be interspersed in the layout, including text frames
> with their own text-content flow, images, drawings and captions and labels
> on all of those. In ODF, the three empty-element markers that reflect
> incidence of tracked changes are also text-content elements.
>
> Several text-content elements (beyond just the text-content paragraph
> element) contain, in addition to attributes specific to their function, a
> sequence of one or more paragraph-content elements. Character-strings for
> text to be formatted and delivered in the presentation of the document are
> interspersed in the paragraph-content flow. There are other places where
> formatted character-strings appear, but paragraph-content flow is the main
> one. The three tracked changes markers are also of paragraph-content
> type. There are over 100 paragraph-content elements and some of them can
> also nest paragraph-content flow and even text-content flow. Which is to
> say, there is almost nothing that cannot appear in a paragraph-content
> flow, according to the schema.
>
> This suggests that paragraph is not a particularly primitive thing as far
> as ODF is concerned. That may be difficult to abstract to the notion of a
> simple abstract paragraph and other simple abstract entities. Deletions
> can cut through and capture amazing aggregations of elements and text, and
> sunder existing elements in so doing.
> Likewise, insertions can require
> repairs around the seams that occur where an insertion begins and where it
> ends, compared to how those two were joined (or a deletion occurred as part
> of an act of substitution). Abstracting that will also be tricky.
>
> I'm not saying this is insurmountable. I'm saying it is tricky for ODF
> and doubtless so for OOXML. Getting to common abstractions suitable for
> working across document-file formats might not be as simple as using the
> same names would suggest.
>
> EFFORTS SO FAR
>
> There are three efforts I know of that have attempted to abstract ODF and
> OOXML sufficient to provide high-fidelity conversion between the two
> document-file formats. While that is not exactly what CTMarkup is about, I
> think the challenge of common abstraction is relevant.
>
> One effort was to find a characterization that would enable conversion.
> This is necessarily a kind of reverse-engineering to an abstraction level
> where it is clear what is essential about one document file for correct
> reflection in a conversion to the other format. The effort was carried out
> by Fraunhofer Institute on behalf of an ISO/IEC working group created
> specifically to consider interoperability between ODF and OOXML at the very
> least. Here's what is currently available:
> <http://www.fokus.fraunhofer.de/en/elan/projekte/international/dokumenteninteroperabilitaet/index.html>.
> There is a significant pay-wall on one technical report,
> <http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=45245>.
> However, it is also available as a free publicly available specification at
> <http://standards.iso.org/ittf/PubliclyAvailableStandards/c045245_ISO_IEC_TR_29166_2011.zip>.
>
> Another effort comes from Beijing. That investigation is attempting to
> devise a model in which mapping up into abstractions and then down again is
> the approach.
> I have seen only intermediate work and I can't tell what is
> available to the public. I trust that this work will lead to another
> ISO/IEC Technical Report.
>
> The third effort involved an attempt to identify safe templates for
> classes of documents where the version of the template for Microsoft Office
> and the version of the template for OpenOffice.org would effectively
> profile documents for which inter-conversion (or interchange in the same
> format, whether OOXML or ODF) would work cleanly across the major
> implementations. I know that there was participation from Microsoft
> Research Cambridge (UK), but I don't know what the results were and how
> useful they turned out to be. (I think of this as a very practical case of
> profiling.) This *might* be tied to the Fraunhofer FOKUS white paper on
> Document Interoperability: ODF-OOXML linked on the Fraunhofer Institute
> page. The download request form does not work for me.
>
> My only point is that this is seriously non-trivial and it is important to
> avoid diving into a solution that works for simple cases and cannot be
> scaled out to handle all the details. Finding the simplest that can
> possibly work is important, but as Einstein did not recall saying, "Things
> should be kept as simple as possible, but no simpler."
>
> -- Dennis E. Hamilton
> dennis.hamilton@acm.org +1-206-779-9430
> https://keybase.io/orcmid PGP F96E 89FF D456 628A
> X.509 certs used and requested for signed e-mail

--
Innovimax SARL
Consulting, Training & XML Development
9, impasse des Orteaux
75020 Paris
Tel : +33 9 52 475787
Fax : +33 1 4356 1746
http://www.innovimax.fr
RCS Paris 488.018.631
SARL au capital de 10.000 €
Received on Sunday, 21 September 2014 09:48:03 UTC