another possible use case for invisible XML

Another possible use case for invisible XML has presented itself to me.
I need to provide some background information, and then describe the use
case.  And then I have some questions for the collective wisdom of
this discussion group.


BACKGROUND

This afternoon I learned that Dick Grune's web site has, in a collection
headed "ALGOL68 Legacy Data", an item described thus:

  * Revised Report on the Algorithmic Language ALGOL 68 (1975)  
    The text in its original, charming, totally unique and totally
    unexplained format.

A brief extract showing the beginning of the Introduction gives an idea
of the unexplained format:

    0   325.      *
    0   326.     |page||np| *
    0   327.     |h|0. |bs|Introduction|rs| *
    0   328.      *
    0   329.     |h|0.1. Aims and principles of design *
    0   330.      *
    0   331.     |pb1||j|a)|tp1|In designing the Algorithmic Language ALGOL 68, Working Group *
    0   332.     2.1 on ALGOL of the International Federation for Information Processing *
    0   333.     expresses its belief in the value of a common programming language *
    0   334.     serving many people in many countries. *
    0   335.     |pb1|b)|tp1|ALGOL 68 is designed to communicate algorithms, to execute *
    0   336.     them efficiently on a variety of different computers, and to aid in *
    0   337.     teaching them to students. *
    0   338.     |pb1|c)|tp1|This present Revision of the language is made in response to the *
    0   339.     directive of the parent committee, I.F.I.P.#TC#2, to the Working Group to *
    0   340.     "keep continually under review experience obtained as a consequence of *
    0   341.     this |co|original|cc| publication, so that it may institute such corrections *

[Digression: The Legacy Data web page says "All of [the files] are
punched card images, in differing formats."  What we see above looks to
me like the output of a program for printing card decks (or files) on a
line printer, double spaced and with line numbers.  The actual card
images appear to begin after the whitespace following the line
numbers.]
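
To make the digression concrete, here is a minimal Python sketch of
recovering the card images from listing lines like those in the extract.
It rests on assumptions read off the extract, not on any documentation:
each line is taken to be a small integer, a line number followed by a
full stop, whitespace, the card image, and a trailing asterisk that is
assumed to be an end-of-line marker and not card content.  The names
LISTING_LINE and card_image are hypothetical.

```python
import re

# Assumed listing layout: "<int> <line number>. <card image> *"
LISTING_LINE = re.compile(r"^\s*\d+\s+(\d+)\.(.*)$")

def card_image(line):
    """Return (line_number, card_text), or None if the line does not fit."""
    m = LISTING_LINE.match(line)
    if m is None:
        return None
    number = int(m.group(1))
    rest = m.group(2).strip()
    # Assumption: a final '*' marks the end of the printed line.
    if rest.endswith("*"):
        rest = rest[:-1].rstrip()
    return number, rest

if __name__ == "__main__":
    sample = "    0   327.     |h|0. |bs|Introduction|rs| *"
    print(card_image(sample))
```

Note that stripping whitespace this way discards any indentation inside
the card image itself, which may or may not matter for the formatter's
input.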

This looks a lot like input to a batch formatter of some kind, and
the meaning of the control sequences enclosed in |...| can be
conjectured by comparison with a printed copy of the Algol 68 report.
The sequence |co| and the sequence |cc| appear to generate a left brace
and a right brace, respectively, in the printed output, and thus to mark
the material between them as a "'pragmatic' remark" not formally part of
the definition of the language.  (What some documents call a
non-normative note.)  Perhaps co and cc for 'comment open' and 'comment
close'?

It would be a lot easier to do interesting things with the text of the
Algol 68 report if it were available in a decent XML representation.
There is an HTML version prepared by one Marcel van der Meer, which
illustrates the kind of thing one could do.  It is accessible through
the Internet Archive's Wayback Machine, but the markup is ...
disappointing enough, both before and after running it through tidy,
that I think I would rather work from Grune's original than from that
HTML, if I ever wanted to do something with the text.


USE CASE

Can ixml be used to translate material in formats like this into XML
that would be easier to work with?

The ixml work flow will be familiar to anyone who has used ixml to work
with pre-existing data that doesn't have an explicit grammar: make a
grammar, look at the output, and revise the grammar until the output is
acceptable for further processing.


QUESTIONS

Experience teaches me that there is an alternative to the use of ixml
for a task like this.  I wonder whether there are principled ways to
choose which work flow to use, when.

A few years ago, faced with a similar task (different batch formatter,
and one whose syntax and semantics I know well: Waterloo GML with a lot
of native Waterloo Script interspersed), I followed a work flow like
this:

  1 Translate the non-XML data into XML in the simplest and most direct
    way: make a document with the same structure and semantics, turning
    every control sequence in the input into an XML tag.  If there are
    known begin/end pairs, turn them into start- and end-tags for an
    element; turn other control sequences into empty elements.

    After this step, the beginning of the introduction might look like
    this:

      <page/><np/> 
      <h/>0. <bs/>Introduction<rs/> 
      
      <h/>0.1. Aims and principles of design 
      
      <pb1/><j/>a)<tp1/>In designing the Algorithmic Language ALGOL 68, Working Group 
      2.1 on ALGOL of the International Federation for Information Processing 
      ...

  2 Then write a series of transforms to move the markup closer to the
    desired structure and semantics.  Some of the work is likely to be
    uphill, and feel like markup enrichment / enhancement; some of the
    work is likely to feel like routine cleanup; some is likely to feel
    like routine transduction into a new format.  A pipeline of multiple
    smaller steps can be used.
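
Step 1 of this work flow can be sketched in a few lines of Python.  This
is only an illustration under stated assumptions, not the method
actually used for the Waterloo GML task: it assumes every control
sequence has the form |name|, that no begin/end pairs are known yet (so
everything becomes an empty element), and that a |...| run whose content
is not a plausible XML name is ordinary text.  The names CONTROL,
XML_NAME, and to_flat_xml are hypothetical.

```python
import re
from xml.sax.saxutils import escape

CONTROL = re.compile(r"\|([^|]+)\|")
# Deliberately narrow approximation of an XML name (ASCII only).
XML_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")

def to_flat_xml(text):
    """Turn each |name| into <name/> and escape the text in between."""
    out = []
    pos = 0
    for m in CONTROL.finditer(text):
        out.append(escape(text[pos:m.start()]))
        name = m.group(1)
        if XML_NAME.fullmatch(name):
            out.append("<" + name + "/>")
        else:
            # Not a usable element name: keep the characters as text.
            out.append(escape(m.group(0)))
        pos = m.end()
    out.append(escape(text[pos:]))
    return "".join(out)

if __name__ == "__main__":
    print(to_flat_xml("|h|0. |bs|Introduction|rs|"))
```

The result is an XML fragment, not a document: a root element still has
to be wrapped around it before an XML parser will accept it.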

For that task a few years ago, I had no ixml processor available, so I
did the first step in the simplest possible way -- I probably didn't
even write a stylesheet, just spent a few hours in emacs with the
document.

But in general, can we identify rules for dealing with this kind of
situation?  For people like me who program by preference in XSLT and
XQuery, and find ixml easy to use, I think the general principles might
be these:

  * In general, start with ixml.

    Even if the structure of control sequences is as simple as the regex
    '\|[^\|]+\|', use ixml, not an editor or a transform with a bunch of
    regex matches.  Ixml provides a record of the transformation (much
    better than "oh, it was a couple of hours' work in Emacs").  It
    makes it easy to handle different control sequences differently, for
    whatever reason.

    And crucially, it makes it a lot easier to capture larger
    structures.
    
  * Work to capture as much of the structure of the input in your ixml
    grammar as you can. A context-free grammar can capture structures
    whose beginnings are marked and whose ends are often not
    marked. It's possible to recognize implicit structure in XQuery and
    XSLT, too, and make it explicit, but in my experience (such as it
    is), doing it in a context-free grammar is simpler and feels
    lighter-weight.  It may be only my imagination that says it's easier
    to get it right in ixml.  But maybe my imagination is telling the
    truth.

  * There will come a time when further improvements to the XML produced
    by your ixml grammar will become difficult (whatever the word
    "improvements" ends up meaning in your context).  That's when you
    declare the ixml grammar done and write the rest of your processing
    pipeline using XSLT and XQuery.
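
As an illustration of what "beginnings are marked and ends are often not
marked" means for the sample data, here is a minimal Python sketch.
Python is used only to make the idea concrete; in an ixml grammar the
same grouping would be expressed as rules.  It assumes, purely for the
sake of the example, that |h| and |pb1| each open a unit that runs until
the next |h| or |pb1|; the actual meaning of those control sequences is
conjecture.  The names TOKEN and group_units are hypothetical.

```python
import re

TOKEN = re.compile(r"\|([^|]+)\|")

def group_units(text, openers=("h", "pb1")):
    """Split text into (opener, content) pairs.

    Each unit starts at an opener and ends, implicitly, where the next
    opener starts.  Text before the first opener is ignored.
    """
    units = []
    name, start = None, 0
    for m in TOKEN.finditer(text):
        if m.group(1) not in openers:
            continue  # other control sequences stay inside the unit
        if name is not None:
            units.append((name, text[start:m.start()].strip()))
        name, start = m.group(1), m.end()
    if name is not None:
        units.append((name, text[start:].strip()))
    return units

if __name__ == "__main__":
    sample = "|h|0.1. Aims |pb1||j|a)|tp1|In designing |pb1|b)|tp1|ALGOL 68"
    for unit in group_units(sample):
        print(unit)
```

Even in this small case the bookkeeping for "close the previous unit
when the next one opens" has to be written by hand, which is the part a
grammar states declaratively.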

If anyone reading this has further advice, or different advice, I would
be glad to hear it.

Michael



-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com

Received on Monday, 17 April 2023 00:48:16 UTC