Re: another possible use case for invisible XML

I had a similar task in the past, converting a large (400 KB) troff 
file to XHTML, which I did at the time with a sed script.
I can concur with a couple of the points you made:
* Doing it with a script, which records the transformations, is much better 
than doing it in emacs
* ixml would have made it easier because it would have allowed me to 
transform larger structures than the line-based ones possible with sed, 
and would have helped in getting the bracketing structure right. I also 
wouldn't have had to fight with getting the right number of backslashes:


 s/\\\\o'a\\\\(ga'/\à/
 s/\\\\\\\\ /\\/g
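
For comparison, here is a minimal Python sketch of the same kind of replacement, kept as a recorded script. It is only an illustration: the names TROFF_ESCAPES and replace_troff_escapes are hypothetical, and the second table entry is an assumed example rather than something taken from the sed script above. Raw strings and literal matching mean each backslash is written exactly once.

```python
# Hypothetical replacement table: troff escape -> Unicode character.
# Only the first entry corresponds to the sed lines above; the second
# is an assumed example of the same kind.
TROFF_ESCAPES = {
    r"\o'a\(ga'": "\u00e0",  # a overstruck with a grave accent
    r"\o'e\(aa'": "\u00e9",  # e overstruck with an acute accent (assumed)
}


def replace_troff_escapes(text: str) -> str:
    """Replace each known troff escape with its Unicode equivalent."""
    for escape, char in TROFF_ESCAPES.items():
        # str.replace matches literally, so no regex escaping is needed.
        text = text.replace(escape, char)
    return text
```

The table doubles as the record of what was transformed, which is the main advantage over editing by hand.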

Anyway, it sounds like an interesting task, and I'm looking forward to 
hearing about the progress (I have written to Meertens to ask if he knows 
details about the typesetting language; while searching for other possible 
authors/editors, I discovered that Sintzoff has died -- Lindsey also died 2 
months ago -- and so I added a photo of him to Wikipedia).

Steven

On Sunday 16 April 2023 23:59:27 (+02:00), C. M. Sperberg-McQueen wrote:

 > Another possible use case for invisible XML has presented itself to me.
 > I need to provide some background information, and then describe the use
 > case.  And then I have some questions for the collective wisdom of
 > this discussion group.
 > 
 > 
 > BACKGROUND
 > 
 > This afternoon I learned that Dick Grune's web site has, in a collection
 > headed "ALGOL68 Legacy Data", an item described thus:
 > 
 >   * Revised Report on the Algorithmic Language ALGOL 68 (1975)  
 >     The text in its original, charming, totally unique and totally
 >     unexplained format.
 > 
 > A brief extract showing the beginning of the Introduction gives an idea
 > of the unexplained format:
 > 
 >     0   325.      *
 >     0   326.     |page||np| *
 >     0   327.     |h|0. |bs|Introduction|rs| *
 >     0   328.      *
 >     0   329.     |h|0.1. Aims and principles of design *
 >     0   330.      *
 >     0   331.     |pb1||j|a)|tp1|In designing the Algorithmic Language ALGOL 68, Working Group *
 >     0   332.     2.1 on ALGOL of the International Federation for Information Processing *
 >     0   333.     expresses its belief in the value of a common programming language *
 >     0   334.     serving many people in many countries. *
 >     0   335.     |pb1|b)|tp1|ALGOL 68 is designed to communicate algorithms, to execute *
 >     0   336.     them efficiently on a variety of different computers, and to aid in *
 >     0   337.     teaching them to students. *
 >     0   338.     |pb1|c)|tp1|This present Revision of the language is made in response to the *
 >     0   339.     directive of the parent committee, I.F.I.P.#TC#2, to the Working Group to *
 >     0   340.     "keep continually under review experience obtained as a consequence of *
 >     0   341.     this |co|original|cc| publication, so that it may institute such corrections *
 > 
 > [Digression: The Legacy Data web page says "All of [the files] are
 > punched card images, in differing formats."  What we see above looks to
 > me like the line printer output of a program for printing card decks (or
 > files) double spaced on a line printer, with line numbers.  The actual
 > card images appear to begin after the whitespace following the line
 > numbers.]
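
If the digression above is right, recovering the card images is a small preprocessing step. The following Python sketch shows one way it might be done. The layout it assumes (a digit, a line number followed by a full stop, whitespace, the card text, and a trailing " *") is inferred only from the extract, and the names LISTING_LINE and card_image are hypothetical.

```python
import re

# Assumed layout, inferred from the extract: a digit, a line number
# followed by a full stop, whitespace, the card text, a trailing " *".
LISTING_LINE = re.compile(r"^\s*\d+\s+(\d+)\.\s*(.*?)\s*\*\s*$")


def card_image(line: str):
    """Return (line_number, card_text), or None if the line does not fit."""
    m = LISTING_LINE.match(line)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)
```

Note that this discards all whitespace at the start and end of the card text. If leading blanks turn out to be significant in the formatter's input, the pattern would need to keep them.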
 > 
 > This looks a lot like input to a batch formatter of some kind, and
 > the meaning of the control sequences enclosed in |...| can be
 > conjectured by comparison with a printed copy of the Algol 68 report.
 > The sequence |co| and the sequence |cc| appear to generate a left brace
 > and a right brace, respectively, in the printed output, and thus to mark
 > the material between them as a "'pragmatic' remark" not formally part of
 > the definition of the language.  (What some documents call a
 > non-normative note.)  Perhaps co and cc for 'comment open' and 'comment
 > close'?
 > 
 > It would be a lot easier to do interesting things with the text of the
 > Algol 68 report if it were available in a decent XML representation.
 > There is an HTML version prepared by one Marcel van der Meer, which
 > illustrates the kind of thing one could do.  It is accessible through
 > the Internet Archive's Wayback Machine, but the markup is ...
 > disappointing enough, both before and after running it through tidy,
 > that I think I would rather work from Grune's original than from that
 > HTML, if I ever wanted to do something with the text.
 > 
 > 
 > USE CASE
 > 
 > Can ixml be used to translate material in formats like this into XML
 > that would be easier to work with?
 > 
 > The ixml work flow will be familiar to anyone who has used ixml to work
 > with pre-existing data that doesn't have an explicit grammar: make a
 > grammar, look at the output, and revise the grammar until the output is
 > acceptable for further processing.
 > 
 > 
 > QUESTIONS
 > 
 > Experience teaches me that there is an alternative to the use of ixml
 > for a task like this.  I wonder whether there are principled ways to
 > choose which work flow to use, when.
 > 
 > A few years ago, faced with a similar task (different batch formatter,
 > and one whose syntax and semantics I know well: Waterloo GML with a lot
 > of native Waterloo Script interspersed), I followed a work flow like
 > this:
 > 
 >   1 Translate the non-XML data into XML in the simplest and most direct
 >     way: make a document with the same structure and semantics, turning
 >     every control sequence in the input into an XML tag.  If there are
 >     known begin/end pairs, turn them into start- and end-tags for an
 >     element; turn other control sequences into empty elements.
 > 
 >     After this step, the beginning of the introduction might look like
 >     this:
 > 
 >        
 >       0. Introduction 
 >       
 >       0.1. Aims and principles of design 
 >       
 >       a)In designing the Algorithmic Language ALGOL 68, Working Group 
 >       2.1 on ALGOL of the International Federation for Information Processing 
 >       ...
 > 
 >   2 Then write a series of transforms to move the markup closer to the
 >     desired structure and semantics.  Some of the work is likely to be
 >     uphill, and feel like markup enrichment / enhancement; some of the
 >     work is likely to feel like routine cleanup; some is likely to feel
 >     like routine transduction into a new format.  A pipeline of multiple
 >     smaller steps can be used.
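
Step 1 above, in its most direct form, can be sketched in a few lines of Python. This is an illustration, not the method actually used for the GML task: it turns every control sequence into an empty element and makes no attempt to pair begin/end sequences. It assumes that every control sequence name (h, bs, pb1, and so on) is a valid XML name, and the names CONTROL and to_flat_xml are hypothetical.

```python
import re
from xml.sax.saxutils import escape

# A control sequence is assumed to be any run of non-bar characters
# enclosed in vertical bars, e.g. |h| or |pb1|.
CONTROL = re.compile(r"\|([^|]+)\|")


def to_flat_xml(card_text: str) -> str:
    """Turn each |name| into <name/> and escape the text in between."""
    parts = []
    pos = 0
    for m in CONTROL.finditer(card_text):
        parts.append(escape(card_text[pos:m.start()]))  # text before the tag
        parts.append(f"<{m.group(1)}/>")                # the empty element
        pos = m.end()
    parts.append(escape(card_text[pos:]))               # trailing text
    return "".join(parts)
```

Applied to the heading line of the extract, to_flat_xml("|h|0. |bs|Introduction|rs|") gives "<h/>0. <bs/>Introduction<rs/>". The result is a fragment, so a real pipeline would still need to wrap it in a root element.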
 > 
 > For that task a few years ago, I had no ixml processor available, so I
 > did the first step in the simplest possible way -- I probably didn't
 > even write a stylesheet, just spent a few hours in emacs with the
 > document.
 > 
 > But in general, can we identify rules for dealing with this kind of
 > situation?  For people like me who program by preference in XSLT and
 > XQuery, and find ixml easy to use, I think the general principles might
 > be these:
 > 
 >   * In general, start with ixml.
 > 
 >     Even if the structure of control sequences is as simple as the regex
 >     '\|[^\|]+\|', use ixml, not an editor or a transform with a bunch of
 >     regex matches.  Ixml provides a record of the transformation (much
 >     better than "oh, it was a couple of hours' work in Emacs").  It
 >     makes it easy to handle different control sequences differently, for
 >     whatever reason.
 > 
 >     And crucially, it makes it a lot easier to capture larger
 >     structures.
 >     
 >   * Work to capture as much of the structure of the input in your ixml
 >     grammar as you can. A context-free grammar can capture structures
 >     whose beginnings are marked and whose ends are often not
 >     marked. It's possible to recognize implicit structure in XQuery and
 >     XSLT, too, and make it explicit, but in my experience (such as it
 >     is), doing it in a context-free grammar is simpler and feels
 >     lighter-weight.  It may be only my imagination that says it's easier
 >     to get it right in ixml.  But maybe my imagination is telling the
 >     truth.
 > 
 >   * There will come a time when further improvements to the XML produced
 >     by your ixml grammar will become difficult (whatever the word
 >     "improvements" ends up meaning in your context).  That's when you
 >     declare the ixml grammar done and write the rest of your processing
 >     pipeline using XSLT and XQuery.
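
To make concrete what "beginnings are marked and ends are not" means for this input, here is a small Python sketch that closes each paragraph implicitly when the next one starts. It is a stand-in for what an ixml grammar would express declaratively, not a replacement for it. It assumes that |pb1| opens a paragraph, which is only a conjecture from the extract, and the names TOKEN and group_paragraphs are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Either a control sequence |name| or a run of ordinary text.
TOKEN = re.compile(r"\|([^|]+)\||([^|]+)")


def group_paragraphs(text: str, start: str = "pb1") -> ET.Element:
    """Wrap everything from one `start` sequence to the next in a <p>."""
    root = ET.Element("doc")
    current = root  # element currently receiving content
    last = None     # last child of `current`, whose tail receives text
    for m in TOKEN.finditer(text):
        name, chars = m.group(1), m.group(2)
        if name == start:
            # A new paragraph begins; the previous one ends implicitly.
            current = ET.SubElement(root, "p")
            last = None
        elif name is not None:
            # Any other control sequence becomes an empty element.
            last = ET.SubElement(current, name)
        elif last is None:
            current.text = (current.text or "") + chars
        else:
            last.tail = (last.tail or "") + chars
    return root
```

Even this small case needs explicit bookkeeping for the current element, which is the sort of thing a grammar rule handles without extra code.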
 > 
 > If anyone reading this has further advice, or different advice, I would
 > be glad to hear it.
 > 
 > Michael
 > 
 > 
 > 

Received on Tuesday, 18 April 2023 01:02:01 UTC