- From: Steven Pemberton <steven.pemberton@cwi.nl>
- Date: Tue, 18 Apr 2023 01:01:34 +0000
- To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, ixml <public-ixml@w3.org>
I have had a similar task in the past, converting a large (400kb) troff file to xhtml, which I did at the time with a sed script. I can concur with a couple of points you made:

* Doing it with a script, which records the transformations, is much better than doing it in emacs
* ixml would have made it easier because it would have allowed me to more easily transform larger structures than the line-based ones possible with sed, and would have helped getting the bracketing structure right (and I wouldn't have had to fight with getting the right number of backslashes:

    s/\\\\o'a\\\\(ga'/\à/
    s/\\\\\\\\ /\\/g
  )

Anyway, it sounds like an interesting task, and I'm looking forward to hearing about the progress (I have written to Meertens to ask if he knows details about the typesetting language; while searching for other possible authors/editors, I discovered that Sintzoff has died -- Lindsey also died 2 months ago -- and so I added a photo of him to wikipedia).

Steven

On Sunday 16 April 2023 23:59:27 (+02:00), C. M. Sperberg-McQueen wrote:

> Another possible use case for invisible XML has presented itself to me.
> I need to provide some background information, and then describe the use
> case. And then I have some questions for the collective wisdom of
> this discussion group.
>
>
> BACKGROUND
>
> This afternoon I learned that Dick Grune's web site has, in a collection
> headed "ALGOL68 Legacy Data", an item described thus:
>
>   * Revised Report on the Algorithmic Language ALGOL 68 (1975)
>     The text in its original, charming, totally unique and totally
>     unexplained format.
>
> A brief extract showing the beginning of the Introduction gives an idea
> of the unexplained format:
>
>     0 325. *
>     0 326. |page||np| *
>     0 327. |h|0. |bs|Introduction|rs| *
>     0 328. *
>     0 329. |h|0.1. Aims and principles of design *
>     0 330. *
>     0 331. |pb1||j|a)|tp1|In designing the Algorithmic Language ALGOL 68, Working Group *
>     0 332. 2.1 on ALGOL of the International Federation for Information Processing *
>     0 333. expresses its belief in the value of a common programming language *
>     0 334. serving many people in many countries. *
>     0 335. |pb1|b)|tp1|ALGOL 68 is designed to communicate algorithms, to execute *
>     0 336. them efficiently on a variety of different computers, and to aid in *
>     0 337. teaching them to students. *
>     0 338. |pb1|c)|tp1|This present Revision of the language is made in response to the *
>     0 339. directive of the parent committee, I.F.I.P.#TC#2, to the Working Group to *
>     0 340. "keep continually under review experience obtained as a consequence of *
>     0 341. this |co|original|cc| publication, so that it may institute such corrections *
>
> [Digression: The Legacy Data web page says "All of [the files] are
> punched card images, in differing formats." What we see above looks to
> me like the line printer output of a program for printing card decks (or
> files) double spaced on a line printer, with line numbers. The actual
> card images appear to begin after the whitespace following the line
> numbers.]
>
> This looks a lot like input to a batch formatter of some kind, and
> the meaning of the control sequences enclosed in |...| can be
> conjectured by comparison with a printed copy of the Algol 68 report.
> The sequence |co| and the sequence |cc| appear to generate a left brace
> and a right brace, respectively, in the printed output, and thus to mark
> the material between them as a "'pragmatic' remark" not formally part of
> the definition of the language. (What some documents call a
> non-normative note.) Perhaps co and cc for 'comment open' and 'comment
> close'?
>
> It would be a lot easier to do interesting things with the text of the
> Algol 68 report if it were available in a decent XML representation.
> There is an HTML version prepared by one Marcel van der Meer, which
> illustrates the kind of thing one could do. It is accessible through
> the Internet Archive's Wayback Machine, but the markup is ...
> disappointing enough, both before and after running it through tidy,
> that I think I would rather work from Grune's original than from that
> HTML, if I ever wanted to do something with the text.
>
>
> USE CASE
>
> Can ixml be used to translate material in formats like this into XML
> that would be easier to work with?
>
> The ixml work flow will be familiar to anyone who has used ixml to work
> with pre-existing data that doesn't have an explicit grammar: make a
> grammar, look at the output, and revise the grammar until the output is
> acceptable for further processing.
>
>
> QUESTIONS
>
> Experience teaches me that there is an alternative to the use of ixml
> for a task like this. I wonder whether there are principled ways to
> choose which work flow to use, when.
>
> A few years ago faced with a similar task (different batch formatter,
> and one whose syntax and semantics I know well: Waterloo GML with a lot
> of native Waterloo Script interspersed) I followed a work flow like
> this:
>
> 1 Translate the non-XML data into XML in the simplest and most direct
>   way: make a document with the same structure and semantics, turning
>   every control sequence in the input into an XML tag. If there are
>   known begin/end pairs, turn them into start- and end-tags for an
>   element; turn other control sequences into empty elements.
>
>   After this step, the beginning of the introduction might look like
>   this:
>
>
>     0. Introduction
>
>     0.1. Aims and principles of design
>
>     a)In designing the Algorithmic Language ALGOL 68, Working Group
>     2.1 on ALGOL of the International Federation for Information Processing
>     ...
>
> 2 Then write a series of transforms to move the markup closer to the
>   desired structure and semantics. Some of the work is likely to be
>   uphill, and feel like markup enrichment / enhancement; some of the
>   work is likely to feel like routine cleanup; some is likely to feel
>   like routine transduction into a new format. A pipeline of multiple
>   smaller steps can be used.
>
> For that task a few years ago, I had no ixml processor available, so I
> did the first step in the simplest possible way -- I probably didn't
> even write a stylesheet, just spent a few hours in emacs with the
> document.
>
> But in general, can we identify rules for dealing with this kind of
> situation? For people like me who program by preference in XSLT and
> XQuery, and find ixml easy to use, I think the general principles might
> be these:
>
> * In general, start with ixml.
>
>   Even if the structure of control sequences is as simple as the regex
>   '\|[^\|]+\|', use ixml, not an editor or a transform with a bunch of
>   regex matches. Ixml provides a record of the transformation (much
>   better than "oh, it was a couple of hours' work in Emacs"). It
>   makes it easy to handle different control sequences differently, for
>   whatever reason.
>
>   And crucially, it makes it a lot easier to capture larger
>   structures.
>
> * Work to capture as much of the structure of the input in your ixml
>   grammar as you can. A context-free grammar can capture structures
>   whose beginnings are marked and whose ends are often not
>   marked. It's possible to recognize implicit structure in XQuery and
>   XSLT, too, and make it explicit, but in my experience (such as it
>   is), doing it in a context-free grammar is simpler and feels
>   lighter-weight. It may be only my imagination that says it's easier
>   to get it right in ixml. But maybe my imagination is telling the
>   truth.
>
> * There will come a time when further improvements to the XML produced
>   by your ixml grammar will become difficult (whatever the word
>   "improvements" ends up meaning in your context). That's when you
>   declare the ixml grammar done and write the rest of your processing
>   pipeline using XSLT and XQuery.
>
> If anyone reading this has further advice, or different advice, I would
> be glad to hear it.
>
> Michael
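
For comparison, the regex-only route that the quoted message advises against can be sketched in a few lines of Python. This is only a rough illustration of step 1 of the work flow (every control sequence becomes an XML tag), not anything either author wrote. The function name `to_xml` and the `PAIRS` table are hypothetical; the only begin/end pair listed is co/cc, which the message itself only conjectures. The sketch also assumes the line numbers and trailing `*` have already been stripped and that every control-sequence name is a valid XML name.

```python
import re
from xml.sax.saxutils import escape

# The regex quoted in the message: a control sequence is |...| with no bar inside.
CONTROL = re.compile(r"\|([^\|]+)\|")

# Hypothetical table of known begin/end pairs. Only co/cc is conjectured
# in the message; anything not listed here becomes an empty element.
PAIRS = {"co": ("co", "open"), "cc": ("co", "close")}


def to_xml(text: str) -> str:
    """Turn each |xx| control sequence into an XML tag, escaping other text."""
    out = []
    pos = 0
    for m in CONTROL.finditer(text):
        # Copy the plain text before this control sequence, XML-escaped.
        out.append(escape(text[pos:m.start()]))
        name = m.group(1)
        if name in PAIRS:
            element, kind = PAIRS[name]
            out.append(f"<{element}>" if kind == "open" else f"</{element}>")
        else:
            out.append(f"<{name}/>")
        pos = m.end()
    # Copy whatever text follows the last control sequence.
    out.append(escape(text[pos:]))
    return "".join(out)


if __name__ == "__main__":
    print(to_xml("|h|0. |bs|Introduction|rs|"))
    print(to_xml("this |co|original|cc| publication"))
```

Nothing here checks that a `|co|` is ever closed by a `|cc|`, so the output is not guaranteed to be well-formed. That bracketing check is the kind of larger structure both messages expect a grammar to handle more easily than line-based or regex-based tools.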
Received on Tuesday, 18 April 2023 01:02:01 UTC