- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Sun, 16 Apr 2023 15:59:27 -0600
- To: ixml <public-ixml@w3.org>

Another possible use case for invisible XML has presented itself to me. I need to provide some background information, and then describe the use case. And then I have some questions for the collective wisdom of this discussion group.

BACKGROUND

This afternoon I learned that Dick Grune's web site has, in a collection headed "ALGOL68 Legacy Data", an item described thus:

  * Revised Report on the Algorithmic Language ALGOL 68 (1975)
    The text in its original, charming, totally unique and totally unexplained format.

A brief extract showing the beginning of the Introduction gives an idea of the unexplained format:

    0  325.
    0  326.  |page||np|
    0  327.  |h|0. |bs|Introduction|rs|
    0  328.
    0  329.  |h|0.1. Aims and principles of design
    0  330.
    0  331.  |pb1||j|a)|tp1|In designing the Algorithmic Language ALGOL 68, Working Group
    0  332.  2.1 on ALGOL of the International Federation for Information Processing
    0  333.  expresses its belief in the value of a common programming language
    0  334.  serving many people in many countries.
    0  335.  |pb1|b)|tp1|ALGOL 68 is designed to communicate algorithms, to execute
    0  336.  them efficiently on a variety of different computers, and to aid in
    0  337.  teaching them to students.
    0  338.  |pb1|c)|tp1|This present Revision of the language is made in response to the
    0  339.  directive of the parent committee, I.F.I.P.#TC#2, to the Working Group to
    0  340.  "keep continually under review experience obtained as a consequence of
    0  341.  this |co|original|cc| publication, so that it may institute such corrections

[Digression: The Legacy Data web page says "All of [the files] are punched card images, in differing formats." What we see above looks to me like the line printer output of a program for printing card decks (or files) double spaced on a line printer, with line numbers. The actual card images appear to begin after the whitespace following the line numbers.]
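
The digression's reading of the extract (a carriage-control character, a line number, then the card image) can be made concrete with a small Python sketch. The exact column layout is not documented, so the pattern below is an assumption guessed from the extract, and `card_image` is a hypothetical helper name.

```python
import re

# Assumed layout, guessed from the extract (it is not documented anywhere):
# one carriage-control character, whitespace, a line number, a period,
# optional whitespace, then the card image.
PRINTER_LINE = re.compile(r"^\S\s+\d+\.\s*(.*)$")


def card_image(line: str) -> str:
    """Return the card image from one printer line (hypothetical helper)."""
    line = line.rstrip("\n")
    match = PRINTER_LINE.match(line)
    # Lines that do not fit the assumed layout are passed through unchanged.
    return match.group(1) if match else line


if __name__ == "__main__":
    print(card_image("0  327.  |h|0. |bs|Introduction|rs|"))
    # prints: |h|0. |bs|Introduction|rs|
```

Note that this discards any leading blanks in the card image; if column positions turn out to be significant in the original card decks, the pattern would need to be tightened to a fixed column.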

The extract above looks a lot like input to a batch formatter of some kind, and the meaning of the control sequences enclosed in |...| can be conjectured by comparison with a printed copy of the Algol 68 report. The sequence |co| and the sequence |cc| appear to generate a left brace and a right brace, respectively, in the printed output, and thus to mark the material between them as a "'pragmatic' remark" not formally part of the definition of the language. (What some documents call a non-normative note.) Perhaps co and cc for 'comment open' and 'comment close'?

It would be a lot easier to do interesting things with the text of the Algol 68 report if it were available in a decent XML representation. There is an HTML version prepared by one Marcel van der Meer, which illustrates the kind of thing one could do. It is accessible through the Internet Archive's Wayback Machine, but the markup is ... disappointing enough, both before and after running it through tidy, that I think I would rather work from Grune's original than from that HTML, if I ever wanted to do something with the text.

USE CASE

Can ixml be used to translate material in formats like this into XML that would be easier to work with?

The ixml work flow will be familiar to anyone who has used ixml to work with pre-existing data that doesn't have an explicit grammar: make a grammar, look at the output, and revise the grammar until the output is acceptable for further processing.

QUESTIONS

Experience teaches me that there is an alternative to the use of ixml for a task like this. I wonder whether there are principled ways to choose which work flow to use, when.

A few years ago, faced with a similar task (different batch formatter, and one whose syntax and semantics I know well: Waterloo GML with a lot of native Waterloo Script interspersed), I followed a work flow like this:

1 Translate the non-XML data into XML in the simplest and most direct way: make a document with the same structure and semantics, turning every control sequence in the input into an XML tag. If there are known begin/end pairs, turn them into start- and end-tags for an element; turn other control sequences into empty elements.

  After this step, the beginning of the introduction might look like this:

    <page/><np/>
    <h/>0. <bs/>Introduction<rs/>
    <h/>0.1. Aims and principles of design
    <pb1/><j/>a)<tp1/>In designing the Algorithmic Language ALGOL 68, Working Group
    2.1 on ALGOL of the International Federation for Information Processing
    ...

2 Then write a series of transforms to move the markup closer to the desired structure and semantics. Some of the work is likely to be uphill, and feel like markup enrichment / enhancement; some of the work is likely to feel like routine cleanup; some is likely to feel like routine transduction into a new format. A pipeline of multiple smaller steps can be used.

For that task a few years ago, I had no ixml processor available, so I did the first step in the simplest possible way -- I probably didn't even write a stylesheet, just spent a few hours in Emacs with the document.

But in general, can we identify rules for dealing with this kind of situation?

For people like me who program by preference in XSLT and XQuery, and find ixml easy to use, I think the general principles might be these:

* In general, start with ixml. Even if the structure of control sequences is as simple as the regex '\|[^\|]+\|', use ixml, not an editor or a transform with a bunch of regex matches. Ixml provides a record of the transformation (much better than "oh, it was a couple of hours' work in Emacs").
  It makes it easy to handle different control sequences differently, for whatever reason. And crucially, it makes it a lot easier to capture larger structures.

* Work to capture as much of the structure of the input in your ixml grammar as you can. A context-free grammar can capture structures whose beginnings are marked and whose ends are often not marked. It's possible to recognize implicit structure in XQuery and XSLT, too, and make it explicit, but in my experience (such as it is), doing it in a context-free grammar is simpler and feels lighter-weight. It may be only my imagination that says it's easier to get it right in ixml. But maybe my imagination is telling the truth.

* There will come a time when further improvements to the XML produced by your ixml grammar will become difficult (whatever the word "improvements" ends up meaning in your context). That's when you declare the ixml grammar done and write the rest of your processing pipeline using XSLT and XQuery.

If anyone reading this has further advice, or different advice, I would be glad to hear it.

Michael

-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
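
For readers who want to see what step 1 of the older work flow amounts to, here is a minimal Python sketch. It is offered only to illustrate the transformation, not as a replacement for the ixml-first advice above. It uses the regex quoted in the message (with a capture group added). The function name `tag_controls`, the restriction of control names to letters and digits, and the treatment of co/cc as a begin/end pair are all assumptions; the co/cc pairing in particular is only the conjecture made earlier in the message.

```python
import re
from xml.sax.saxutils import escape

# The regex from the message, with a group around the control name.
CONTROL = re.compile(r"\|([^|]+)\|")
# Assumption: control names usable as XML element names are letters and digits.
NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def tag_controls(text, pairs=None):
    """Turn |name| control sequences into XML tags (hypothetical helper).

    `pairs` maps a begin sequence to its end sequence; the default treats
    co/cc as a pair, which is a conjecture, not a documented fact.
    """
    pairs = {"co": "cc"} if pairs is None else pairs
    closers = {close: open_ for open_, close in pairs.items()}
    out = []
    pos = 0
    for match in CONTROL.finditer(text):
        out.append(escape(text[pos:match.start()]))  # plain text, XML-escaped
        name = match.group(1)
        if not NAME.fullmatch(name):
            out.append(escape(match.group(0)))       # not a usable name: keep as text
        elif name in pairs:
            out.append(f"<{name}>")                  # known begin sequence
        elif name in closers:
            out.append(f"</{closers[name]}>")        # known end sequence
        else:
            out.append(f"<{name}/>")                 # everything else: empty element
        pos = match.end()
    out.append(escape(text[pos:]))
    return "".join(out)


if __name__ == "__main__":
    print(tag_controls("|h|0. |bs|Introduction|rs|"))
    # prints: <h/>0. <bs/>Introduction<rs/>
```

The sketch does not check that begin and end sequences are balanced, so its output is not guaranteed to be well-formed, and a stray vertical bar in the text can cause a following control sequence to be missed. Both are the kind of structural concern that the message suggests a grammar handles more comfortably.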
Received on Monday, 17 April 2023 00:48:16 UTC