- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Sun, 18 Jun 2023 09:03:04 -0600
- To: ixml <public-ixml@w3.org>
This morning I find myself thinking again about a problem I have thought about before without ever finding a solution. I wonder if anyone who reads this list will have a useful idea.

Invisible XML is very helpful for the situations where I need to discern the structure of unmarked text and represent it in XML. This is true whether the unmarked text is in a file by itself or forms the text-data content of an XML element. If I have an ixml parser available in XSLT or XQuery, I can call the parser on the text node of the containing element, or on its string value; if I don't, I can write a stylesheet to dump the relevant text nodes to files. And I can replace the containing element with the XML element produced by the parser, or keep both versions within a containing element. So I may have a formula in a logic text written as

    <formula>ascii transcription of formula</formula>

or with multiple parallel representations of the formula gathered in a formula-group element:

    <formula-group>
      <ascii>ascii transcription of formula</ascii>
      <vxml>XML representation of the formula produced by the ixml parser</vxml>
      <mathml>...</mathml>
      ...
    </formula-group>

So far, so good.

But sometimes what I need to parse is not PCDATA content but mixed content: a mixture of text nodes and XML elements. (And, in practice, also XML comments and processing instructions.) For example, I once wrote code to recognize names of people in a database of information about Roman legal disputes (Trials in the Late Roman Republic, 149-50 BC). It's easy enough to write a grammar to recognize strings like

    Q. Lutatius Catulus (7) cos. 102
    Sex. Lucilius (15) tr. pl. 87

and parse them into praenomen, nomen, cognomen, Realenzyklopädie number (the Quintus Lutatius Catulus mentioned here is the one described by the seventh article under that name in Pauly and Wissowa's Realenzyklopädie), and highest office attained, plus the date of that office. And it's possible to recognize a series of such names and parse each of them.
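A first sketch of such a grammar, in ixml notation, might look like the following. (This is illustrative only: the rule names and the details of office abbreviations and optional parts are guesses, not a reconstruction of the grammar I actually used.)

```ixml
{ Roman personal names of the form "Q. Lutatius Catulus (7) cos. 102" }
name: praenomen, -" ", nomen, -" ", cognomen, -" (", re, -")", (-" ", career)?.
praenomen: letter+, ".".
nomen: word.
cognomen: word.
re: digit+.                      { Realenzyklopädie article number }
career: office, -" ", date.      { highest office attained, with date }
office: abbr++" ".               { e.g. "cos." or "tr. pl." }
-abbr: letter+, ".".
date: digit+.
-word: letter+.
-letter: ["A"-"Z"; "a"-"z"].
-digit: ["0"-"9"].
```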
But since our information about Roman history is sometimes complicated and requires clarification, sometimes what I had to parse was a sequence of names with footnotes interspersed, like a 'defendant' element reading:

    <defendant> (L. or Q.?) Hortensius (2) cos.
    des.?<note><p>Since a magistrate in office could not be prosecuted,
    it seems likely that he was convicted before taking office. See
    Atkinson (1960) 462 n. 108; Swan (1966) 239-40; and Weinrib (1971)
    145 n. 1.</p></note> 108 </defendant>

Second example: in a literate programming system, it is sometimes desired to pretty-print the source code -- in his WEB system, for example, Knuth devotes a lot of effort to typesetting Pascal code in the style pioneered for Algol by Peter Naur and various ACM publications. Some LP systems -- like Knuth's WEB and the later CWEB -- support only a single programming language, because the system includes a parser (of sorts) and typesetter for that language. Some polyglot literate programming systems eschew pretty-printing entirely and display the source code letter by letter as it appears in the source, so it looks very much like what the programmer was used to seeing in vi or emacs, in the days before those editors supported syntax highlighting. Other LP systems, like Norman Ramsey's noweb, support pretty-printing by supplying language-specific pretty-printing filters which handle code in a given language, and by allowing users to supply their own pretty-printing filters to support new languages or to change the styling of typeset code.

Obviously, it's possible to parse fragments of source code in order to recognize their structure and typeset them suitably, styling keywords and variable names differently, and so on. And it's possible to extend the language grammar to deal with the fact that one or more statements, or a condition, or some other bit of code may be replaced by a reference to another code scrap in which that code is given.
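In ixml terms, the grammar extension for WEB-style references might look something like this fragment (a sketch only: most of the statement syntax is elided and whitespace handling is ignored):

```ixml
{ a statement is ordinary Pascal code or a reference to another scrap }
statement: assignment; compound; scrap-ref.
compound: -"begin", statement++-";", -"end".
scrap-ref: -"@<", text, -"@>".    { e.g. @<Variables of the program@> }
-text: ~["@"]*.
```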
But in my literate programming system, cross references to other code scraps are tagged as XML elements. So where Knuth might write

    @<Program to print...@>=
    program print_primes(output);
      const @!m=1000;
      @<Other constants of the program@>@;
      var @<Variables of the program@>@;
    begin @<Print the first |m| prime numbers@>;
    end.

and a pretty-printer for WEB could parse the embedded @<...@> sequences as cross-references, in my XML-based LP system this code scrap would look something like this:

    <scrap file="primes.pas"
           n="Program to print the first thousand prime numbers">
    program print_primes(output);
    const m=1000;
    <ptr target="constants"/>;
    var <ptr target="vars"/>;
    begin <ptr target="print-m-primes"/>
    end.
    </scrap>

In a grammar for Pascal or a similar language, it's not hard to recognize the 'begin ... end' as a block. But I have not found a good way to recognize the 'begin ... end' here as a block, given that what an ixml parser called on the text nodes of this 'scrap' element will see at that point is one text node reading ";
begin
 " and another reading "
end.".

Does anyone have ideas about the best way of using ixml to allow the enrichment of material that is already partially marked up?

Michael

--
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
Received on Sunday, 18 June 2023 15:56:38 UTC