using ixml with mixed content - a design problem

This morning I find myself thinking again about a problem I have thought
about before without ever finding a solution.  I wonder if anyone who
reads this list will have a useful idea.

Invisible XML is very helpful for the situations where I need to discern
the structure of unmarked text and represent it in XML.  This is true
whether the unmarked text is in a file by itself, or forms the text-data
content of an XML element.
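
(For concreteness, a minimal example: given a grammar like

  date: day, -" ", month, -" ", year.
  day: ["0"-"9"]+.
  month: ["A"-"Z"], ["a"-"z"]+.
  year: ["0"-"9"]+.

an ixml processor turns the string "18 June 2023" into

  <date><day>18</day><month>June</month><year>2023</year></date>

with the separator blanks suppressed by the "-" marks.)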

If I have an ixml parser available in XSLT or XQuery, I can call the
parser on the text node of the containing element, or on its string
value; if I don't, I can write a stylesheet to dump the relevant text
nodes to files and parse them separately.  I can then replace the
containing element with the XML element produced by the parser, or keep
both versions side by side in a wrapper element.  So I may have a
formula in a logic text written as

    <formula>ascii transcription of formula</formula>

or with multiple parallel representations of the formula gathered in a
formula-group element:

    <formula-group>
      <ascii>ascii transcription of formula</ascii>
      <vxml>XML representation of the formula, as produced by the
      ixml parser</vxml>
      <mathml>...</mathml>
      ...
    </formula-group>
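
In XQuery, for instance, the second arrangement could be produced along
these lines -- a sketch only, assuming a processor which implements the
invisible-xml() function proposed for XPath 4.0, and with an invented
name for the grammar file:

  let $parse := invisible-xml(unparsed-text("formula.ixml"))
  for $f in //formula
  return
    <formula-group>{
      <ascii>{ string($f) }</ascii>,
      <vxml>{ $parse(string($f)) }</vxml>
    }</formula-group>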

So far, so good.

But sometimes what I need to parse is not PCDATA content but mixed
content:  a mixture of text nodes and XML elements.  (And, in practice,
also XML comments and processing instructions.)

For example, I once wrote code to recognize names of people in a
database of information about Roman legal disputes (Trials in the Late
Roman Republic, 149-50 BC).  It's easy enough to write a grammar to
recognize strings like

  Q. Lutatius Catulus (7) cos. 102
  Sex. Lucilius (15) tr. pl. 87

and parse them into praenomen, nomen, cognomen, Realenzyklopädie-number
(the Quintus Lutatius Catulus mentioned here is the one described by the
seventh article under that name in Pauly and Wissowa's
Realenzyklopädie), and highest office attained plus date of that office.
And it's possible to recognize a series of such names and parse each of
them.
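
A grammar along these lines will do for the plain strings (a sketch,
not the grammar actually used in the project, and with the list of
offices abbreviated):

  name: praenomen, -" ", nomen, (-" ", cognomen)?,
        -" ", re-number, -" ", office, -" ", date.
  praenomen: letters, ".".
  nomen: letters.
  cognomen: letters.
  re-number: -"(", digits, -")".
  office: "cos."; "tr. pl.".
  date: digits.
  -letters: [L]+.
  -digits: ["0"-"9"]+.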

But because our information about Roman history is sometimes complicated
and requires clarification, what I had to parse was sometimes a sequence
of names with footnotes interspersed, like a 'defendant' element reading:

  <defendant>
    (L. or Q.?) Hortensius (2) cos. des.?<note><p>Since a magistrate in
    office could not be prosecuted, it seems likely that he was convicted
    before taking office.  See Atkinson (1960) 462 n. 108; Swan (1966)
    239-40; and Weinrib (1971) 145 n. 1.</p></note> 108
  </defendant>

Second example: in a literate programming system, it is sometimes
desirable to pretty-print the source code -- in his WEB system, for
example, Knuth devotes a lot of effort to typesetting Pascal code in the
style pioneered for Algol by Peter Naur and various ACM publications.
Some LP systems -- like Knuth's WEB and the later CWEB -- support only a
single programming language, because the system includes a parser (of
sorts) and a typesetter for the language.  Polyglot literate programming
systems either eschew pretty-printing entirely and display the source
code letter by letter as it appears in the source, so that it looks very
much like what the programmer was used to seeing in vi or emacs in the
days before those editors supported syntax highlighting, or else support
pretty-printing by means of language-specific filters.  Norman Ramsey's
noweb, for example, handles each language with a pretty-printing filter
and allows users to supply their own filters to support new languages or
to change the styling of typeset code.

Obviously, it's possible to parse fragments of source code in order to
recognize their structure and typeset them suitably, styling keywords
and variable names differently, and so on.  And it's possible to extend
the language grammar to deal with the fact that one or more statements,
or a condition, or some other bit of code may be replaced by a reference
to another code scrap in which that code is given.
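
For Knuth's textual notation, the extension is simple enough in ixml
terms; something like the following sketch, in which 'statement' stands
in for whatever the statement rule of the language grammar actually is:

  statement: assignment; if-statement; while-statement; scrap-ref.
  scrap-ref: -"@<", scrap-name, -"@>".
  scrap-name: ~["@"]+.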

But in my literate programming system, cross references to other code
scraps are tagged as XML elements.  So where Knuth might write

  @<Program to print...@>=
  program print_primes(output);
  const @!m=1000;
  @<Other constants of the program@>@;
  var @<Variables of the program@>@;
  begin @<Print the first |m| prime numbers@>;
  end.

and a pretty-printer for WEB could parse the embedded @<...@> sequences
as cross-references, in my XML-based LP system this code scrap would
look something like this:

  <scrap file="primes.pas"
         n="Program to print the first thousand prime numbers">
  program print_primes(output);
    const m=1000;
          <ptr target="constants"/>;
    var <ptr target="vars"/>;
  begin
      <ptr target="print-m-primes"/>
  end.
  </scrap>

In a grammar for Pascal or a similar language, it's not hard to
recognize the 'begin ... end' as a block.  But I have not found a good
way to recognize the 'begin ... end' here as a block, given that what an
ixml parser called on the text nodes of this 'scrap' element will see at
that point is one text node reading ";&#xA;begin&#xA;    " and another
reading "&#xA;end.".


Does anyone have ideas about the best way of using ixml to allow the
enrichment of material that is already partially marked up?

Michael

-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
