- From: Steven Pemberton <steven.pemberton@cwi.nl>
- Date: Mon, 19 Jun 2023 10:46:29 +0000
- To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, ixml <public-ixml@w3.org>
- Message-Id: <1687170941124.1985676207.2101063659@cwi.nl>
I may be missing what the problem is, but can't you just parse the elements?
I have a basic ixml grammar for XML that I think I used for a paper long ago, and it is not intended to capture all of XML. It may hurt your eyes, so people of a sensitive nature should look away now:
xml: s, element, s.
element: -"<", s, name, (attribute)*, (-">", content, -"</", close, -">"; -"/>").
@name: [L]+, s.
@close: name.
attribute: name, -"=", s, value.
@value: -'"', dchar*, -'"', s; -"'", schar*, -"'", s.
content: (cchar; element)*.
-dchar: ~['"'; "<"].
-schar: ~["'"; "<"].
-cchar: ~["<"].
-s: -[" "; #a; #d; #9]*.
It is naturally enough permissive, above all because XML is not context-free, but yields output like
Input:
<test lang="en" class='test'>
This <em>is</em> a test.
</test>
Output:
<xml>
<element name='test' close='test'>
<attribute name='lang' value='en'/>
<attribute name='class' value='test'/>
<content>
This
<element name='em' close='em'>
<content>is</content>
</element> a test.
</content>
</element>
</xml>
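
If what you need to parse is a fragment of mixed content rather than a
whole document, my guess is that it is enough to relax the start rule to

xml: content.

so that, say, an input of

begin <ref name="x"/>
end.

comes out as character data with an element subtree in the middle,
something like

<xml>begin <element name='ref'><attribute name='name' value='x'/></element>
end.</xml>

though I haven't tried that on your examples; "ref" here is just a
stand-in for whatever the real cross-reference elements are called.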
Steven
On Sunday 18 June 2023 17:03:04 (+02:00), C. M. Sperberg-McQueen wrote:
> This morning I find myself thinking again about a problem I have thought
> about before without ever finding a solution. I wonder if anyone who
> reads this list will have a useful idea.
>
> Invisible XML is very helpful for the situations where I need to discern
> the structure of unmarked text and represent it in XML. This is true
> whether the unmarked text is in a file by itself, or forms the text-data
> content of an XML element.
>
> If I have an ixml parser available in XSLT or XQuery I can call the
> parser on the text node of the containing element, or its string value;
> if I don't, I can write a stylesheet to dump the relevant text nodes to
> files. And I can replace the containing element with the XML element
> produced by the parser, or keep both versions within a containing
> element. So I may have a formula in a logic text written as
>
>    ascii transcription of formula
>
> or with multiple parallel representations of the formula gathered in a
> formula-group element:
>
>    ascii transcription of formula
>    XML representation of formula in XML produced by ixml parser
>    ...
>
> So far, so good.
>
> But sometimes what I need to parse is not PCDATA content but mixed
> content: a mixture of text nodes and XML elements. (And, in practice,
> also XML comments and processing instructions.)
>
> For example, I once wrote code to recognize names of people in a
> database of information about Roman legal disputes (Trials in the Late
> Roman Republic 149-50 BC). It's easy enough to write a grammar to
> recognize strings like
>
> Q. Lutatius Catulus (7) cos. 102
> Sex. Lucilius (15) tr. pl. 87
>
> and parse them into praenomen, nomen, cognomen, Realenzyklopädie-number
> (the Quintus Lutatius Catulus mentioned here is the one described by the
> seventh article under that name in Pauly and Wissowa's
> Realenzyklopädie), and highest office attained plus date of that office.
> And it's possible to recognize a series of such names and parse each of
> them.
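>
> (Untested and much simplified, but a grammar along these lines is the
> sort of thing I mean:
>
>    name: praenomen, nomen, cognomen?, re, office, date.
>    @praenomen: [Lu], [Ll]*, -". ".
>    @nomen: [Lu], [Ll]+, -" ".
>    @cognomen: [Lu], [Ll]+, -" ".
>    @re: -"(", [Nd]+, -") ".
>    @office: title, (" ", title)*, -" ".
>    -title: [Ll]+, ".".
>    @date: [Nd]+.
>
> which would turn the first string above into something like
>
>    <name praenomen='Q' nomen='Lutatius' cognomen='Catulus'
>          re='7' office='cos.' date='102'/>.)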
>
> But since our information about Roman history is sometimes complicated
> and requires clarification, sometimes what I had to parse was a sequence
> of names with footnotes interspersed, like a 'defendant' element reading:
>
>    (L. or Q.?) Hortensius (2) cos. des.? Since a magistrate in
>    office could not be prosecuted, it seems likely that he was convicted
>    before taking office. See Atkinson (1960) 462 n. 108; Swan (1966)
>    239-40; and Weinrib (1971) 145 n. 1. 108
>
> Second example: In a literate programming system, it is sometimes
> desired to pretty-print the source code -- in his WEB system, for
> example, Knuth devotes a lot of effort to typesetting Pascal code in the
> style pioneered for Algol by Peter Naur and various ACM publications.
> Some LP systems -- like Knuth's WEB and the later CWEB -- support only a
> single programming language, because the system includes a parser (of
> sorts) and a typesetter for the language. Polyglot literate programming
> systems often eschew pretty-printing entirely and display the source
> code letter by letter as it appears in the source, so it looks very much
> like what the programmer was used to seeing in vi or emacs, in the days
> before those editors supported syntax highlighting. Other LP systems,
> like Norman Ramsey's noweb, support pretty-printing by supplying
> language-specific pretty-printing filters which handle code in a given
> language, and by allowing users to supply their own pretty-printing
> filters to support new languages or to change the styling of typeset code.
>
> Obviously, it's possible to parse fragments of source code in order to
> recognize their structure and typeset them suitably, styling keywords
> and variable names differently, and so on. And it's possible to extend
> the language grammar to deal with the fact that one or more statements,
> or a condition, or some other bit of code may be replaced by a reference
> to another code scrap in which that code is given.
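>
> (In ixml terms that just means giving the statement rule one more
> alternative; an untested sketch, with made-up rule names:
>
>    statements: statement++(-";", s).
>    -statement: assignment; scrap-ref.
>    assignment: @var, -" := ", @expr.
>    @var: [L; Nd; "_"]+.
>    @expr: ~[";"]+.
>    scrap-ref: -"@<", @name, -"@>".
>    @name: ~["@"]+.
>    -s: -[" "; #a; #d; #9]*.
>
> so that a reference to another scrap can stand wherever a statement
> can.)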
>
> But in my literate programming system, cross references to other code
> scraps are tagged as XML elements. So where Knuth might write
>
>    @<Program to print the first thousand prime numbers@>=
>    program print_primes(output);
>      const @!m=1000;
>      @<Other constants of the program@>@;
>      var @<Variables of the program@>@;
>      begin @<Print the first m prime numbers@>;
>      end.
>
> and a pretty-printer for WEB could parse the embedded @<...@> sequences
> as cross-references, in my XML-based LP system this code scrap would
> look something like this:
>
>    <scrap n="Program to print the first thousand prime numbers">
>    program print_primes(output);
>    const m=1000;
>    <scrap-ref n="Other constants of the program"/>;
>    var <scrap-ref n="Variables of the program"/>;
>    begin <scrap-ref n="Print the first m prime numbers"/>
>    end.
>    </scrap>
>
> In a grammar for Pascal or a similar language, it's not hard to
> recognize the 'begin ... end' as a block. But I have not found a good
> way to recognize the 'begin ... end' here as a block, given that what an
> ixml parser called on the text nodes of this 'scrap' element will see at
> that point is one text node reading "; begin " and another
> reading " end.".
>
>
> Does anyone have ideas about the best way of using ixml to allow the
> enrichment of material that is already partially marked up?
>
> Michael
>
Received on Monday, 19 June 2023 10:46:44 UTC