- From: Steven Pemberton <steven.pemberton@cwi.nl>
- Date: Mon, 19 Jun 2023 10:46:29 +0000
- To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, ixml <public-ixml@w3.org>
- Message-Id: <1687170941124.1985676207.2101063659@cwi.nl>
I may be missing what the problem is, but can't you just parse the elements?
I have a basic ixml grammar for XML that I think I used for a paper long ago, and it is not intended to capture all of XML. It may hurt your eyes, so people of a sensitive nature should look away now:
xml: s, element, s.
element: -"<", s, name, (attribute)*, (-">", content, -"</", close, -">"; -"/>").
@name: [L]+, s.
@close: name.
attribute: name, -"=", s, value.
@value: -'"', dchar*, -'"', s; -"'", schar*, -"'", s.
content: (cchar; element)*.
-dchar: ~['"'; "<"].
-schar: ~["'"; "<"].
-cchar: ~["<"].
-s: -[" "; #a; #d; #9]*.
It is naturally enough permissive, above all because XML is not context-free, but yields output like
Input:
<test lang="en" class='test'>
This <em>is</em> a test.
</test>
Output:
<xml>
<element name='test' close='test'>
<attribute name='lang' value='en'/>
<attribute name='class' value='test'/>
<content>
This
<element name='em' close='em'>
<content>is</content>
</element> a test.
</content>
</element>
</xml>
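
If what you need to parse is a fragment of mixed content rather than a
whole document, my guess is that it is enough to relax the start rule to

xml: content.

so that, say, an input of

begin <ref name="x"/>
end.

comes out as character data with an element subtree in the middle,
something like

<xml>begin <element name='ref'><attribute name='name' value='x'/></element>
end.</xml>

though I haven't tried that on your examples; "ref" here is just a
stand-in for whatever the real cross-reference elements are called.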
Steven
On Sunday 18 June 2023 17:03:04 (+02:00), C. M. Sperberg-McQueen wrote:
> This morning I find myself thinking again about a problem I have thought
> about before without ever finding a solution. I wonder if anyone who
> reads this list will have a useful idea.
>
> Invisible XML is very helpful for the situations where I need to discern
> the structure of unmarked text and represent it in XML. This is true
> whether the unmarked text is in a file by itself, or forms the text-data
> content of an XML element.
>
> If I have an ixml parser available in XSLT or XQuery I can call the
> parser on the text node of the containing element, or its string value;
> if I don't, I can write a stylesheet to dump the relevant text nodes to
> files. And I can replace the containing element with the XML element
> produced by the parser, or keep both versions within a containing
> element. So I may have a formula in a logic text written as
>
>    ascii transcription of formula
>
> or with multiple parallel representations of the formula gathered in a
> formula-group element:
>
>    ascii transcription of formula
>    XML representation of formula in XML produced by ixml parser
>    ...
>
> So far, so good.
>
> But sometimes what I need to parse is not PCDATA content but mixed
> content: a mixture of text nodes and XML elements. (And, in practice,
> also XML comments and processing instructions.)
>
> For example, I once wrote code to recognize names of people in a
> database of information about Roman legal disputes (Trials in the Late
> Roman Republic 149-50 BC). It's easy enough to write a grammar to
> recognize strings like
>
> Q. Lutatius Catulus (7) cos. 102
> Sex. Lucilius (15) tr. pl. 87
>
> and parse them into praenomen, nomen, cognomen, Realenzyklopädie-number
> (the Quintus Lutatius Catulus mentioned here is the one described by the
> seventh article under that name in Pauly and Wissowa's
> Realenzyklopädie), and highest office attained plus date of that office.
> And it's possible to recognize a series of such names and parse each of
> them.
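>
> (Untested and much simplified, but a grammar along these lines is the
> sort of thing I mean:
>
>    name: praenomen, nomen, cognomen?, re, office, date.
>    @praenomen: [Lu], [Ll]*, -". ".
>    @nomen: [Lu], [Ll]+, -" ".
>    @cognomen: [Lu], [Ll]+, -" ".
>    @re: -"(", [Nd]+, -") ".
>    @office: title, (" ", title)*, -" ".
>    -title: [Ll]+, ".".
>    @date: [Nd]+.
>
> which would turn the first string above into something like
>
>    <name praenomen='Q' nomen='Lutatius' cognomen='Catulus'
>          re='7' office='cos.' date='102'/>.)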
>
> But since our information about Roman history is sometimes complicated
> and requires clarification, sometimes what I had to parse was a sequence
> of names with footnotes interspersed, like a 'defendant' element reading:
>
>    (L. or Q.?) Hortensius (2) cos. des.? Since a magistrate in
>    office could not be prosecuted, it seems likely that he was convicted
>    before taking office. See Atkinson (1960) 462 n. 108; Swan (1966)
>    239-40; and Weinrib (1971) 145 n. 1. 108
>
> Second example: In a literate programming system, it is sometimes
> desired to pretty-print the source code -- in his WEB system, for
> example, Knuth devotes a lot of effort to typesetting Pascal code in the
> style pioneered for Algol by Peter Naur and various ACM publications.
> Some LP systems -- like Knuth's WEB and the later CWEB -- support only a
> single programming language, because the system includes a parser (of
> sorts) and a typesetter for the language. Polyglot literate programming
> systems often eschew pretty-printing entirely and display the source
> code letter by letter as it appears in the source, so it looks very much
> like what the programmer was used to seeing in vi or emacs, in the days
> before those editors supported syntax highlighting. Other LP systems,
> like Norman Ramsey's noweb, support pretty-printing by supplying
> language-specific pretty-printing filters which handle code in a given
> language, and by allowing users to supply their own pretty-printing
> filters to support new languages or to change the styling of typeset code.
>
> Obviously, it's possible to parse fragments of source code in order to
> recognize their structure and typeset them suitably, styling keywords
> and variable names differently, and so on. And it's possible to extend
> the language grammar to deal with the fact that one or more statements,
> or a condition, or some other bit of code may be replaced by a reference
> to another code scrap in which that code is given.
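>
> (In ixml terms that just means giving the statement rule one more
> alternative; an untested sketch, with made-up rule names:
>
>    statements: statement++(-";", s).
>    -statement: assignment; scrap-ref.
>    assignment: @var, -" := ", @expr.
>    @var: [L; Nd; "_"]+.
>    @expr: ~[";"]+.
>    scrap-ref: -"@<", @name, -"@>".
>    @name: ~["@"]+.
>    -s: -[" "; #a; #d; #9]*.
>
> so that a reference to another scrap can stand wherever a statement
> can.)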
>
> But in my literate programming system, cross references to other code
> scraps are tagged as XML elements. So where Knuth might write
>
>    @<Program to print the first thousand prime numbers@>=
>    program print_primes(output);
>      const @!m=1000;
>      @<Other constants of the program@>@;
>      var @<Variables of the program@>@;
>      begin @<Print the first m prime numbers@>;
>      end.
>
> and a pretty-printer for WEB could parse the embedded @<...@> sequences
> as cross-references, in my XML-based LP system this code scrap would
> look something like this:
>
>    <scrap n="Program to print the first thousand prime numbers">
>    program print_primes(output);
>    const m=1000;
>    <scrap-ref n="Other constants of the program"/>;
>    var <scrap-ref n="Variables of the program"/>;
>    begin <scrap-ref n="Print the first m prime numbers"/>
>    end.
>    </scrap>
>
> In a grammar for Pascal or a similar language, it's not hard to
> recognize the 'begin ... end' as a block. But I have not found a good
> way to recognize the 'begin ... end' here as a block, given that what an
> ixml parser called on the text nodes of this 'scrap' element will see at
> that point is one text node reading "; begin " and another
> reading " end.".
>
>
> Does anyone have ideas about the best way of using ixml to allow the
> enrichment of material that is already partially marked up?
>
> Michael
>
Received on Monday, 19 June 2023 10:46:44 UTC