- From: Steven Pemberton <steven.pemberton@cwi.nl>
- Date: Mon, 19 Jun 2023 10:46:29 +0000
- To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, ixml <public-ixml@w3.org>
- Message-Id: <1687170941124.1985676207.2101063659@cwi.nl>
I may be missing what the problem is, but can't you just parse the elements?

I have a basic ixml grammar for XML that I think I used for a paper long ago, and it is not intended to capture all of XML. It may hurt your eyes, so people of a sensitive nature should look away now:

    xml: s, element, s.
    element: -"<", s, name, (attribute)*, (-">", content, -"</", close, -">"; -"/>").
    @name: [L]+, s.
    @close: name.
    attribute: name, -"=", s, value.
    @value: -'"', dchar*, -'"', s; -"'", schar*, -"'", s.
    content: (cchar; element)*.
    -dchar: ~['"'; "<"].
    -schar: ~["'"; "<"].
    -cchar: ~["<"].
    -s: -[" "; #a; #d; #9]*.

It is naturally enough permissive, above all because XML is not context-free, but yields output like this:

Input:

    <test lang="en" class='test'>
       This <em>is</em> a test.
    </test>

Output:

    <xml>
       <element name='test' close='test'>
          <attribute name='lang' value='en'/>
          <attribute name='class' value='test'/>
          <content>
             This
             <element name='em' close='em'>
                <content>is</content>
             </element>
             a test.
          </content>
       </element>
    </xml>
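In the same spirit, here is a quick, untested sketch of the kind of grammar you describe for the name strings ("Q. Lutatius Catulus (7) cos. 102" and the like); the rule, element and attribute names are purely illustrative, and it only knows the two offices that appear in your examples:

    { untested sketch; names are illustrative only }
    person: praenomen?, nomen, cognomen?, re?, office?, year?.
    @praenomen: [L]+, ".", -s.
    @nomen: [L]+, -s.
    @cognomen: [L]+, -s.
    @re: -"(", ["0"-"9"]+, -")", -s.
    @office: ("cos."; "tr. pl."), -s.
    @year: ["0"-"9"]+.
    -s: -[" "]*.

which should give something roughly like

    <person praenomen='Q.' nomen='Lutatius' cognomen='Catulus' re='7' office='cos.' year='102'/>

A grammar of this kind says nothing, of course, about the footnotes interspersed in mixed content, which is the hard part of your question.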
Steven

On Sunday 18 June 2023 17:03:04 (+02:00), C. M. Sperberg-McQueen wrote:

> This morning I find myself thinking again about a problem I have thought
> about before without ever finding a solution. I wonder if anyone who
> reads this list will have a useful idea.
>
> Invisible XML is very helpful for the situations where I need to discern
> the structure of unmarked text and represent it in XML. This is true
> whether the unmarked text is in a file by itself, or forms the text-data
> content of an XML element.
>
> If I have an ixml parser available in XSLT or XQuery I can call the
> parser on the text node of the containing element, or its string value;
> if I don't, I can write a stylesheet to dump the relevant text nodes to
> files. And I can replace the containing element with the XML element
> produced by the parser, or keep both versions within a containing
> element. So I may have a formula in a logic text written as
>
>    ascii transcription of formula
>
> or with multiple parallel representations of the formula gathered in a
> formula-group element:
>
>    ascii transcription of formula
>    XML representation of formula in XML produced by ixml parser
>    ...
>    ...
>
> So far, so good.
>
> But sometimes what I need to parse is not PCDATA content but mixed
> content: a mixture of text nodes and XML elements. (And, in practice,
> also XML comments and processing instructions.)
>
> For example, I once wrote code to recognize names of people in a
> database of information about Roman legal disputes (Trials in the Late
> Roman Republic 149-50 BC). It's easy enough to write a grammar to
> recognize strings like
>
>    Q. Lutatius Catulus (7) cos. 102
>    Sex. Lucilius (15) tr. pl. 87
>
> and parse them into praenomen, nomen, cognomen, Realenzyklopädie-number
> (the Quintus Lutatius Catulus mentioned here is the one described by the
> seventh article under that name in Pauly and Wissowa's Realenzyklopädie),
> and highest office attained plus date of that office. And it's possible
> to recognize a series of such names and parse each of them.
>
> But since our information about Roman history is sometimes complicated
> and requires clarification, sometimes what I had to parse was a sequence
> of names with footnotes interspersed, like a 'defendant' element reading:
>
>    (L. or Q.?) Hortensius (2) cos. des.? Since a magistrate in
>    office could not be prosecuted, it seems likely that he was convicted
>    before taking office. See Atkinson (1960) 462 n. 108; Swan (1966)
>    239-40; and Weinrib (1971) 145 n. 1. 108
>
> Second example: In a literate programming system, it is sometimes
> desired to pretty-print the source code -- in his WEB system, for
> example, Knuth devotes a lot of effort to typesetting Pascal code in the
> style pioneered for Algol by Peter Naur and various ACM publications.
> Some LP systems -- like Knuth's WEB and the later CWEB -- support only a
> single programming language, because the system includes a parser (of
> sorts) and typesetter for the language. Some polyglot literate
> programming systems eschew pretty-printing entirely and display the
> source code letter by letter as it appears in the source, so it looks
> very much like what the programmer was used to seeing in vi or emacs, in
> the days before those editors supported syntax highlighting. Others,
> like Norman Ramsey's noweb, support pretty-printing by supplying
> language-specific pretty-printing filters which handle code in a given
> language, and by allowing users to supply their own pretty-printing
> filters to support new languages or to change the styling of typeset
> code.
>
> Obviously, it's possible to parse fragments of source code in order to
> recognize their structure and typeset them suitably, styling keywords
> and variable names differently, and so on. And it's possible to extend
> the language grammar to deal with the fact that one or more statements,
> or a condition, or some other bit of code may be replaced by a reference
> to another code scrap in which that code is given.
>
> But in my literate programming system, cross references to other code
> scraps are tagged as XML elements. So where Knuth might write
>
>    @<Program to print the first thousand prime numbers@>=
>    program print_primes(output);
>      const @!m=1000;
>      @<...@>@;
>      var @<...@>@;
>      begin @<...@>;
>      end.
>
> and a pretty-printer for WEB could parse the embedded @<...@> sequences
> as cross-references, in my XML-based LP system this code scrap would
> look something like this:
>
>    <scrap
>       n="Program to print the first thousand prime numbers">
>    program print_primes(output);
>      const m=1000;
>      ;
>      var ;
>      begin
>      end.
>    </scrap>
>
> In a grammar for Pascal or a similar language, it's not hard to
> recognize the 'begin ... end' as a block. But I have not found a good
> way to recognize the 'begin ... end' here as a block, given that what an
> ixml parser called on the text nodes of this 'scrap' element will see at
> that point is one text node reading "; begin " and another
> reading " end.".
>
> Does anyone have ideas about the best way of using ixml to allow the
> enrichment of material that is already partially marked up?
>
> Michael
Received on Monday, 19 June 2023 10:46:44 UTC