Re: Ambiguity (what else!?) question

Hey David,


Your description is a little under-specified, but off the top of my head, and without looking at what others have replied:


I'm assuming a structure like this (I'm ignoring the multiple blank lines, and just using a single one):


 novel: chapter+.
 chapter: chapterbegin, paragraph+, subsection*.


As you say, a chapter heading is easy to specify: chapters begin with a roman numeral, a dot, a space, and an upper-case title:


 -chapterbegin: chaptertitle, blankline.
 chaptertitle: roman, -". ", uppercasetitle.
 roman: ["IVXLCM"]+.
 -blankline: -#a.


An upper-case title may include spaces and a few punctuation marks:


 uppercasetitle: (uppercase; " "; punctuation)+, -#a.
 -uppercase: ["A"-"Z"].
 -punctuation: [",.;:"].


A subtitle is otherwise the same, but may not begin with a roman numeral followed by a dot and a space:


 subsection: subtitle, blankline, paragraph+.
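 { The character after the numeral must not be a roman letter, or "II. " could be split as "I", "I", ". ". }
 subtitle: ["ABDEFGHJKNOPQRSTUWYZ"], -uppercasetitle;
                ["IVXLCM"]+, (~[".IVXLCM"]; ".", ~[" "]), -uppercasetitle.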
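
For comparison, the same distinction can be sketched outside ixml. The Python below is only an approximation of the two heading rules, not a translation of them: the names classify_heading, CHAPTER_TITLE and UPPERCASE_LINE are made up for illustration, and the patterns are applied to one line at a time, without its newline.

```python
import re

# Characters allowed in a title: upper-case letters, space, and , . ; :
TITLE_CHARS = r"[A-Z ,.;:]+"

# Chapter title: roman numeral, dot, space, upper-case title.
CHAPTER_TITLE = re.compile(rf"(?P<roman>[IVXLCM]+)\. (?P<title>{TITLE_CHARS})")

# Any line made up only of title characters.
UPPERCASE_LINE = re.compile(TITLE_CHARS)


def classify_heading(line: str) -> str:
    """Return 'chaptertitle', 'subtitle', or 'other' for a single line."""
    if CHAPTER_TITLE.fullmatch(line):
        return "chaptertitle"
    if UPPERCASE_LINE.fullmatch(line):
        return "subtitle"
    return "other"
```

Note the difference in approach: the Python simply tries the chapter pattern first, whereas ixml has no ordered choice, so the subtitle rule has to exclude the chapter pattern itself.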


As far as I can make out, the only thing that distinguishes a line of text is that it contains a lowercase letter somewhere:


 paragraph: line+, -#a. 
 line: (uppercase; " "; punctuation)*, ["a"-"z"], ~[#a]*, -#a.
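
As a quick way to check which lines this rule accepts, here is a rough regular-expression equivalent in Python. It is a sketch, assuming each line is tested on its own without its newline; TEXT_LINE and is_text_line are names invented for the example.

```python
import re

# Analogue of the `line` rule: a prefix of upper-case letters, spaces and
# the listed punctuation, then one lower-case letter, then anything at all.
TEXT_LINE = re.compile(r"[A-Z ,.;:]*[a-z].*")


def is_text_line(line: str) -> bool:
    """True if the line would be accepted as ordinary narrative text."""
    return TEXT_LINE.fullmatch(line) is not None
```

One consequence worth knowing: everything before the first lowercase letter has to be an upper-case letter, a space, or one of the four punctuation marks, so a line that opens with a quotation mark or a digit is rejected. For real text the punctuation set would probably need extending.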




With this text:


I. CHAPTER I


Should I set out on my autocycle?


This was the question
with which I began. I had a methodical mind and never set out on a
mission without prolonged reflection as to the best way of setting
out.


SUBSECTION


It was the first problem to solve, at the outset of each
enquiry, and I never moved until I had solved it to my
satisfaction. 


II. CHAPTER II


Sometimes I took my autocycle, sometimes the train,
sometimes the motor-coach, just as sometimes too I left on foot, or on
my bicycle, silently, in the night.


For when you are beset with
enemies, as I am, you cannot leave on your autocycle, even in the
night, without being noticed, unless you employ it as an ordinary
bicycle, which is absurd.




I get:


<novel>
   <chapter>
      <chaptertitle>
         <roman>I</roman>
         <uppercasetitle>CHAPTER I</uppercasetitle>
      </chaptertitle>
      <paragraph>
         <line>Should I set out on my autocycle?</line>
      </paragraph>
      <paragraph>
         <line>This was the question</line>
         <line>with which I began. I had a methodical mind and never set out on a</line>
         <line>mission without prolonged reflection as to the best way of setting</line>
         <line>out.</line>
      </paragraph>
      <subsection>
         <subtitle>SUBSECTION</subtitle>
         <paragraph>
            <line>It was the first problem to solve, at the outset of each</line>
            <line>enquiry, and I never moved until I had solved it to my</line>
            <line>satisfaction. </line>
         </paragraph>
      </subsection>
   </chapter>
   <chapter>
      <chaptertitle>
         <roman>II</roman>
         <uppercasetitle>CHAPTER II</uppercasetitle>
      </chaptertitle>
      <paragraph>
         <line>Sometimes I took my autocycle, sometimes the train,</line>
         <line>sometimes the motor-coach, just as sometimes too I left on foot, or on</line>
         <line>my bicycle, silently, in the night.</line>
      </paragraph>
      <paragraph>
         <line>For when you are beset with</line>
         <line>enemies, as I am, you cannot leave on your autocycle, even in the</line>
         <line>night, without being noticed, unless you employ it as an ordinary</line>
         <line>bicycle, which is absurd.</line>
      </paragraph>
   </chapter>
</novel>


If you want to get rid of the <line> elements, you'll have to add a space at the end of each line in place of the newline, so that words don't run into each other:


 -line: (uppercase; " "; punctuation)*, ["a"-"z"], ~[#a]*, -#a, +" ".
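
The effect of that insertion on a paragraph's content can be imitated in a couple of lines of Python. This is only an illustration of the resulting text, not of how an ixml processor works; join_lines is a made-up name.

```python
def join_lines(lines: list[str]) -> str:
    """Concatenate lines, appending one space to each in place of its newline."""
    return "".join(line + " " for line in lines)
```

As with the grammar, the last line gets a space too, so the paragraph's content ends with a trailing space.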


Steven

On Monday 27 January 2025 23:34:29 (+01:00), David Birnbaum wrote:


Dear ixml list,


I'm using ixml to tag a plain-text novel where chapters begin with a roman numeral, a dot, a space, and an upper-case title (which may include spaces and a few punctuation marks), e.g.:


VI. FAKE TITLE FOR CHAPTER SIX


That pattern is easy to model; the issue is that my model for a line of regular narrative text (which may contain all of those characters and more) overlaps with it. That is, a line of regular text may include all of the characters allowed in a chapter-title line, except that a line of regular text never begins with something that matches a roman numeral followed by a dot.


To make matters more complicated, there is one embedded subsection, within a chapter, that has a sub-title that is all upper-case, but without the leading roman numeral, along the lines of:


FAKE HEADING FOR SUBSECTION EMBEDDED INSIDE NUMBERED CHAPTER


Chapter-title lines are preceded by four newlines and followed by two, which is a pattern that I might have been able to use except that it is also the case with the embedded-subsection title line.


I can get from plain text to XML with pipelining (for that matter, I can do it with a pure-XSLT pipeline) because with pipelining I can tag just the chapter-title lines first and then go back and tag the rest, having taken the chapter-heading lines out of consideration on the first pass. And with ixml if I rely on the four newlines before and two newlines after both chapter-title lines and the subsection title line I can tag all of those the same way and then patch up the incorrect tagging of the embedded subsection with a separate, subsequent XSLT step. But …


In the interest of learning How To Do Stuff with ixml, I'd like to understand whether it's possible to write an unambiguous ixml grammar to tag the document in a way that recognizes chapter-heading lines and does not confuse them with either regular text lines or the annoying embedded subsection header line. Is there an ixml idiom for this that I haven't learned yet, or am I asking ixml to do something it isn't designed to do?


Thanks in advance for any clarification!


Best,


David (djbpitt@gmail.com)

Received on Wednesday, 29 January 2025 13:42:00 UTC