- From: Steven Pemberton <steven.pemberton@cwi.nl>
- Date: Wed, 29 Jan 2025 13:41:53 +0000
- To: public-ixml@w3.org
- Message-Id: <1738153942649.590815187.4080044715@cwi.nl>
Hey David,

Your description is a little under-specified, but off the top of my head, and without looking at what others have replied:

I'm assuming a structure like this (I'm ignoring the multiple blank lines, and just using a single one):

    novel: chapter+.
    chapter: chapterbegin, paragraph+, subsection*.

As you say, a chapter heading is easy to specify: chapters begin with a roman numeral, a dot, a space, and an upper-case title:

    -chapterbegin: chaptertitle, blankline.
    chaptertitle: roman, -". ", uppercasetitle.
    roman: ["IVXLCM"]+.
    -blankline: -#a.

An upper-case title may include spaces and a few punctuation marks:

    uppercasetitle: (uppercase; " "; punctuation)+, -#a.
    -uppercase: ["A"-"Z"].
    -punctuation: [",.;:"].

A subtitle may not begin with a roman numeral followed by a dot, but is otherwise the same:

    subsection: subtitle, blankline, paragraph+.
    subtitle: ["ABDEFGHJKNOPQRSTUWYZ"], -uppercasetitle;
              ["IVXLCM"]+, (~["."]; ".", ~[" "]), -uppercasetitle.

As far as I can make out, the only thing that distinguishes a line of text is that it contains a lowercase letter somewhere:

    paragraph: line+, -#a.
    line: (uppercase; " "; punctuation)*, ["a"-"z"], ~[#a]*, -#a.

With text

    I. CHAPTER I

    Should I set out on my autocycle?

    This was the question
    with which I began. I had a methodical mind and never set out on a
    mission without prolonged reflection as to the best way of setting
    out.

    SUBSECTION

    It was the first problem to solve, at the outset of each
    enquiry, and I never moved until I had solved it to my
    satisfaction.

    II. CHAPTER II

    Sometimes I took my autocycle, sometimes the train,
    sometimes the motor-coach, just as sometimes too I left on foot, or on
    my bicycle, silently, in the night.

    For when you are beset with
    enemies, as I am, you cannot leave on your autocycle, even in the
    night, without being noticed, unless you employ it as an ordinary
    bicycle, which is absurd.
I get

    <novel>
       <chapter>
          <chaptertitle>
             <roman>I</roman>
             <uppercasetitle>CHAPTER I</uppercasetitle>
          </chaptertitle>
          <paragraph>
             <line>Should I set out on my autocycle?</line>
          </paragraph>
          <paragraph>
             <line>This was the question</line>
             <line>with which I began. I had a methodical mind and never set out on a</line>
             <line>mission without prolonged reflection as to the best way of setting</line>
             <line>out.</line>
          </paragraph>
          <subsection>
             <subtitle>SUBSECTION</subtitle>
             <paragraph>
                <line>It was the first problem to solve, at the outset of each</line>
                <line>enquiry, and I never moved until I had solved it to my</line>
                <line>satisfaction. </line>
             </paragraph>
          </subsection>
       </chapter>
       <chapter>
          <chaptertitle>
             <roman>II</roman>
             <uppercasetitle>CHAPTER II</uppercasetitle>
          </chaptertitle>
          <paragraph>
             <line>Sometimes I took my autocycle, sometimes the train,</line>
             <line>sometimes the motor-coach, just as sometimes too I left on foot, or on</line>
             <line>my bicycle, silently, in the night.</line>
          </paragraph>
          <paragraph>
             <line>For when you are beset with</line>
             <line>enemies, as I am, you cannot leave on your autocycle, even in the</line>
             <line>night, without being noticed, unless you employ it as an ordinary</line>
             <line>bicycle, which is absurd.</line>
          </paragraph>
       </chapter>
    </novel>

If you want to get rid of the <line> elements, you'll have to add a space at the end of each line to replace the newline, so that words don't run into each other:

    -line: (uppercase; " "; punctuation)*, ["a"-"z"], ~[#a]*, -#a, +" ".

Steven

On Monday 27 January 2025 23:34:29 (+01:00), David Birnbaum wrote:

Dear ixml list,

I'm using ixml to tag a plain-text novel where chapters begin with a roman numeral, a dot, a space, and an upper-case title (which may include spaces and a few punctuation marks), e.g.:

    VI. FAKE TITLE FOR CHAPTER SIX

That pattern is easy to model; the issue is that my model for a line of regular narrative text (which may contain all of those characters and more) overlaps with it.
That is, a line of regular text may include all of the characters allowed in a chapter-title line, except that a line of regular text never begins with something that matches a roman numeral followed by a dot.

To make matters more complicated, there is one embedded subsection, within a chapter, that has a sub-title that is all upper-case, but without the leading roman numeral, along the lines of:

    FAKE HEADING FOR SUBSECTION EMBEDDED INSIDE NUMBERED CHAPTER

Chapter-title lines are preceded by four newlines and followed by two, which is a pattern that I might have been able to use except that it is also the case with the embedded-subsection title line.

I can get from plain text to XML with pipelining (for that matter, I can do it with a pure-XSLT pipeline) because with pipelining I can tag just the chapter-title lines first and then go back and tag the rest, having taken the chapter-heading lines out of consideration on the first pass. And with ixml if I rely on the four newlines before and two newlines after both chapter-title lines and the subsection title line I can tag all of those the same way and then patch up the incorrect tagging of the embedded subsection with a separate, subsequent XSLT step.

But …

In the interest of learning How To Do Stuff with ixml, I'd like to understand whether it's possible to write an unambiguous ixml grammar to tag the document in a way that recognizes chapter-heading lines and does not confuse them with either regular text lines or the annoying embedded subsection header line. Is there an ixml idiom for this that I haven't learned yet, or am I asking ixml to do something it isn't designed to do?

Thanks in advance for any clarification!

Best,

David (djbpitt@gmail.com)
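
The three kinds of line discussed in this thread (chapter title, subsection title, ordinary text line) can be sanity-checked outside ixml before writing a grammar. The following is only a rough Python sketch and is not part of either message: the function name `classify`, the regular expressions, and the idea of trying the rules in a fixed order are all assumptions made for illustration. ixml itself has no rule priorities, so this approximates the prose descriptions above rather than reproducing how an ixml processor parses.

```python
import re

# Characters allowed in an upper-case title, per the thread:
# uppercase letters, space, and the punctuation marks , . ; :
TITLE_CHARS = r"[A-Z ,.;:]"

# Chapter title: roman numeral, dot, space, upper-case title.
CHAPTER_TITLE = re.compile(rf"[IVXLCM]+\. {TITLE_CHARS}+")

# Subtitle: an upper-case title that does NOT start with a roman
# numeral followed by a dot and a space (negative lookahead).
SUBTITLE = re.compile(rf"(?![IVXLCM]+\. ){TITLE_CHARS}+")

# Text line: optional title characters, then a lowercase letter,
# then anything up to the end of the line.
TEXT_LINE = re.compile(rf"{TITLE_CHARS}*[a-z].*")


def classify(line: str) -> str:
    """Return the name of the first pattern that matches the whole line."""
    for name, pattern in (
        ("chaptertitle", CHAPTER_TITLE),
        ("subtitle", SUBTITLE),
        ("line", TEXT_LINE),
    ):
        if pattern.fullmatch(line):
            return name
    return "unmatched"


if __name__ == "__main__":
    for sample in (
        "VI. FAKE TITLE FOR CHAPTER SIX",
        "FAKE HEADING FOR SUBSECTION EMBEDDED INSIDE NUMBERED CHAPTER",
        "Should I set out on my autocycle?",
    ):
        print(f"{classify(sample):12} {sample}")
```

Running a novel's lines through a check like this is a quick way to spot lines that fall into none of the three classes, or headings that start with roman-numeral letters (such as a title beginning with "MIXED"), before debugging the grammar itself.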
Received on Wednesday, 29 January 2025 13:42:00 UTC