- From: David Birnbaum <djbpitt@gmail.com>
- Date: Mon, 27 Jan 2025 23:14:20 -0500
- To: Bethan Tovey-Walsh <bytheway@linguacelta.com>
- Cc: ixml <public-ixml@w3.org>
- Message-ID: <CAP4v81r8Ns5xEnacj8cNqrmMYyLQt9oQdco3VTHtW1juJWHpbA@mail.gmail.com>
Dear Bethan (cc public-ixml),

Thank you! Very helpful!

The solution eventually wound up depending on a combination, as your example shows, of newline details before and after plus roman-numeral and other character-specific detail in the title line itself. The newlines distinguish a chapter-title line from a one-line paragraph and the roman numeral distinguishes chapter-title lines from the title line before the annoying subsection.

Best,

David

On Mon, Jan 27, 2025 at 6:51 PM Bethan Tovey-Walsh <bytheway@linguacelta.com> wrote:

> Hi, David,
>
> I can't test this, because I only have my phone with me, but would
> something like the below work? It assumes chapter numbers only go up to 99,
> but could be adapted if this is some hugely lengthy tome.
>
> There are presumably no instances of regular text content with four
> newlines preceding it and two following it? If that's so, and you represent
> your text as a sequence of paragraphs separated by newlines, body text
> shouldn't be confusable with headings.
>
> I hope this helps - and that it's not too horribly full of errors. We need
> an iXML processor for iOS so I can test grammars at unsociable hours!
>
> BTW
>
>
> units = "I", ("I", "I"?)?.
> tens = "X", ("X", "X"?)?.
> subHundred = ("XC" | "X"?, "L" | "L"?, tens), subTen?.
> subTen = "I"?, "V" | "V"?, units | "I", "X".
> roman = subHundred | subTen.
> -newline = {define newlines however works best for you}.
> -headingSep = newline, newline.
> -punct = {define your punctuation characters here}.
> -headWord = ([Lu] | punct)+.
> chapterHead = headingSep, headingSep, roman, ". ", headWord++" ", headingSep.
> subHead = headingSep, headingSep, headWord++" ", headingSep.
>
>
> ****************************************************
>
> Dr. Bethan Tovey-Walsh
>
> linguacelta.com
>
> Golygydd | Editor geirfan.cymru
>
> Croeso i chi ysgrifennu ataf yn y Gymraeg
>
> On 27 Jan 2025, at 22:34, David Birnbaum <djbpitt@gmail.com> wrote:
>
>
> Dear ixml list,
>
> I'm using ixml to tag a plain-text novel where chapters begin with a roman
> numeral, a dot, a space, and an upper-case title (which may include spaces
> and a few punctuation marks), e.g.:
>
> VI. FAKE TITLE FOR CHAPTER SIX
>
> That pattern is easy to model; the issue is that my model for a line of
> regular narrative text (which may contain all of those characters and more)
> overlaps with it. That is, a line of regular text may include all of the
> characters allowed in a chapter-title line, except that a line of regular
> text never begins with something that matches a roman numeral followed by a
> dot.
>
> To make matters more complicated, there is one embedded subsection, within
> a chapter, that has a sub-title that is all upper-case, but without the
> leading roman numeral, along the lines of:
>
> FAKE HEADING FOR SUBSECTION EMBEDDED INSIDE NUMBERED CHAPTER
>
> Chapter-title lines are preceded by four newlines and followed by two,
> which is a pattern that I might have been able to use except that it is
> also the case with the embedded-subsection title line.
>
> I can get from plain text to XML with pipelining (for that matter, I can
> do it with a pure-XSLT pipeline) because with pipelining I can tag just the
> chapter-title lines first and then go back and tag the rest, having taken
> the chapter-heading lines out of consideration on the first pass. And with
> ixml if I rely on the four newlines before and two newlines after both
> chapter-title lines and the subsection title line I can tag all of those
> the same way and then patch up the incorrect tagging of the embedded
> subsection with a separate, subsequent XSLT step. But …
>
> In the interest of learning How To Do Stuff with ixml, I'd like to
> understand whether it's possible to write an unambiguous ixml grammar to
> tag the document in a way that recognizes chapter-heading lines and does
> not confuse them with either regular text lines or the annoying embedded
> subsection header line. Is there an ixml idiom for this that I haven't
> learned yet, or am I asking ixml to do something it isn't designed to do?
>
> Thanks in advance for any clarification!
>
> Best,
>
> David (djbpitt@gmail.com)
Received on Tuesday, 28 January 2025 04:14:35 UTC
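
The grammar quoted in this message was sent untested, so one rough way to sanity-check its logic outside an ixml processor is to transcribe the rules into Python regular expressions. The sketch below is only an illustration, not the grammar either correspondent ended up using. The names `classify_line` and `find_headings` are hypothetical, ASCII capitals stand in for `[Lu]`, and the punctuation set is a guess, since the thread leaves `punct` and `newline` for the reader to define.

```python
import re

# Regex transcription of the roman-numeral rules in the quoted grammar (1 to 99).
UNITS = r"I(?:II?)?"                                  # I, II, III
TENS = r"X(?:XX?)?"                                   # X, XX, XXX
SUB_TEN = rf"(?:I?V|V?{UNITS}|IX)"                    # I through IX
SUB_HUNDRED = rf"(?:XC|X?L|L?{TENS})(?:{SUB_TEN})?"   # X through XCIX
ROMAN = rf"(?:{SUB_HUNDRED}|{SUB_TEN})"

# Assumptions: A-Z stands in for [Lu]; the punctuation set is invented here.
# The dot is deliberately left out so that subHead cannot match "VI. TITLE".
HEAD_WORD = r"[A-Z',\-]+"
TITLE = rf"{HEAD_WORD}(?: {HEAD_WORD})*"

CHAPTER_HEAD = re.compile(rf"{ROMAN}\. {TITLE}")
SUB_HEAD = re.compile(TITLE)

# A heading candidate: one line with four newlines before it and two after.
HEADING_SLOT = re.compile(r"\n{4}([^\n]+)\n{2}")


def classify_line(line: str) -> str:
    """Classify one line by its characters alone, ignoring surrounding newlines."""
    if CHAPTER_HEAD.fullmatch(line):
        return "chapterHead"
    if SUB_HEAD.fullmatch(line):
        return "subHead"
    return "text"


def find_headings(text: str) -> list[tuple[str, str]]:
    """Return (line, kind) for every line that sits in a heading slot."""
    return [(m.group(1), classify_line(m.group(1)))
            for m in HEADING_SLOT.finditer(text)]
```

For instance, `classify_line("VI. FAKE TITLE FOR CHAPTER SIX")` gives `"chapterHead"`, while the same title without the numeral gives `"subHead"`. As in the thread, the newline context separates headings from one-line paragraphs, and the roman numeral separates chapter titles from the embedded subsection title. A heading on the very first line of the file, with no newlines before it, would not be found by this sketch.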