Re: Ambiguity (what else!?) question from David Birnbaum on 2025-01-28 (public-ixml@w3.org from January 2025)

From: David Birnbaum <djbpitt@gmail.com>
Date: Mon, 27 Jan 2025 23:14:20 -0500
To: Bethan Tovey-Walsh <bytheway@linguacelta.com>
Cc: ixml <public-ixml@w3.org>
Message-ID: <CAP4v81r8Ns5xEnacj8cNqrmMYyLQt9oQdco3VTHtW1juJWHpbA@mail.gmail.com>

Dear Bethan (cc public-ixml),

Thank you! Very helpful! The solution eventually wound up depending on a
combination, as your example shows, of newline details before and after
plus roman-numeral and other character-specific detail in the title line
itself. The newlines distinguish a chapter-title line from a one-line
paragraph and the roman numeral distinguishes chapter title lines from the
title line before the annoying subsection.
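
For anyone who wants to sanity-check that combination outside an ixml
processor, here is a minimal Python sketch. It is not ixml and not the
grammar I ended up with; it only transcribes the quoted roman-numeral rules
into regular expressions and applies the four-newlines-before,
two-newlines-after test. The punctuation set, the ASCII-only A-Z (standing
in for [Lu]), the bare "\n" newline, and the names find_headings and
to_roman are all assumptions made for illustration.

```python
import re

# Regex transcription of the quoted subTen / subHundred / roman rules (1-99).
SUB_TEN = r"(?:I?V|V?I{1,3}|IX)"
SUB_HUNDRED = rf"(?:(?:XC|X?L|L?X{{1,3}}){SUB_TEN}?)"
ROMAN = rf"(?:{SUB_HUNDRED}|{SUB_TEN})"

# Assumed stand-ins for the grammar's placeholders (not from the original).
PUNCT = r"',!?\-"                      # hypothetical punctuation set, no "."
HEAD_WORD = rf"[A-Z{PUNCT}]+"          # ASCII simplification of [Lu] | punct
TITLE = rf"{HEAD_WORD}(?: {HEAD_WORD})*"

# Four newlines before the title line, two after (checked with a lookahead).
CHAPTER_HEAD = re.compile(rf"\n{{4}}({ROMAN})\. ({TITLE})(?=\n{{2}})")
SUB_HEAD = re.compile(rf"\n{{4}}({TITLE})(?=\n{{2}})")


def find_headings(text):
    """Return (kind, numeral, title) tuples in document order."""
    found = []
    for m in CHAPTER_HEAD.finditer(text):
        found.append((m.start(), "chapter", m.group(1), m.group(2)))
    for m in SUB_HEAD.finditer(text):
        found.append((m.start(), "subsection", None, m.group(1)))
    found.sort(key=lambda item: item[0])
    return [item[1:] for item in found]


def to_roman(n):
    """Independent reference numerals for 1-99, used to check ROMAN."""
    tens = ["", "X", "XX", "XXX", "XL", "L", "LX", "LXX", "LXXX", "XC"]
    ones = ["", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"]
    return tens[n // 10] + ones[n % 10]


if __name__ == "__main__":
    sample = (
        "Intro text.\n\n\n\nVI. FAKE TITLE FOR CHAPTER SIX\n\n"
        "A one-line paragraph.\n\n\n\nFAKE HEADING FOR SUBSECTION\n\nBody."
    )
    for heading in find_headings(sample):
        print(heading)
```

The two patterns stay disjoint here only because the assumed punctuation
set leaves out the full stop; if punct allowed ".", the subsection pattern
could also match a chapter-title line, and I would expect the same overlap
in the ixml version.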

Best,

David

On Mon, Jan 27, 2025 at 6:51 PM Bethan Tovey-Walsh <bytheway@linguacelta.com>
wrote:

> Hi, David,
>
> I can't test this, because I only have my phone with me, but would
> something like the below work? It assumes chapter numbers only go up to 99,
> but could be adapted if this is some hugely lengthy tome.
>
> There are presumably no instances of regular text content with four
> newlines preceding it and two following it? If that's so, and you represent
> your text as a sequence of paragraphs separated by newlines, body text
> shouldn't be confusable with headings.
>
> I hope this helps - and that it's not too horribly full of errors. We need
> an iXML processor for iOS so I can test grammars at unsociable hours!
>
> BTW
>
>
> units = "I", ("I", "I"?)?.
>
> tens = "X", ("X", "X"?)?.
>
> subHundred = ("XC" | "X"?, "L" | "L"?, tens), subTen?.
>
> subTen = "I"?, "V" | "V"?, units | "I", "X".
>
> roman =  subHundred | subTen.
>
> -newline = {define newlines however works best for you}.
>
> -headingSep = newline, newline.
>
> -punct = {define your punctuation characters here}.
>
> -headWord = ([Lu] | punct)+.
>
> chapterHead = headingSep, headingSep, roman, ". ", headWord++" ",
> headingSep.
>
> subHead = headingSep, headingSep, headWord++" ", headingSep.
>
>
> ****************************************************
>
> Dr. Bethan Tovey-Walsh
>
> linguacelta.com
>
> Golygydd | Editor geirfan.cymru
>
> Croeso i chi ysgrifennu ataf yn y Gymraeg
>
> On 27 Jan 2025, at 22:34, David Birnbaum <djbpitt@gmail.com> wrote:
>
> 
> Dear ixml list,
>
> I'm using ixml to tag a plain-text novel where chapters begin with a roman
> numeral, a dot, a space, and an upper-case title (which may include spaces
> and a few punctuation marks), e.g.:
>
> VI. FAKE TITLE FOR CHAPTER SIX
>
> That pattern is easy to model; the issue is that my model for a line of
> regular narrative text (which may contain all of those characters and more)
> overlaps with it. That is, a line of regular text may include all of the
> characters allowed in a chapter-title line, except that a line of regular
> text never begins with something that matches a roman numeral followed by a
> dot.
>
> To make matters more complicated, there is one embedded subsection, within
> a chapter, that has a sub-title that is all upper-case, but without the
> leading roman numeral, along the lines of:
>
> FAKE HEADING FOR SUBSECTION EMBEDDED INSIDE NUMBERED CHAPTER
>
> Chapter-title lines are preceded by four newlines and followed by two,
> which is a pattern that I might have been able to use except that it is
> also the case with the embedded-subsection title line.
>
> I can get from plain text to XML with pipelining (for that matter, I can
> do it with a pure-XSLT pipeline) because with pipelining I can tag just the
> chapter-title lines first and then go back and tag the rest, having taken
> the chapter-heading lines out of consideration on the first pass. And with
> ixml if I rely on the four newlines before and two newlines after both
> chapter-title lines and the subsection title line I can tag all of those
> the same way and then patch up the incorrect tagging of the embedded
> subsection with a separate, subsequent XSLT step. But …
>
> In the interest of learning How To Do Stuff with ixml, I'd like to
> understand whether it's possible to write an unambiguous ixml grammar to
> tag the document in a way that recognizes chapter-heading lines and does
> not confuse them with either regular text lines or the annoying embedded
> subsection header line. Is there an ixml idiom for this that I haven't
> learned yet, or am I asking ixml to do something it isn't designed to do?
>
> Thanks in advance for any clarification!
>
> Best,
>
> David (djbpitt@gmail.com)
>
>
Received on Tuesday, 28 January 2025 04:14:35 UTC