Re: Ambiguity (what else!?) question from David Birnbaum on 2025-01-28 (public-ixml@w3.org from January 2025)

From: David Birnbaum <djbpitt@gmail.com>
Date: Mon, 27 Jan 2025 23:56:06 -0500
To: LdBeth <andpuke@foxmail.com>
Cc: ixml <public-ixml@w3.org>
Message-ID: <CAP4v81qFrmq8gtTw4aOWsND71QBg4pk1NhTXKi1h3-cbsBxhUg@mail.gmail.com>
Dear LdBeth (cc public-ixml),

Thank you for your generous explanation! Although I was able to grapple my
way to Something That Worked by following Bethan's good advice (on this
list) and relying on a combination of fussy newline-counting and roman
numeral patterns, your response drew my attention to two issues that I had
considered with insufficient care:

1. I gave up too early on crafting a pattern for a regular text line that
would exclude chapter titles. Your example helped me notice a regularity
that is obvious and self-explanatory in retrospect: a normal text line
never has a dot or an upper-case letter in second position except when the
line begins with a quotation mark, while chapter title lines and the
subsection title line always have a dot or an upper-case letter in second
position. That's enough to enable the sort of exclusion approach you
describe. The reason was obvious once I had noticed it: the regular text
has no words in all caps and no one-letter words that can appear at the end
of a sentence, so an upper-case letter can be in second position in a line
only when preceded by a quotation mark (the beginning of a quoted
sentence, and therefore not a chapter title) or when part of a title. Once
I incorporate the not-a-quotation-mark into a pattern, it isn't difficult
to write, even if it may not be very easy to read and understand without an
explanatory comment.

2. A regex approach is easier for me because I have a lot of experience
with using regex, including inside XSLT, to add markup to plain text, while
I'm still pretty much a beginner with ixml. Having now crafted an ixml
grammar that works, I can look back on how fussy the hand-compilation of
that grammar turns out to be. For example, my regex-inside-XSLT approach
used two-or-more-newlines to recognize paragraph and section transitions,
but nowhere, using regex, did I have to match two-or-three-newlines or
exactly-three-newlines, etc. XSLT is amenable to pipelining, both by
itself and within XProc, so I could concentrate on matching one detail at a
time.
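
The regularity in point 1 can be sketched with a single pattern. This is
a minimal Python illustration, not the pattern from the working grammar;
the set of quotation-mark characters and the sample lines are assumptions
made up for the example, not taken from the novel.

```python
import re

# Assumed quotation-mark characters; adjust to whatever the text uses.
QUOTES = "\"'\u201c\u2018"

# Title-like line: the first character is not a quotation mark and the
# second character is a dot or an upper-case ASCII letter.
TITLE_LIKE = re.compile(rf"^[^{QUOTES}\n][.A-Z]")


def is_title_like(line):
    """Return True when the second-position test marks the line as a title."""
    return TITLE_LIKE.match(line) is not None


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    samples = [
        "VI. FAKE TITLE FOR CHAPTER SIX",
        "I. ANOTHER FAKE TITLE",
        "It was a dark night.",
        "I went home.",
        '"The rain fell," she said.',
    ]
    for s in samples:
        print(is_title_like(s), s)
```

The check looks only at the first two characters, so it depends entirely
on the regularity holding for every line of the text.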

Your observation that the task might be better suited to regex, whether
with a sed pre-processing step, as you suggest, or within XSLT, helps me
think more generally about which tasks are well-suited to ixml and which
might be handled more effectively in other ways. I'm skeptical of my own
judgment when that question arises because situations where I have an
easier time solving a task with regex than with ixml may reflect my greater
experience with regex and XSLT than with ixml, and may have little to do
with which is the better tool for the job. I wouldn't be surprised if my
ixml grammar were more complicated than it needed to be, but I suspect that
it might not be possible to make it as simple and legible as a pipeline of
regex replacements.
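
To make "a pipeline of regex replacements" concrete, here is a minimal
Python sketch in which each stage matches one detail at a time. It is not
the XSLT that was actually used; the title pattern, the element names
`title` and `p`, and the sample text are all assumptions for illustration,
and escaping of markup characters in the text is left out.

```python
import re

# Hypothetical title pattern: roman numeral, dot, space, upper-case title.
TITLE = re.compile(r"^([IVXLCDM]+\. [A-Z][A-Z ,.'!?-]*)$", re.MULTILINE)


def tag_titles(text):
    """Stage 1: wrap each chapter-title line in <title>."""
    return TITLE.sub(r"<title>\1</title>", text)


def tag_paragraphs(text):
    """Stage 2: treat any run of two or more newlines as a block boundary."""
    blocks = [b for b in re.split(r"\n{2,}", text.strip()) if b]
    return "\n".join(
        b if b.startswith("<title>") else f"<p>{b}</p>" for b in blocks
    )


def pipeline(text):
    """Run the stages in order; each sees the output of the previous one."""
    for stage in (tag_titles, tag_paragraphs):
        text = stage(text)
    return text


if __name__ == "__main__":
    sample = (
        "VI. FAKE TITLE FOR CHAPTER SIX\n\n\n"
        "First paragraph,\nsecond line.\n\n"
        "Second paragraph.\n"
    )
    print(pipeline(sample))
```

Because the second stage only needs two-or-more-newlines, it never has to
distinguish two newlines from exactly three.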

Thank you again for the helpful and thought-provoking response.

Best,

David

On Mon, Jan 27, 2025 at 10:58 PM LdBeth <andpuke@foxmail.com> wrote:

> >>>>> In <CAP4v81pbXqTFt0G+CJj_QpFOvFHpu7A+r8aov7fFPZMSfi3UoQ@mail.gmail.com>
> >>>>>   David Birnbaum <djbpitt@gmail.com> wrote:
> > Dear ixml list,
>
> > I'm using ixml to tag a plain-text novel where chapters begin with a
> > roman numeral, a dot, a space, and an upper-case title (which may
> > include spaces and a few punctuation marks), e.g.:
>
> > VI. FAKE TITLE FOR CHAPTER SIX
>
> > The issue is that my model for a line of regular narrative text
> > (which may contain all of those characters and more) overlaps with
> > it.
>
> My first intuition is that regex-based tools like sed(1) are better
> suited to this type of task, as a preprocessing pass, since large portions
> of the text are ignored when identifying the header.
>
> However, it is not impossible to give an ixml-based solution, if you
> are actually willing to "hand compile" a grammar.
>
> Let's look at a simplified case I came up with just for demonstration
> purposes.
>
> A naive, straightforward grammar gives an ambiguous parse, as you've
> already mentioned.
>
> ```ixml
> document: chapter+, NL* .
> chapter: title, NL, body, NL .
> title: "title", ["0"-"9"].
> body: line++NL .
> line: ["0"-"9"; "a"-"z"]* .
> -NL: -#a .
> ```
>
> 2 ambiguous results:
>
> ```output
> <document xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
>    <chapter>
>       <title>title1</title>
>       <body>
>          <line>zastext1</line>
>          <line>dtext2</line>
>       </body>
>    </chapter>
>    <chapter>
>       <title>title2</title>
>       <body>
>          <line>dsdstext1</line>
>          <line>text2asa</line>
>       </body>
>    </chapter>
> </document>
> <document xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
>    <chapter>
>       <title>title1</title>
>       <body>
>          <line>zastext1</line>
>          <line>dtext2</line>
>          <line>title2</line>
>          <line>dsdstext1</line>
>          <line>text2asa</line>
>       </body>
>    </chapter>
> </document>
> ```
>
> The idea is, instead of working on the grammar for the title (though you
> will at least need a correct grammar for the title), to alter the
> definition of regular lines so that they don't match the title. That is,
> to hand-compile a PCRE negative look-ahead into an ixml grammar.
>
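
For comparison, the negative look-ahead being hand-compiled here can be
written directly in a regex engine. The following is a minimal Python
sketch for the simplified `titleN` example only; the names `TITLE`,
`NAIVE_LINE`, `LINE`, and `kinds` are made up for the illustration.

```python
import re

TITLE = re.compile(r"title[0-9]")

# Naive line rule: overlaps with TITLE, which is the source of ambiguity.
NAIVE_LINE = re.compile(r"[0-9a-z]*")

# Same rule with a negative look-ahead that excludes an exact title.
LINE = re.compile(r"(?!title[0-9]$)[0-9a-z]*")


def kinds(raw, line_rule=LINE):
    """Return the set of rules that accept the whole of `raw`."""
    found = set()
    if TITLE.fullmatch(raw):
        found.add("title")
    if line_rule.fullmatch(raw):
        found.add("line")
    return found


if __name__ == "__main__":
    for raw in ["title2", "text2asa", "", "titles", "title22"]:
        print(repr(raw), sorted(kinds(raw, NAIVE_LINE)), sorted(kinds(raw)))
```

With the naive rule a line such as `title2` is accepted as both a title
and a regular line; with the look-ahead each line belongs to at most one
of the two.
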
> For this simple example we need to define the grammar for `line` so
> that it does not match `titleN`. The good news is that you probably
> won't need a perfect grammar for the parse to produce a unique result
> for the particular input you are matching against. You'll see why
> soon.
>
> ```ixml
> document: chapter+, NL* .
> chapter: title, NL, body, NL .
> title: "title", ["0"-"9"].
> body: line++NL .
> line: char?; (~["t"; #a], char*) ; "t", (~["i"; #a], char*)?;
>   "ti", (~["t"; #a], char*)?; "tit", (~["l"; #a], char*)? .
> -char: ["0"-"9"; "a"-"z"] .
> -NL: -#a .
> ```
>
>
>
> ```input
> title1
> zastext1
> dtext2
>
> sd
> title2
> dsdstext1
> text2asa
> ```
>
> ```output
> <document>
>    <chapter>
>       <title>title1</title>
>       <body>
>          <line>zastext1</line>
>          <line>dtext2</line>
>          <line/>
>          <line>sd</line>
>       </body>
>    </chapter>
>    <chapter>
>       <title>title2</title>
>       <body>
>          <line>dsdstext1</line>
>          <line>text2asa</line>
>       </body>
>    </chapter>
> </document>
> ```
>
> The above grammar is not error-free, and I have not yet fully
> expanded the definition of `line` to handle all possible words that
> share a common prefix with "title", but it is already good enough
> and, most importantly, it works (TM).
>
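
One way to probe what the hand-compiled `line` rule accepts is to
transliterate its five alternatives into a regex and try a few strings.
The Python sketch below is such a transliteration, written for
illustration and not checked against any ixml processor; `accepts` is a
made-up helper name.

```python
import re

CHAR = "[0-9a-z]"

# One regex alternative per alternative of the ixml `line` rule.
LINE = re.compile(
    rf"{CHAR}?"                  # char?
    rf"|[^t\n]{CHAR}*"           # ~["t"; #a], char*
    rf"|t(?:[^i\n]{CHAR}*)?"     # "t", (~["i"; #a], char*)?
    rf"|ti(?:[^t\n]{CHAR}*)?"    # "ti", (~["t"; #a], char*)?
    rf"|tit(?:[^l\n]{CHAR}*)?"   # "tit", (~["l"; #a], char*)?
)


def accepts(raw):
    """Return True when the transliterated rule matches the whole line."""
    return LINE.fullmatch(raw) is not None


if __name__ == "__main__":
    body = ["zastext1", "dtext2", "", "sd", "dsdstext1", "text2asa"]
    print(all(accepts(s) for s in body))         # sample body lines
    print(accepts("title1"), accepts("title2"))  # the two titles
    print(accepts("titl"), accepts("titles"))    # words sharing a prefix
```

In this transliteration every body line of the sample input is accepted
and both titles are rejected, while `titl` and `titles` are rejected as
well, which is the unfinished part the message mentions.
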
>
> Best,
> LdBeth
>
Received on Tuesday, 28 January 2025 04:56:22 UTC