- From: David Birnbaum <djbpitt@gmail.com>
- Date: Mon, 27 Jan 2025 23:56:06 -0500
- To: LdBeth <andpuke@foxmail.com>
- Cc: ixml <public-ixml@w3.org>
- Message-ID: <CAP4v81qFrmq8gtTw4aOWsND71QBg4pk1NhTXKi1h3-cbsBxhUg@mail.gmail.com>
Dear LdBeth (cc public-ixml),

Thank you for your generous explanation! Although I was able to grapple my way to Something That Worked by following Bethan's good advice (on this list) and relying on a combination of fussy newline-counting and roman numeral patterns, your response drew my attention to two issues that I had considered with insufficient care:

1. I gave up too early on crafting a pattern for a regular text line that would exclude chapter titles. Your example helped me notice a regularity that is obvious and self-explanatory in retrospect: a normal text line never has a dot or an upper-case letter in second position except when the line begins with a quotation mark, while chapter title lines and the subsection title line always have a dot or an upper-case letter in second position. That's enough to enable the sort of exclusion approach you describe. The reason was obvious once I had noticed it: the regular text has no words in all caps and no one-letter words that can appear at the end of a sentence, so an upper-case letter can be in second position in a line only when preceded by a quotation mark (the beginning of a quoted sentence, and therefore not a chapter title) or when part of a title. Once I incorporate the not-a-quotation-mark condition into a pattern, the pattern isn't difficult to write, even if it may not be very easy to read and understand without an explanatory comment.

2. A regex approach is easier for me because I have a lot of experience with using regex, including inside XSLT, to add markup to plain text, while I'm still pretty much a beginner with ixml. Having now crafted an ixml grammar that works, I can look back on how fussy the hand-compilation of that grammar turns out to be.
For example, my regex-inside-XSLT approach used two-or-more-newlines to recognize paragraph and section transitions, but nowhere, using regex, did I have to match two-or-three-newlines or exactly-three-newlines, etc. XSLT is amenable to pipelining, both by itself and within XProc, so I could concentrate on matching one detail at a time.

Your observation that the task might be better suited to regex, whether with a sed pre-processing step, as you suggest, or within XSLT, helps me think more generally about which tasks are well-suited to ixml and which might be handled more effectively in other ways. I'm skeptical of my own judgment when that question arises because situations where I have an easier time solving a task with regex than with ixml may reflect my greater experience with regex and XSLT than with ixml, and may have little to do with which is the better tool for the job. I wouldn't be surprised if my ixml grammar were more complicated than it needed to be, but I suspect that it might not be possible to make it as simple and legible as a pipeline of regex replacements.

Thank you again for the helpful and thought-provoking response.

Best,

David

On Mon, Jan 27, 2025 at 10:58 PM LdBeth <andpuke@foxmail.com> wrote:

> >>>>> In <CAP4v81pbXqTFt0G+CJj_QpFOvFHpu7A+r8aov7fFPZMSfi3UoQ@mail.gmail.com>
> >>>>> David Birnbaum <djbpitt@gmail.com> wrote:
> > [1 <text/plain; UTF-8 (quoted-printable)>]
> > [2 <text/html; UTF-8 (quoted-printable)>]
> > Dear ixml list,
>
> > I'm using ixml to tag a plain-text novel where chapters begin with a
> > roman numeral, a dot, a space, and an upper-case title (which may
> > include spaces and a few punctuation marks), e.g.:
>
> > VI. FAKE TITLE FOR CHAPTER SIX
>
> > the issue is that my model for a line of regular narrative text
> > (which may contain all of those characters and more) overlaps with
> > it.
>
> My first intuition would be that regex-based tools like sed(1) are better
> suited to this type of task as a preprocessing pass, since large portions
> of the text are ignored when identifying the header.
>
> However, it is not impossible to give an ixml-based solution, if you
> are actually willing to "hand compile" a grammar.
>
> Let's see a simplified case I came up with just for demonstration purposes
>
> ```input
> document: chapter+, NL* .
> chapter: title, NL, body, NL .
> title: "title", ["0"-"9"].
> body: line++NL .
> line: ["0"-"9"; "a"-"z"]* .
> -NL: -#a .
> ```
>
> A naive, straightforward grammar would give an ambiguous parse, as you've
> already mentioned.
>
> ```ixml
> document: chapter+, NL* .
> chapter: title, NL, body, NL .
> title: "title", ["0"-"9"].
> body: line++NL .
> line: ["0"-"9"; "a"-"z"]* .
> -NL: -#a .
> ```
>
> 2 ambiguous results
>
> ```output
> <document xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
>   <chapter>
>     <title>title1</title>
>     <body>
>       <line>zastext1</line>
>       <line>dtext2</line>
>     </body>
>   </chapter>
>   <chapter>
>     <title>title2</title>
>     <body>
>       <line>dsdstext1</line>
>       <line>text2asa</line>
>     </body>
>   </chapter>
> </document>
> <document xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
>   <chapter>
>     <title>title1</title>
>     <body>
>       <line>zastext1</line>
>       <line>dtext2</line>
>       <line>title2</line>
>       <line>dsdstext1</line>
>       <line>text2asa</line>
>     </body>
>   </chapter>
> </document>
> ```
>
> The idea is, instead of working on the grammar for title (but at least
> you'll need a correct grammar for the title), to alter the definition of
> regular lines so they don't match the title. That is, to hand compile
> a PCRE negative look-ahead into an ixml grammar.
>
> For this simple example we would need to define the grammar for `line`
> so it does not match `titleN`. Actually, the good news is you probably
> won't need a perfect grammar for the parse to generate a unique
> result for the particular input you are matching against. You'll
> see why soon.
>
> ```ixml
> document: chapter+, NL* .
> chapter: title, NL, body, NL .
> title: "title", ["0"-"9"].
> body: line++NL .
> line: char?; (~["t"; #a], char*) ; "t", (~["i"; #a], char*)?;
>       "ti", (~["t"; #a], char*)?; "tit", (~["l"; #a], char*)? .
> -char: ["0"-"9"; "a"-"z"] .
> -NL: -#a .
> ```
>
> ```input
> title1
> zastext1
> dtext2
>
> sd
> title2
> dsdstext1
> text2asa
> ```
>
> ```output
> <document>
>   <chapter>
>     <title>title1</title>
>     <body>
>       <line>zastext1</line>
>       <line>dtext2</line>
>       <line/>
>       <line>sd</line>
>     </body>
>   </chapter>
>   <chapter>
>     <title>title2</title>
>     <body>
>       <line>dsdstext1</line>
>       <line>text2asa</line>
>     </body>
>   </chapter>
> </document>
> ```
>
> The above grammar is not error free, and I have not yet
> fully expanded the definition of `line` so that it does not match
> all possible words that share the common prefix with "title", but
> it is already good enough, and most importantly it works (TM)
>
> Best,
> LdBeth
>
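
For comparison, the negative look-ahead that the quoted grammar hand-compiles can be written directly in a regex. What follows is only a minimal Python sketch, assuming the simplified title pattern from the quoted example (`title` plus one digit); the names `TITLE`, `LINE`, and `split_chapters` are hypothetical and are not taken from either message. Unlike the quoted `line` rule, which rejects every line beginning with "titl", this look-ahead rejects only lines that are exactly a title.

```python
import re

# Title lines in the simplified example: "title" followed by exactly one digit.
TITLE = re.compile(r"title[0-9]")
# A regular line: digits and lower-case letters (possibly empty), with a
# negative look-ahead that rejects a line consisting solely of a title.
LINE = re.compile(r"(?!title[0-9]$)[0-9a-z]*")


def split_chapters(text):
    """Group lines into (title, body_lines) pairs (hypothetical helper)."""
    chapters = []
    for raw in text.split("\n"):
        if TITLE.fullmatch(raw):
            # A title line opens a new chapter.
            chapters.append((raw, []))
        elif chapters and LINE.fullmatch(raw):
            # Any other well-formed line belongs to the current chapter.
            chapters[-1][1].append(raw)
        else:
            raise ValueError(f"unexpected line: {raw!r}")
    return chapters


# Same sample as the quoted input block, without the trailing newline.
sample = "title1\nzastext1\ndtext2\n\nsd\ntitle2\ndsdstext1\ntext2asa"
chapters = split_chapters(sample)
```

On the sample this yields two chapters, with the empty line and `sd` kept in the body of the first one, which matches the unambiguous parse in the quoted output.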
Received on Tuesday, 28 January 2025 04:56:22 UTC