- From: LdBeth <andpuke@foxmail.com>
- Date: Mon, 27 Jan 2025 21:57:47 -0600
- To: David Birnbaum <djbpitt@gmail.com>
- Cc: ixml <public-ixml@w3.org>
- Message-ID: <tencent_0955C6DFB8DB73B79123444F32469C0E8A06@qq.com>
>>>>> In <CAP4v81pbXqTFt0G+CJj_QpFOvFHpu7A+r8aov7fFPZMSfi3UoQ@mail.gmail.com>
>>>>> David Birnbaum <djbpitt@gmail.com> wrote:

> Dear ixml list,

> I'm using ixml to tag a plain-text novel where chapters begin with a
> roman numeral, a dot, a space, and an upper-case title (which may
> include spaces and a few punctuation marks), e.g.:

> VI. FAKE TITLE FOR CHAPTER SIX

> the issue is that my model for a line of regular narrative text
> (which may contain all of those characters and more) overlaps with
> it.

My first intuition is that regex-based tools like sed(1) are better
suited to this kind of task, as a preprocessing pass, since large
portions of the text are ignored when identifying the header.

However, it is not impossible to give an ixml-based solution, if you
are actually willing to "hand compile" a grammar.

Let's look at a simplified case I came up with just for demonstration
purposes:

```input
document: chapter+, NL* .
chapter: title, NL, body, NL .
title: "title", ["0"-"9"].
body: line++NL .
line: ["0"-"9"; "a"-"z"]* .
-NL: -#a .
```

A naive, straightforward grammar gives an ambiguous parse, as you've
already mentioned.

```ixml
document: chapter+, NL* .
chapter: title, NL, body, NL .
title: "title", ["0"-"9"].
body: line++NL .
line: ["0"-"9"; "a"-"z"]* .
-NL: -#a .
```

Two ambiguous results:

```output
<document xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
   <chapter>
      <title>title1</title>
      <body>
         <line>zastext1</line>
         <line>dtext2</line>
      </body>
   </chapter>
   <chapter>
      <title>title2</title>
      <body>
         <line>dsdstext1</line>
         <line>text2asa</line>
      </body>
   </chapter>
</document>

<document xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
   <chapter>
      <title>title1</title>
      <body>
         <line>zastext1</line>
         <line>dtext2</line>
         <line>title2</line>
         <line>dsdstext1</line>
         <line>text2asa</line>
      </body>
   </chapter>
</document>
```

The idea is, instead of working on the grammar for the title (though
you do at least need a correct grammar for the title), to alter the
definition of regular lines so that they don't match the title. That
is, to hand compile a PCRE negative look-ahead into an ixml grammar.

For this simple example we need to define the grammar for `line` so
that it does not match `titleN`. The good news is that you probably
won't need a perfect grammar for the parse to produce a unique result
for the particular input you are matching against. You'll see why
soon.

```ixml
document: chapter+, NL* .
chapter: title, NL, body, NL .
title: "title", ["0"-"9"].
body: line++NL .
line: char?;
      (~["t"; #a], char*) ;
      "t", (~["i"; #a], char*)?;
      "ti", (~["t"; #a], char*)?;
      "tit", (~["l"; #a], char*)? .
-char: ["0"-"9"; "a"-"z"] .
-NL: -#a .
```

```input
title1
zastext1
dtext2

sd
title2
dsdstext1
text2asa
```

```output
<document>
   <chapter>
      <title>title1</title>
      <body>
         <line>zastext1</line>
         <line>dtext2</line>
         <line/>
         <line>sd</line>
      </body>
   </chapter>
   <chapter>
      <title>title2</title>
      <body>
         <line>dsdstext1</line>
         <line>text2asa</line>
      </body>
   </chapter>
</document>
```

The above grammar is not error free, and I have not yet fully expanded
the definition of `line` to handle every possible word that shares a
common prefix with "title", but it is already good enough, and most
importantly, it works (TM).

Best,
LdBeth
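
For comparison, the regex side of the approach can be sketched in a few lines. This is only an illustration in Python rather than sed(1), for the toy `titleN` format used above; the names `TITLE`, `LINE`, and `split_chapters` are made up for the sketch and are not part of any ixml tool. The `(?!...)` in `LINE` is the negative look-ahead that the hand-written `line` rule approximates.

```python
import re

# Toy title format from the example grammar: "title" followed by one digit.
TITLE = re.compile(r"title[0-9]")

# A body line is any run of [0-9a-z] that is NOT exactly a title.
# The negative look-ahead (?!...) rejects the line up front if it is a title.
LINE = re.compile(r"(?!title[0-9]$)[0-9a-z]*")


def split_chapters(text):
    """Group input lines into (title, [body lines]) pairs."""
    chapters = []
    for raw in text.splitlines():
        if TITLE.fullmatch(raw):
            # A title line always opens a new chapter.
            chapters.append((raw, []))
        elif LINE.fullmatch(raw) and chapters:
            # Anything else that looks like a body line joins the open chapter.
            chapters[-1][1].append(raw)
        else:
            raise ValueError(f"unexpected line: {raw!r}")
    return chapters


sample = "title1\nzastext1\ndtext2\n\nsd\ntitle2\ndsdstext1\ntext2asa\n"
for title, body in split_chapters(sample):
    print(title, body)
```

Unlike the partial ixml rule, the look-ahead here excludes only complete titles, so a line such as `title12` or `titl` still counts as body text.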
Received on Tuesday, 28 January 2025 04:03:18 UTC