Re: Ambiguity (what else!?) question from LdBeth on 2025-01-28 (public-ixml@w3.org from January 2025)

From: LdBeth <andpuke@foxmail.com>
Date: Mon, 27 Jan 2025 21:57:47 -0600
To: David Birnbaum <djbpitt@gmail.com>
Cc: ixml <public-ixml@w3.org>
Message-ID: <tencent_0955C6DFB8DB73B79123444F32469C0E8A06@qq.com>
>>>>> In <CAP4v81pbXqTFt0G+CJj_QpFOvFHpu7A+r8aov7fFPZMSfi3UoQ@mail.gmail.com>
>>>>>	David Birnbaum <djbpitt@gmail.com> wrote:
> Dear ixml list,

> I'm using ixml to tag a plain-text novel where chapters begin with a
> roman numeral, a dot, a space, and an upper-case title (which may
> include spaces and a few punctuation marks), e.g.:

> VI. FAKE TITLE FOR CHAPTER SIX

> the issue is that my model for a line of regular narrative text
> (which may contain all of those characters and more) overlaps with
> it.

My first instinct is that regex-based tools such as sed(1) are better
suited to this kind of task, run as a preprocessing pass, since large
portions of the text are ignored when identifying the header.
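
As a rough illustration of that preprocessing idea, here is a minimal
Python sketch (Python rather than sed, purely for readability). The
heading pattern, the punctuation allowed in titles, and the `#` marker
are all assumptions, not something taken from the original question; the
point is only that a line-anchored regex can flag headings before the
text ever reaches ixml.

```python
import re

# Assumed heading shape: roman numeral, dot, space, upper-case title.
# The punctuation allowed inside a title is a guess; adjust it to the novel.
HEADING = re.compile(r"^[IVXLCDM]+\. [A-Z][A-Z ,.'!?-]*$")

def mark_headings(text: str, marker: str = "#") -> str:
    """Prefix every line that looks like a chapter heading with `marker`."""
    out = []
    for line in text.splitlines():
        out.append(marker + line if HEADING.match(line) else line)
    return "\n".join(out)

sample = (
    "VI. FAKE TITLE FOR CHAPTER SIX\n"
    "It was a dark night.\n"
    "I. Am not a title."
)
print(mark_headings(sample))
```

If the marker is a character that never starts a narrative line, a
grammar can then tell headings from body text by the first character
alone.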

However, an ixml-based solution is not impossible, if you are
actually willing to "hand compile" a grammar.

Let's look at a simplified case I came up with just for demonstration purposes:

```input
title1
zastext1
dtext2
title2
dsdstext1
text2asa
```

A naive, straightforward grammar gives an ambiguous parse, as you've
already mentioned.

```ixml
document: chapter+, NL* .
chapter: title, NL, body, NL .
title: "title", ["0"-"9"].
body: line++NL .
line: ["0"-"9"; "a"-"z"]* .
-NL: -#a .
```

The parse is ambiguous, with two results:

```output
<document xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
   <chapter>
      <title>title1</title>
      <body>
         <line>zastext1</line>
         <line>dtext2</line>
      </body>
   </chapter>
   <chapter>
      <title>title2</title>
      <body>
         <line>dsdstext1</line>
         <line>text2asa</line>
      </body>
   </chapter>
</document>
<document xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
   <chapter>
      <title>title1</title>
      <body>
         <line>zastext1</line>
         <line>dtext2</line>
         <line>title2</line>
         <line>dsdstext1</line>
         <line>text2asa</line>
      </body>
   </chapter>
</document>
```

The idea is to leave the grammar for the title alone (though you do
need a correct grammar for the title) and instead alter the definition
of regular lines so that they don't match a title. That is, hand compile
a PCRE-style negative look-ahead into an ixml grammar.
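
To make the "hand compiled look-ahead" concrete, here is a small Python
sketch for the simplified case in this message. It compares a regex that
uses a negative look-ahead with a hand-written check that walks the
prefix "title" one character at a time, which is the same case analysis
the ixml alternatives have to spell out. The names `is_line`, `ALPHABET`
and `PREFIX` are made up for illustration.

```python
import re

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"
PREFIX = "title"

# What one would write with look-ahead support: any run of [0-9a-z]
# that is not exactly "title" followed by a single digit.
LOOKAHEAD = re.compile(r"(?!title[0-9]$)[0-9a-z]*")

def is_line(s: str) -> bool:
    """Hand-compiled version of LOOKAHEAD, without any look-ahead."""
    if any(c not in ALPHABET for c in s):
        return False
    # Walk the shared prefix, one grammar alternative per position.
    for i, expected in enumerate(PREFIX):
        if i == len(s):
            return True  # s is empty or a proper prefix of "title"
        if s[i] != expected:
            return True  # diverged from "title": anything may follow
    rest = s[len(PREFIX):]
    # s starts with "title": reject only when exactly one digit follows.
    return not (len(rest) == 1 and rest[0] in "0123456789")

for candidate in ["title2", "titles", "tit", "text2asa", "title12"]:
    print(candidate, is_line(candidate), bool(LOOKAHEAD.fullmatch(candidate)))
```

Both columns should agree for every candidate; only a bare `titleN` is
rejected as a body line.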

For this simple example we need to define the grammar for `line`
so that it does not match `titleN`. The good news is that you probably
won't need a perfect grammar for the parse to produce a unique
result for the particular input you are matching against. You'll
see why soon.

```ixml
document: chapter+, NL* .
chapter: title, NL, body, NL .
title: "title", ["0"-"9"].
body: line++NL .
line: char?; (~["t"; #a], char*) ; "t", (~["i"; #a], char*)?;
  "ti", (~["t"; #a], char*)?; "tit", (~["l"; #a], char*)? .
-char: ["0"-"9"; "a"-"z"] .
-NL: -#a .
```



```input
title1
zastext1
dtext2

sd
title2
dsdstext1
text2asa
```

```output
<document>
   <chapter>
      <title>title1</title>
      <body>
         <line>zastext1</line>
         <line>dtext2</line>
         <line/>
         <line>sd</line>
      </body>
   </chapter>
   <chapter>
      <title>title2</title>
      <body>
         <line>dsdstext1</line>
         <line>text2asa</line>
      </body>
   </chapter>
</document>
```

The above grammar is not error free: I have not yet fully expanded
the definition of `line` to handle every possible word that shares
a common prefix with "title". But it is already good enough, and
most importantly it works (TM).
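
Writing out the remaining alternatives by hand is mechanical, so they
can be generated. The following Python sketch prints a fully expanded
`line` rule in the same style as the grammar above. `line_rule` is a
hypothetical helper; the first alternative is folded into an optional
group instead of a separate `char?`, a variation meant to keep two
alternatives from matching the same one-character line; and the
generated rule has not been run through an ixml processor, so treat it
as a starting point.

```python
def line_rule(prefix: str, tail_class: str) -> str:
    """Build an ixml `line` rule that rejects prefix + one char of tail_class.

    `tail_class` is the inside of an ixml character set, e.g. '"0"-"9"'.
    """
    alts = []
    for i, ch in enumerate(prefix):
        # Matched prefix[:i] so far; the next character must differ from ch.
        head = f'"{prefix[:i]}", ' if i else ""
        alts.append(f'{head}(~["{ch}"; #a], char*)?')
    # Whole prefix matched, then nothing or something outside tail_class.
    alts.append(f'"{prefix}", (~[{tail_class}; #a], char*)?')
    # Whole prefix plus one tail character is fine if more text follows.
    alts.append(f'"{prefix}", [{tail_class}], char+')
    return "line: " + ";\n      ".join(alts) + " ."

print(line_rule("title", '"0"-"9"'))
```

The printed rule has one alternative per position in "title", plus two
for what may follow the complete word.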


Best,
LdBeth
Received on Tuesday, 28 January 2025 04:03:18 UTC