NineML version 3.2.9 published from Norm Tovey-Walsh on 2025-03-03 (public-ixml@w3.org from March 2025)

From: Norm Tovey-Walsh <norm@saxonica.com>
Date: Mon, 03 Mar 2025 17:13:54 +0000
To: ixml <public-ixml@w3.org>
Message-ID: <m2mse2f359.fsf@saxonica.com>
Hi folks,

I published NineML version 3.2.9 this afternoon.

  https://docs.nineml.org/current/changelog.html

(I tried, but failed, to publish 3.2.8 a week or so ago; I forgot to push the tag *blush*. So this is actually the first release since 3.2.7.)

There are some bug fixes in here, but possibly the most interesting thing is my attempt to produce better debugging information when a parse fails. I haven’t properly documented it yet but a quick example might be illustrative.

Consider this grammar for “sentences”:

sentence = word++ws, punct .
-ws = -" "+ .
-uc = ["A"-"Z"] .
-lc = ["a"-"z"] .
word = (uc, lc*) | lc+ .
-punct = ["." | "!" | "?" ] .

And this input: This is A TEST.

That’s not a sentence because a capital letter can only appear at the start of a word, so “TEST” isn’t a word. Previous versions of NineML produced an error document like this one:

<fail xmlns:ixml='http://invisiblexml.org/NS' ixml:state='failed'>
   <line>1</line>
   <column>12</column>
   <pos>12</pos>
   <unexpected>E</unexpected>
   <permitted>' ', ['.'; '!'; '?'], ['a'-'z']</permitted>
</fail>

Which is useful as far as it goes, but there are lots of situations in more complicated grammars where that’s only a very vague pointer in the right direction.
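
As a quick sanity check on that older report, here is a minimal Python sketch (not part of NineML) that reads the failure document above with the standard library and confirms that the reported column points at the unexpected character. It assumes columns are 1-based, which is what the example suggests; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The failure document quoted above, copied verbatim.
FAIL_XML = """<fail xmlns:ixml='http://invisiblexml.org/NS' ixml:state='failed'>
   <line>1</line>
   <column>12</column>
   <pos>12</pos>
   <unexpected>E</unexpected>
   <permitted>' ', ['.'; '!'; '?'], ['a'-'z']</permitted>
</fail>"""

INPUT = "This is A TEST."

fail = ET.fromstring(FAIL_XML)
column = int(fail.findtext("column"))
unexpected = fail.findtext("unexpected")

# Assumption: columns are 1-based, so column 12 is INPUT[11].
offending = INPUT[column - 1]
parsed_ok = INPUT[: column - 1]

print(f"stopped at column {column}: {offending!r} after {parsed_ok!r}")
```

The report pins down where the parse stopped, but says nothing about which rules were involved.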

Starting with 3.2.9, the failure document contains three new sections:

   <completions>
      <completed start='1' end='1' rules='uc, word'>
         <input>T</input>
      </completed>
      <completed start='2' end='2' rules='lc'>
         <input>h</input>
      </completed>
      …
   </completions>

This gives you a sense of what the parser had successfully parsed. So the initial “T” satisfies both “uc” and “word”. The following “h” satisfies lc, etc.
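
For anyone who wants to post-process this section, a minimal Python sketch follows. It is not part of NineML, it only covers the two entries quoted above (the elided ones are not reconstructed), and it assumes the rules attribute is a comma-separated list, as the example suggests.

```python
import xml.etree.ElementTree as ET

# Only the two entries shown in the message; the rest are elided there.
COMPLETIONS_XML = """<completions>
   <completed start='1' end='1' rules='uc, word'>
      <input>T</input>
   </completed>
   <completed start='2' end='2' rules='lc'>
      <input>h</input>
   </completed>
</completions>"""

root = ET.fromstring(COMPLETIONS_XML)

rows = []
for el in root.findall("completed"):
    # Assumption: 'rules' is a comma-separated list of nonterminal names.
    rules = [name.strip() for name in el.get("rules").split(",")]
    rows.append((int(el.get("start")), int(el.get("end")), rules, el.findtext("input")))

for start, end, rules, text in rows:
    print(f"{start}-{end} {text!r} completes {', '.join(rules)}")
```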

Next there’s an attempt to make the “permitted” list more useful:

   <could-be-next>
      <in rule='lc'>
         <tokens>['a'-'z']</tokens>
      </in>
      <in rule='punct'>
         <tokens>['.'; '!'; '?']</tokens>
      </in>
   </could-be-next>

This indicates that a-z is allowed next by the “lc” rule and the punctuation symbols are allowed next by the “punct” rule. (Why space isn’t included in this list is a bit of a mystery; I’ll have to investigate when I have a moment.)
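
The gap is easy to see mechanically. The following Python sketch (not part of NineML) compares the old permitted list with the tokens in this section. Splitting the permitted text on ", " is a naive assumption that happens to work for this example, because the set members are separated by "; " rather than ", ".

```python
import xml.etree.ElementTree as ET

# Text of the <permitted> element from the older failure document.
PERMITTED = "' ', ['.'; '!'; '?'], ['a'-'z']"

COULD_BE_NEXT_XML = """<could-be-next>
   <in rule='lc'>
      <tokens>['a'-'z']</tokens>
   </in>
   <in rule='punct'>
      <tokens>['.'; '!'; '?']</tokens>
   </in>
</could-be-next>"""

root = ET.fromstring(COULD_BE_NEXT_XML)

# Map each rule name to the tokens it would accept next.
by_rule = {el.get("rule"): el.findtext("tokens") for el in root.findall("in")}

# Naive split; adequate for this example only.
permitted = set(PERMITTED.split(", "))
missing = permitted - set(by_rule.values())

print("permitted but not attributed to a rule:", missing)
```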

Finally, there’s a list of which rules were open:

   <unfinished>
      <open start='1' end='1' rules='word'>
         <input>T</input>
      </open>
      <open start='1' end='11' rules='sentence'>
         <input>This is A T</input>
      </open>
   </unfinished>

You’d think the word that started at 1 was finished, but there’s a bunch of heuristics going on to try to reduce the number of rules presented. I’ll be looking into that too.
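
The start and end attributes appear to be 1-based, inclusive offsets into the input. That is an inference from the examples in this message, not documented behaviour. A minimal Python sketch (not part of NineML) that checks the inference against the section above:

```python
import xml.etree.ElementTree as ET

INPUT = "This is A TEST."

UNFINISHED_XML = """<unfinished>
   <open start='1' end='1' rules='word'>
      <input>T</input>
   </open>
   <open start='1' end='11' rules='sentence'>
      <input>This is A T</input>
   </open>
</unfinished>"""

root = ET.fromstring(UNFINISHED_XML)

spans = []
for el in root.findall("open"):
    start, end = int(el.get("start")), int(el.get("end"))
    # Assumption: offsets are 1-based and inclusive at both ends.
    covered = INPUT[start - 1 : end]
    spans.append((el.get("rules"), covered, el.findtext("input")))

for rules, covered, reported in spans:
    status = "matches" if covered == reported else "differs from"
    print(f"{rules}: slice {covered!r} {status} reported {reported!r}")
```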

Caveat: this can only show you what the parser was doing when it gave up. That can be misleading: the parser doesn’t know *why* the input isn’t correct, it just knows that it was running all these effectively parallel parsing attempts and they all reached dead ends. Maybe that’s because the last character attempted needed to be a “-” instead of a “.”, or maybe the problem has to be fixed 1,000 characters and 87 rules earlier. Only you can tell.

This output will almost certainly change over the next few releases as I figure out heuristics that work better. Comments and feedback most welcome.

                                        Be seeing you,
                                          norm

--
Norm Tovey-Walsh
Saxonica
Received on Monday, 3 March 2025 17:14:02 UTC