Re: [invisibleXML/ixml] https://github.com/invisibleXML/ixml/blob/master/samples/URI/rfc-3987.ixml (Issue #139) from John Lumley on 2022-08-24 (public-ixml@w3.org from August 2022)

From: John Lumley <john@saxonica.com>
Date: Wed, 24 Aug 2022 10:17:18 +0100
To: public-ixml@w3.org
Message-ID: <ddc96002-966e-7435-0e76-d0936a415d7f@saxonica.com>

On 13/08/2022 11:46, Steven Pemberton wrote:
> I think one principle of ixml supplied grammars should be: you don't 
> need to reparse any subtrees.

This is, I think, a point worth some emphasis or discussion.

You could write an ixml grammar that:

  * just acts as a 'correct recogniser' of grammatical sentences, with
    the side effect of some parse tree, or
  * endeavours to parse as much of the syntactic/semantic structure as
    you can and return as complete a syntax tree as possible.

or of course anywhere on the spectrum in between.

If you find you have to do a lot more 'structural analysis' on the 
resulting XML tree then perhaps you haven't used the 'power' of ixml 
descriptions enough. On the other hand, you might find that trying to 
squeeze out the last parts of structure makes the ixml unwieldy when 
something simpler could be done in a downstream XPath/XSLT process. And 
we might expect any downstream processing to have to do, at least 
sometimes, some checking of the XML that might be impossible to code 
efficiently in ixml.

The discussion on the IPv6 grammars gives examples of the possibilities, 
but as a very simple example consider the recent bug, where the 
character class 'LC' wasn't recognised by the 1.0 grammar. The proposed 
change:

         -class: code.
          @code: capital, letter?.
       -capital: ["A"-"Z"].
        -letter: ["a"-"z";"A"-"Z"].

certainly is the simplest solution, but it isn't the most exact, since 
as far as we can see 'LC' is the /only/ uppercase letter pair permitted 
so far, so we could in theory make the code production treat 'LC' as one 
of the options, viz:

         -class: code.
          @code: "LC" ; capital, letter?.
       -capital: ["A"-"Z"].
        -letter: ["a"-"z"].

which is of course more precise, but puts further load on the ixml 
process. And in any case we have to check downstream that the Unicode 
class code, held in the @code attribute, is a valid one anyway, unless 
our grammar denotes each and every possible code, which would of course 
make the grammar totally unwieldy.
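
To make the trade-off concrete, here is a rough Python sketch (not ixml, 
and not part of any proposed change). It approximates the two versions of 
the code rule as regular expressions and pairs them with the kind of 
downstream check described above. The names LOOSE, STRICT, VALID_CODES, 
grammar_accepts and is_valid_class_code are made up for illustration, and 
the regular expressions are only assumed to be equivalent to the ixml 
rules for these short strings.

```python
import re

# Assumed regex equivalent of: capital, letter?  with letter = ["a"-"z";"A"-"Z"]
LOOSE = re.compile(r"[A-Z][a-zA-Z]?")
# Assumed regex equivalent of: "LC" ; capital, letter?  with letter = ["a"-"z"]
STRICT = re.compile(r"LC|[A-Z][a-z]?")

# Unicode general category codes: the one- and two-letter values plus 'LC'.
VALID_CODES = frozenset("""
    L LC Lu Ll Lt Lm Lo
    M Mn Mc Me
    N Nd Nl No
    P Pc Pd Ps Pe Pi Pf Po
    S Sm Sc Sk So
    Z Zs Zl Zp
    C Cc Cf Cs Co Cn
""".split())


def grammar_accepts(pattern, code):
    """True if the whole string matches the (approximated) code rule."""
    return pattern.fullmatch(code) is not None


def is_valid_class_code(code):
    """Downstream check: is the @code value a real Unicode category code?"""
    return code in VALID_CODES


if __name__ == "__main__":
    # Columns: code, accepted by loose rule, accepted by strict rule, valid code
    for code in ["Lu", "LC", "LX", "Lq"]:
        print(code,
              grammar_accepts(LOOSE, code),
              grammar_accepts(STRICT, code),
              is_valid_class_code(code))
```

In this sketch the stricter rule rejects 'LX', which the looser one lets 
through, but both still accept 'Lq', which is not a Unicode category code 
at all, so the downstream check is needed in either case.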

-- 
*John Lumley* MA PhD CEng FIEE
john@saxonica.com

Received on Wednesday, 24 August 2022 09:17:42 UTC