- From: John Lumley <john@saxonica.com>
- Date: Wed, 24 Aug 2022 10:17:18 +0100
- To: public-ixml@w3.org
- Message-ID: <ddc96002-966e-7435-0e76-d0936a415d7f@saxonica.com>
On 13/08/2022 11:46, Steven Pemberton wrote:
> I think one principle of ixml supplied grammars should be: you don't
> need to reparse any subtrees.
This is, I think, a point worth some emphasis or discussion.
You could write an ixml grammar that:
* just acts as a 'correct recogniser' of grammatical sentences, with
the side effect of some parse tree, or
* endeavours to parse as much of the syntactic/semantic structure as
you can and return as complete a syntax tree as possible.
or of course anywhere on the spectrum in between.
If you find you have to do a lot more 'structural analysis' on the
resulting XML tree then perhaps you haven't used the 'power' of ixml
descriptions enough. On the other hand you might find that trying to
squeeze out the last parts of structure makes the ixml unwieldy when
something simpler could happen in a downstream XPath/XSLT process. And
we might expect any downstream processing to have to do at least some
checking of the XML that might be impossible to code efficiently
in ixml.
The discussion on the IPv6 grammars gives examples of the possibilities,
but as a very simple example consider the recent bug, where the
character class 'LC' wasn't recognised by the 1.0 grammar. The proposed
change:
-class: code.
@code: capital, letter?.
-capital: ["A"-"Z"].
-letter: ["a"-"z";"A"-"Z"].
certainly is the simplest solution, but it isn't the most exact, since
as far as we can see 'LC' is the /only/ uppercase letter pair permitted
so far, so we could in theory make the code production consider 'LC' one
of the options, viz:
-class: code.
@code: "LC" ; capital, letter?.
-capital: ["A"-"Z"].
-letter: ["a"-"z"].
which is of course more precise, but puts further load on the ixml
process. And in any case we have to check downstream that the Unicode
class code, held in the @code attribute, is a valid one anyway, unless
our grammar denotes each and every possible code, which would of course
make the grammar totally unwieldy.
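
As a rough illustration of that downstream check, here is a minimal
Python sketch. It is not part of any ixml implementation. The two
regular expressions are only stand-ins for the two 'code' productions
above, the function names are hypothetical, and it assumes that the
valid class codes are exactly the Unicode general category values,
including grouping aliases such as L and LC.

```python
import re

# Regex stand-ins for the two `code` productions (illustrative only).
# Loose:  capital, letter?   with letter = a-z or A-Z
LOOSE = re.compile(r"[A-Z][a-zA-Z]?")
# Strict: "LC" ; capital, letter?   with letter = a-z only
STRICT = re.compile(r"LC|[A-Z][a-z]?")

# Assumption: valid class codes are the Unicode general category values,
# including the grouping aliases (L, LC, M, N, P, S, Z, C).
GENERAL_CATEGORIES = frozenset("""
    L LC Lu Ll Lt Lm Lo
    M Mn Mc Me
    N Nd Nl No
    P Pc Pd Ps Pe Pi Pf Po
    S Sm Sc Sk So
    Z Zs Zl Zp
    C Cc Cf Cs Co Cn
""".split())


def grammar_accepts(pattern, code):
    """True if the whole string matches the given stand-in production."""
    return pattern.fullmatch(code) is not None


def is_valid_class_code(code):
    """Downstream check: is the parsed @code value a known category?"""
    return code in GENERAL_CATEGORIES
```

With these definitions the loose pattern accepts both 'LC' and 'XY',
while the strict one accepts 'LC' but rejects 'XY'. Both still accept
'Xx', which is not a category at all, so is_valid_class_code is needed
whichever production is used.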
--
*John Lumley* MA PhD CEng FIEE
john@saxonica.com
Received on Wednesday, 24 August 2022 09:17:42 UTC