- From: John Lumley <john@saxonica.com>
- Date: Wed, 24 Aug 2022 10:17:18 +0100
- To: public-ixml@w3.org
- Message-ID: <ddc96002-966e-7435-0e76-d0936a415d7f@saxonica.com>
On 13/08/2022 11:46, Steven Pemberton wrote:
> I think one principle of ixml supplied grammars should be: you don't
> need to reparse any subtrees.

This is, I think, a point worth some emphasis or discussion. You could write an ixml grammar that:

* just acts as a 'correct recogniser' of grammatical sentences, with some parse tree as a side effect, or
* endeavours to parse as much of the syntactic/semantic structure as it can and return as complete a syntax tree as possible,

or of course anywhere on the spectrum in between.

If you find you have to do a lot more 'structural analysis' on the resulting XML tree, then perhaps you haven't used the 'power' of ixml descriptions enough. On the other hand, you might find that trying to squeeze the last parts of structure out of the input makes the ixml unwieldy when something simpler could be done in a downstream XPath/XSLT process. And we might expect any downstream processing to have to do, at least sometimes, some checking of the XML that would be impossible to code efficiently in ixml.

The discussion on the IPv6 grammars gives examples of the possibilities, but as a very simple example consider the recent bug, where the character class 'LC' wasn't recognised by the 1.0 grammar. The proposed change:

    -class: code.
    @code: capital, letter?.
    -capital: ["A"-"Z"].
    -letter: ["a"-"z";"A"-"Z"].

certainly is the simplest solution, but it isn't the most exact, since as far as we can see 'LC' is the /only/ uppercase letter pair permitted so far, so we could in theory make the code production consider 'LC' one of the options, viz:

    -class: code.
    @code: "LC" ; capital, letter?.
    -capital: ["A"-"Z"].
    -letter: ["a"-"z"].

which is of course more precise, but puts further load on the ixml process. And in any case we have to check downstream that the Unicode class code, held in the @code attribute, is a valid one anyway, unless our grammar denotes each and every possible code, which would of course make the grammar totally unwieldy.
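To make the trade-off concrete, here is a rough sketch in Python (not ixml, and not part of any ixml processor) of the two 'code' productions and of the downstream check on the @code attribute. The regular expressions are only approximations of the two grammar fragments; the names LOOSE, PRECISE, VALID_CODES and invalid_codes, and the sample XML fragment, are made up for illustration. The set of valid codes is the list of Unicode general category values, including the one-letter groups and 'LC'.

```python
import re
import xml.etree.ElementTree as ET

# Regex approximations of the two `code` productions (illustrative only).
# Simple version:  capital, letter?  with letter: ["a"-"z";"A"-"Z"]
LOOSE = re.compile(r"[A-Z][a-zA-Z]?")
# Precise version: "LC" ; capital, letter?  with letter: ["a"-"z"]
PRECISE = re.compile(r"LC|[A-Z][a-z]?")

# Unicode general category codes: one-letter groups, two-letter values, and LC.
VALID_CODES = frozenset(
    "L LC Lu Ll Lt Lm Lo M Mn Mc Me N Nd Nl No "
    "P Pc Pd Ps Pe Pi Pf Po S Sm Sc Sk So "
    "Z Zs Zl Zp C Cc Cf Cs Co Cn".split()
)


def invalid_codes(xml_text):
    """Return every code attribute value that is not a known category code."""
    root = ET.fromstring(xml_text)
    found = (el.get("code") for el in root.iter())
    return [c for c in found if c is not None and c not in VALID_CODES]


# Hypothetical parse output; the element names are assumptions.
sample = '<set><member code="LC"/><member code="Lu"/><member code="Xy"/></set>'

print(bool(LOOSE.fullmatch("LU")))    # True: the simple grammar lets 'LU' through
print(bool(PRECISE.fullmatch("LU")))  # False: only 'LC' may have two capitals
print(bool(PRECISE.fullmatch("Xy")))  # True: well-formed, but not a real category
print(invalid_codes(sample))          # ['Xy']
```

Either way 'Xy' gets past the grammar, so the lookup against the list of valid codes is needed downstream whichever production is chosen.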
-- *John Lumley* MA PhD CEng FIEE john@saxonica.com
Received on Wednesday, 24 August 2022 09:17:42 UTC