Re: ixml and non-xml output - is there an error and if so where?

On 22 Dec 2021, at 7:43 AM, Steven Pemberton <steven.pemberton@cwi.nl> wrote:

> I think this should be our philosophical position:

> ixml provides a way for the author to convert non-XML documents into
> XML.  It is up to the author to write ixml so that it produces
> correct XML, and therefore to ensure that:

>     * serialised names are correct XML names,
>     * attribute and element content do not contain illegal
>       characters,
>     * any element does not have more than one attribute of a given
>       name,
>
> and not worry more about these issues within the ixml definition.

> As to classifications, in my recent mail on this I proposed a 5-way
> split

>     1. ixml grammar syntax errors
>     2. ixml grammar semantic errors
>     3. ixml grammar correct, input and grammar don't match
>     4. ixml grammar correct, input is ambiguous
>     5. ixml grammar correct, test completes correctly

> What you are proposing I think is a sixth:

>     6. ixml grammar correct, test completes correctly, resulting XML is
>        in error as a result of authoring errors.

My answer has grown into two different messages: one about test suites
and one about parsing outcomes and errors.  This is the one about
outcomes.

I think your five-way split makes sense as a rough classification of
outcomes, though I don't understand your second item and I think that
items 4 and 5 belong together.

The test case tests/expr1.* may lead us to postulate a sixth kind of
outcome, but I don't think I understand the problem well enough to
make such a proposal now, and your description worries me a bit.

My first concern is with the words "test completes correctly".  I am
not confident that the test will complete at all in my processor,
since the well-formedness errors may well cause a run-time exception.
(I haven't yet tried it, so I don't know.)

My second concern is that I don't know what 'correctly' might mean
here.

Third, I don't know who the 'author' is who has committed authoring
errors.

If the "author" here is the writer of the input, then this sounds like
saying that if the input to expr1.ixml produces non-well-formed
output, then the input is not as expected, and the correct
specification of the expected result is to say that the input is not a
sentence in the language defined by the grammar.

But the input does conform to the grammar as written, interpreted
solely as a context-free grammar: it's just the annotations for XML
serialization that cause the problem.  If 'plusOp' were marked ^
rather than @, there would be no problem serializing the result.  On
balance, I don't think this is the right analysis.
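I don't reproduce tests/expr1.ixml here, but the shape of the problem
is presumably something like the following sketch (the rule names and
details are my guesses, not the actual content of the test file):

```ixml
{ A guessed sketch of the expr1 shape, not the real test file. }
expr: term++plusOp.
term: ["0"-"9"]+.
@plusOp: "+".

{ On input "1+2" this serializes without trouble:              }
{   <expr plusOp="+"><term>1</term><term>2</term></expr>       }
{ but on input "1+2+3" the expr element would need two         }
{ attributes named plusOp, which is not well-formed XML.       }
{ Marking plusOp ^ rather than @ removes the problem, since    }
{ repeated child elements are fine where repeated attributes   }
{ are not.                                                     }
```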

If the "author" you have in mind is the writer of the ixml grammar,
then this sounds like saying that a grammar that produces
non-well-formed output is a faulty grammar.  If so, then I think the
correct specification of the expected result is to say that expr1.ixml
is not a conforming grammar.

That analysis does have a wrinkle.  If the file tests/expr1.ixml is
not a conforming ixml grammar, then a conforming processor is
currently required to reject it, even though on some inputs
(e.g. "1+2") the problem will not be visible.  So we seem to have a
choice:

  - We can say that conforming processors must detect the problem
  here, even if the input does not exercise it.

or

  - We can say that conforming processors are not required after all
  to detect and report errors in grammars.

The same choice arises in connection with checking whether
nonterminals are legal names, though we didn't notice when we were
discussing that topic.


Perhaps we do need another category.

I think the philosophical position you outline here, and what you have
proposed in connection with nonterminals and XML names, amount to
distinguishing two classes of errors:

(1) errors in grammars we can check for and detect, independently of
any input string, and

(2) errors that are only detectable (or only readily detectable) given
both grammar and input.

I'll call these static and dynamic errors for short.
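To make the distinction concrete, here are two grammar fragments
invented for illustration (neither is from the test suite):

```ixml
{ Hypothetical static error: every successful parse of 'pair'  }
{ would serialize with two attributes named 'id', whatever the }
{ input, so the fault is visible from the grammar alone.       }
pair: @id, ",", @id.
id: ["a"-"z"]+.

{ Hypothetical dynamic error: 'text' accepts NUL as well as    }
{ printable ASCII, but #0 can never appear in XML character    }
{ data, so whether serialization fails depends on the input    }
{ (assuming the processor accepts such input at all).          }
text: [" "-"~"; #0]*.
```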

Since static errors are detectable by examining grammars in isolation, 
it's plausible to call them errors in the grammar. 

Since dynamic errors are not required to be detected (and may not even
be detectable) by examining grammars in isolation, it is less clear
whether to call them errors in the grammar, errors in the input, or
something else.

They are perhaps errors in the grammar: they are failures of the
grammar writer to guarantee that the markings specify well-formed
output for all grammatical inputs.  (But we require processors to
reject non-conforming grammars: if non-conformance is not statically
detectable, that requirement is impossible to satisfy.)

They are perhaps errors in the input.  Only, 'error' in a spec like
ours usually means that something is non-conforming.  We have
conformance rules for grammars and for processors; input is not
something that can conform or fail to conform to our spec.

Or perhaps they are best described as dynamic errors or dynamic
exceptions, without pointing either at the grammar or at the input.
They are situations that can arise while parsing input against a
conforming grammar.

If we do need another category of outcome from a parsing run or test
case (I'm still thinking), I am tempted to say:

- Such a run-time exception is not an error in the grammar; conforming
  grammars can have run-time exceptions on some inputs.  (So
  processors are not required to detect and reject grammars that could
  have run-time exceptions for some inputs.)

- Such a run-time exception is not a sign that the input is not
  grammatical.

- Conforming processors are required to report such run-time
  exceptions and MAY recover from them.  (In the case of expr1.*,
  recovery might take the form of choosing any one of the 'plusOp'
  attribute-value specifications to include and discarding the others,
  or it might take the form of ignoring the @ marking on plusOp and
  serializing it as an element.)

Michael

Received on Wednesday, 22 December 2021 19:07:23 UTC