- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Wed, 22 Dec 2021 12:07:02 -0700
- To: Steven Pemberton <steven.pemberton@cwi.nl>
- Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, ixml <public-ixml@w3.org>
On 22 Dec 2021, at 7:43 AM, Steven Pemberton <steven.pemberton@cwi.nl> wrote:

> I think this should be our philosophical position:
>
> ixml provides a way for the author to convert non-XML documents into
> XML. It is up to the author to write ixml so that it produces
> correct XML, and therefore to ensure that:
>
> * serialised names are correct XML names,
> * attribute and element content do not contain illegal
>   characters,
> * any element does not have more than one attribute of a given name
>
> and not worry more about these issues within the ixml definition.
>
> As to classifications, in my recent mail on this I proposed a 5-way
> split
>
> 1. ixml grammar syntax errors
> 2. ixml grammar semantic errors
> 3. ixml grammar correct, input and grammar don't match
> 4. ixml grammar correct, input is ambiguous
> 5. ixml grammar correct, test completes correctly
>
> What you are proposing I think is a sixth:
>
> 6. ixml grammar correct, test completes correctly, resulting XML is
>    in error as a result of authoring errors.

My answer has grown into two different messages: one about test
suites and one about parsing outcomes and errors. This is the one
about outcomes.

I think your five-way split makes sense as a rough classification of
outcomes, though I don't understand your second item and I think that
items 4 and 5 belong together.

The test case tests/expr1.* may lead us to postulate a sixth kind of
outcome, but I don't think I understand the problem well enough to
make such a proposal now, and your description worries me a bit.

My first concern is with the words "test completes correctly". I am
not confident that the test will complete at all in my processor,
since the well-formedness errors may well cause a run-time exception.
(I haven't yet tried it, so I don't know.) My second concern is that
I don't know what 'correctly' might mean here. Third, I don't know
who the 'author' is who has committed authoring errors.

If the "author" here is the writer of the input, then this sounds
like saying that if the input to expr1.ixml produces non-well-formed
output, then the input is not as expected, and the correct
specification of the expected result is to say that the input is not
a sentence in the language defined by the grammar. But the input does
conform to the grammar as written, interpreted solely as a
context-free grammar: it's just the annotations for XML serialization
that cause the problem. If 'plusop' were marked ^ rather than @,
there would be no problem serializing the result. On balance, I don't
think this is the right analysis.

If the "author" you have in mind is the writer of the ixml grammar,
then this sounds like saying that a grammar that produces
non-well-formed output is a faulty grammar. If so, then I think the
correct specification of the expected result is to say that
expr1.ixml is not a conforming grammar.

That analysis does have a wrinkle. If the file tests/expr1.ixml is
not a conforming ixml grammar, then a conforming processor is
currently required to reject it, even though on some inputs (e.g.
"1+2") the problem will not be visible. So we seem to have a choice:

- We can say that conforming processors must detect the problem here,
  even if the input does not exercise it.

or

- We can say that conforming processors are not required after all to
  detect and report errors in grammars.

The same choice arises in connection with checking whether
nonterminals are legal names, though we didn't notice when we were
discussing that topic.

Perhaps we do need another category. I think the philosophical
position you outline here, and what you have proposed in connection
with nonterminals and XML names, amount to distinguishing two classes
of errors: (1) errors in grammars we can check for and detect,
independently of any input string, and (2) errors that are only
detectable (or only readily detectable) given both grammar and input.

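To make the two classes concrete, here is a minimal Python sketch. It
is an illustration only, not part of the ixml spec or of any existing
processor: the function names, the data shapes, and the ASCII-only
approximation of XML names are all assumptions made for the example.
The first check needs nothing but the grammar's nonterminal names; the
second can only run once a particular input has been parsed and the
attributes for one element are known.

```python
import re

# Assumption: a simplified, ASCII-only approximation of an XML name.
# The real XML Name production admits many more characters.
_XML_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*\Z")


def check_names(nonterminals):
    """Class (1): needs only the grammar.

    Returns the nonterminal names that are not (approximate) XML names.
    """
    return [name for name in nonterminals if not _XML_NAME.match(name)]


def check_attributes(attributes):
    """Class (2): needs grammar and input together.

    `attributes` is a list of (name, value) pairs that the markings
    produced for one element while parsing one particular input.
    Returns each attribute name that occurs more than once.
    """
    seen = set()
    duplicates = []
    for name, _value in attributes:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates
```

With these hypothetical helpers, check_names(["expr", "1st"]) flags
"1st" before any input is read, while check_attributes can report a
repeated 'plusop' only for an input whose parse attaches that
attribute to the same element more than once.
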
I'll call these static and dynamic errors for short.

Since static errors are detectable by examining grammars in
isolation, it's plausible to call them errors in the grammar.

Since dynamic errors are not required to be detected (or possibly not
detectable) by examining grammars in isolation, we may or may not
call them errors in the grammar, or errors in the input, or something
else.

They are perhaps errors in the grammar: they are failures of the
grammar writer to guarantee that the markings specify well-formed
output for all grammatical inputs. (But we require processors to
reject non-conforming grammars: if non-conformance is not statically
detectable, that requirement is impossible to satisfy.)

They are perhaps errors in the input. Only, 'error' in a spec like
ours usually means that something is non-conforming. We have
conformance rules for grammars and for processors; input is not
something that can conform or fail to conform to our spec.

Or perhaps they are best described as dynamic errors or dynamic
exceptions, without pointing either at the grammar or at the input.
They are situations that can arise while parsing input against a
conforming grammar.

If we do need another category of outcome from a parsing run or test
case (I'm still thinking), I am tempted to say:

- Such a run-time exception is not an error in the grammar;
  conforming grammars can have run-time exceptions on some inputs.
  (So processors are not required to detect and reject grammars that
  could have run-time exceptions for some inputs.)

- Such a run-time exception is not a sign that the input is not
  grammatical.

- Conforming processors are required to report such run-time
  exceptions and MAY recover from them. (In the case of expr1.*,
  recovery might take the form of choosing any one of the 'plusOp'
  attribute-value specifications to include and discarding the
  others, or it might take the form of ignoring the @ marking on
  plusOp and serializing it as an element.)

Michael
Received on Wednesday, 22 December 2021 19:07:23 UTC