- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Mon, 3 Jan 2022 08:57:07 -0700
- To: Dave Pawson <dave.pawson@gmail.com>
- Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, Steven Pemberton <steven.pemberton@cwi.nl>, ixml <public-ixml@w3.org>
> On 3,Jan2022, at 3:27 AM, Dave Pawson <dave.pawson@gmail.com> wrote:
>
> On Mon, 3 Jan 2022 at 10:23, Steven Pemberton <steven.pemberton@cwi.nl> wrote:
>>
>> On Monday 03 January 2022 11:19:25 (+01:00), Dave Pawson wrote:
>>
>>> On Mon, 3 Jan 2022 at 10:04, Steven Pemberton <steven.pemberton@cwi.nl> wrote:
>>>> Output, on the other hand, is a different issue, because of the XML
>>>> misfeature of excluding most C0 characters from content. (If I were
>>>> redesigning XML, I would allow those characters, but only expressed in
>>>> encoded form.)
>>>>
>>>> So, it is just fine to accept XML illegal characters in the input, as long
>>>> as they don't end up in the output:
>
> Then what are you saying above?
> I provide C0 char in, "it doesn't end up in the output"
> IMHO that is modifying my data as given to the application?

Let me try phrasing it differently.

As specified, ixml maps input into XML. The details of the mapping are
specified in the grammar, so the precise form of the output depends both
on the input and on the grammar. If the input were different, the output
might be different; if the grammar were different, the output might be
different. Output depends on input + grammar.

So far, I hope, that is non-controversial.

It is a consequence of the way XML is specified that character U+0001,
among others, cannot appear in any XML 1.0 or XML 1.1 document, and
cannot be referred to in any XML 1.0 document.

That, in turn, means that any attempt to include that character in the
XML output of an XML processor is doomed to failure.

So the mapping from non-XML data to XML data cannot always succeed. What
happens when it doesn’t? Is it OK? Is it an error in the processor? Is it
an error in the grammar? Is it an error in the input? It doesn’t fit the
short description of ixml, because we had input and a grammar and we did
not get XML out at the other end. The spec needs a story of some kind.
What should that story be?
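To make the constraint concrete, here is a minimal Python sketch of the check a serializer would have to apply. The ranges are those of the `Char` production in XML 1.0; the names `is_xml10_char` and `first_non_xml_char` are illustrative only and are not part of ixml or of any XML library.

```python
# Sketch: which code points may appear literally in an XML 1.0 document.
# Ranges follow the Char production of XML 1.0; all names are illustrative.

XML10_CHAR_RANGES = (
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
)


def is_xml10_char(code_point: int) -> bool:
    """Return True if the code point matches the XML 1.0 Char production."""
    return any(lo <= code_point <= hi for lo, hi in XML10_CHAR_RANGES)


def first_non_xml_char(text: str):
    """Return (index, character) of the first non-XML character, or None."""
    for index, ch in enumerate(text):
        if not is_xml10_char(ord(ch)):
            return index, ch
    return None
```

For example, `is_xml10_char(0x1)` is false while `is_xml10_char(0x9)` is true, so an input containing U+0001 cannot be copied to the output unchanged.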
The position Steven is suggesting is (as I understand it):

- Input is allowed to contain any Unicode character.

- In order to describe the input, grammars may refer to (or contain) any
  Unicode character.

- In order to ensure that the output is in fact XML, the grammar must see
  to it that any non-XML characters in the input do not get written out
  as data in an XML document. The obvious way to do this is to mark the
  relevant terminals as hidden, as in Steven’s example

      -[#0 - #1F]

  There may be other ways to write the grammar so as to ensure that a
  U+0001 in the input does not end up making it impossible for the
  processor to produce XML output, although I cannot think of any
  offhand. (If we had a way to replace a character with its hex code, I
  could write a grammar to write out a U+0001 character as \u0001 or
  &my-ncr-0001; or <?hex 0001?> or something similar, using a
  non-standard method of escaping that character in an XML context
  (because there is no standard way). But we don’t have that in ixml now
  and no one has suggested it.)

- Steven’s remark “And assuring those characters don't get through to the
  output is the grammar author's responsibility” leads to a story in
  which an attempt to write out a non-XML character in ixml output is an
  error in the grammar. Possibly, like other cases that have been brought
  up, it’s what I would call a “run-time error in the grammar”: that is,
  an error in the grammar that may be caught only for some inputs, and
  which a processor is not obligated to detect in other cases. It might
  be nicer to require the processor to detect the error regardless of the
  input, but it might be very tricky to analyse a grammar and prove that
  no possible input would ever cause an attempt to write a non-XML
  character to the output. I would not swear that there is not a theorem
  proving that it cannot be done, or that it’s equivalent to the Halting
  Problem. All I know is that it doesn’t look easy.
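The “run-time error in the grammar” story above can be sketched in Python as follows. This is only an illustration under simplifying assumptions: matched terminals are modelled as `(character, hidden)` pairs, XML escaping of `<` and `&` is left out, and `XmlOutputError` and `serialize_terminals` are hypothetical names, not anything defined by ixml.

```python
# Sketch only: matched terminals are modelled as (character, hidden) pairs.
# XmlOutputError and serialize_terminals are hypothetical names.

class XmlOutputError(Exception):
    """Raised when a non-XML character would be written to the output."""


def is_xml10_char(code_point: int) -> bool:
    """Return True if the code point matches the XML 1.0 Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def serialize_terminals(terminals) -> str:
    """Concatenate non-hidden characters, failing on any non-XML character."""
    out = []
    for ch, hidden in terminals:
        if hidden:
            continue  # hidden terminals (marked "-") never reach the output
        if not is_xml10_char(ord(ch)):
            # The error is detected only when this particular input occurs.
            raise XmlOutputError(f"cannot write U+{ord(ch):04X} to XML output")
        out.append(ch)
    return "".join(out)
```

With the U+0001 terminal hidden, `serialize_terminals([("a", False), ("\x01", True), ("b", False)])` yields `"ab"`; with the same terminal not hidden, the call raises `XmlOutputError`. The error surfaces only for inputs that actually contain such a character, which is what makes it a run-time error rather than one detectable from the grammar alone.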
So: Steven is not proposing that input containing U+0001 be illegal, nor
that it be modified silently to change the character to something else.
He is observing that the grammar writer already has the responsibility of
saying what parts of the input get written out to the XML output and is
thus in a position to write a grammar that ensures that non-XML
characters do not appear in the output.

Those things could of course be proposed; you did propose, if I
understood you correctly, that ixml just specify that all inputs have to
be streams of XML characters, and I think that would make life simpler
for me as an implementor. No one that I know of has proposed that non-XML
characters in the input be legal but silently changed to something else.

I think the idea that a processor might modify the input may have come
from my musings about what my XDM-based processor might do with a range
like [#1 - #7e]. I could implement such a range by providing a function
that turns the input character into an integer and compares that integer
to the numbers 1 and 126, and signals a match if
1 <= character-number <= 126. Or I could implement such a range by
checking the input character against the XPath regular expression [	-~],
which on the face of it does not mean the same thing, but which is
guaranteed to produce the same result on every test that can be presented
to my code.

Since I am working on XML 1.0 strings, I know in advance that character
U+0001 does not and cannot occur in my input, so I do not need to find a
way to write an XPath regular expression that deals with that character;
if I translate an ixml inclusion or exclusion into an XPath regular
expression, the requirement is that the XPath regex have the correct
behavior on all possible inputs. It is not required that it have correct
behavior on impossible inputs.

I hope this helps.

Michael
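The claim that the two implementations of [#1 - #7e] agree on every possible input can be checked with a short Python sketch. It uses Python's `re` module as a stand-in for XPath regular expressions and assumes the character class in the message starts at TAB (U+0009); the function names are illustrative.

```python
import re

# Stand-in for the XPath regex: the range TAB (U+0009) through "~" (U+007E).
TAB_TO_TILDE = re.compile(r"[\t-~]")


def in_range_by_code_point(ch: str) -> bool:
    """First implementation: compare the code point against 1 and 126."""
    return 1 <= ord(ch) <= 126


def in_range_by_regex(ch: str) -> bool:
    """Second implementation: match the character against the regex."""
    return TAB_TO_TILDE.fullmatch(ch) is not None


def xml10_chars():
    """Yield every character allowed by the XML 1.0 Char production."""
    ranges = ((0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
              (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF))
    for lo, hi in ranges:
        for code_point in range(lo, hi + 1):
            yield chr(code_point)


def implementations_agree() -> bool:
    """True if both tests give the same answer on every XML 1.0 character."""
    return all(in_range_by_code_point(ch) == in_range_by_regex(ch)
               for ch in xml10_chars())
```

The two functions disagree on U+0001 itself, but that character is not among the possible inputs, so `implementations_agree()` holds over the whole XML 1.0 character set.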
Received on Monday, 3 January 2022 15:57:29 UTC