- From: Dave Pawson <dave.pawson@gmail.com>
- Date: Mon, 3 Jan 2022 16:18:14 +0000
- To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
- Cc: Steven Pemberton <steven.pemberton@cwi.nl>, ixml <public-ixml@w3.org>
<grin/> I note your implementer's view, Michael!
Picking out the bits of interest to me. Hope you can map those to your perspective.

On Mon, 3 Jan 2022 at 15:57, C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com> wrote:
>
> > On 3 Jan 2022, at 3:27 AM, Dave Pawson <dave.pawson@gmail.com> wrote:
> >
> > On Mon, 3 Jan 2022 at 10:23, Steven Pemberton <steven.pemberton@cwi.nl> wrote:
> >>
>
> Let me try phrasing it differently.
>
> As specified, ixml maps input into XML. The details of the mapping
> are specified in the grammar, so the precise form of the output depends
> both on the input and the grammar. If the input were different, the output
> might be different; if the grammar were different, the output might be
> different. Output depends on input + grammar. So far, that will I hope
> be non-controversial.

To me, there is an implicit 'what you gave me, I pass through to output'
there, but I may be wrong.

> It is a consequence of the way XML is specified that character U+0001,
> among others, cannot appear in any XML 1.0 or XML 1.1 document,
> and cannot be referred to in any XML 1.0 document. That, in turn,
> means that any attempt to include that character in the XML output
> of an XML processor is doomed to failure.

Agreed.

> So the mapping from non-XML data to XML data cannot always
> succeed. What happens when it doesn’t? Is it OK? Is it an error
> in the processor? Is it an error in the grammar? Is it an error in the
> input? It doesn’t fit the short description of ixml, because we had
> input and a grammar and we did not get XML out at the other end.
> The spec needs a story of some kind. What should that story be?
>
> The position Steven is suggesting is (as I understand it):
>
> - Input is allowed to contain any Unicode character.

A position which I think is in error.

> - In order to describe the input, grammars may refer to (or contain)
> any Unicode character.

Again, I disagree and believe the spec should say so.

> - Steven’s remark “And assuring those characters don't get through to
> the output is the grammar author's responsibility” leads to a story in
> which an attempt to write out a non-XML character in ixml output is
> an error in the grammar. Possibly, like other cases that have been
> brought up, it’s what I would call a “run-time error in the grammar”:
> that is, an error in the grammar that may be caught only for some
> inputs, and which a processor is not obligated to detect in other
> cases.

A workaround to a spec weakness?

> It might be nicer to require the processor to detect the error regardless
> of the input, but it might be very tricky to analyse a grammar and prove that
> no possible input would ever cause an attempt to write a non-XML
> character to the output. I would not swear that there is not a theorem
> proving that it cannot be done, or that it’s equivalent to the Halting
> Problem. All I know is that it doesn’t look easy.

Fair enough, I can accept that.

> So: Steven is not proposing that input containing U+0001 be
> illegal, nor that it be modified silently to change the character to
> something else. He is observing that the grammar writer already
> has the responsibility of saying what parts of the input get written
> out to the XML output and is thus in a position to write a grammar
> that ensures that non-XML characters do not appear in the output.

Which IMHO leaves a hole needing a patch (in the spec).

> Those things could of course be proposed; you did propose,
> if I understood you correctly, that ixml just specify that all inputs
> have to be streams of XML characters, and I think that would make
> life simpler for me as an implementor. No one that I know of has
> proposed that non-XML characters in the input be legal but
> silently changed to something else.

I would like to propose that such characters be defined as illegal as
input. Surely the simplest solution?
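
To make the two options concrete, here is a minimal Python sketch of the check that both positions depend on: whether a character matches the XML 1.0 Char production. All function names are hypothetical and come from neither the ixml spec nor any implementation. check_input models the "illegal as input" proposal; write_text models the other story, where the problem surfaces only if a non-XML character is about to be written out.

```python
def is_xml10_char(ch: str) -> bool:
    """True if ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def first_non_xml_char(text: str):
    """Index of the first non-XML character in text, or None."""
    for i, ch in enumerate(text):
        if not is_xml10_char(ch):
            return i
    return None


def check_input(text: str) -> str:
    """Hypothetical policy A: reject the whole input up front."""
    pos = first_non_xml_char(text)
    if pos is not None:
        raise ValueError(
            f"non-XML character U+{ord(text[pos]):04X} at offset {pos}"
        )
    return text


def write_text(kept: str) -> str:
    """Hypothetical policy B: fail only if a non-XML character
    would actually reach the output."""
    pos = first_non_xml_char(kept)
    if pos is not None:
        raise ValueError(f"cannot serialize U+{ord(kept[pos]):04X}")
    return kept
```

Under policy B, an input such as "a\x01b" is acceptable as long as the grammar never copies the U+0001 to the output; under policy A it is rejected whatever the grammar says.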

> I think the idea that a processor might modify the input may have
> come from my musings about what my XDM-based processor might
> do with a range like [#1 - #7e]. I could implement such a range
> by providing a function that turns the input character into an integer
> and compares that integer to the numbers 1 and 126, and signals
> a match if 1 <= character-number <= 126. Or I could implement
> such a range by checking the input character against the XPath
> regular expression [&#x9;-~], which on the face of it does
> not mean the same thing, but which is guaranteed to produce the
> same result on every test that can be presented to my code. Since
> I am working on XML 1.0 strings, I know in advance that character
> U+0001 does not and cannot occur in my input, so I do not need to
> find a way to write an XPath regular expression that deals with
> that character; if I translate an ixml inclusion or exclusion into
> an XPath regular expression, the requirement is that the XPath
> regex have the correct behavior on all possible inputs. It is not
> required that it have correct behavior on impossible inputs.

All of which sounds like a nasty workaround to me? And unnecessary?
Is this brought about by your use of XML documents as input data?

> I hope this helps.

Thanks, I think so.

regards

--
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.
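
For reference, the claim about the range [#1 - #7e] can be checked mechanically. The sketch below is only an illustration, not anyone's actual implementation: it uses Python's re module as a stand-in for XPath regular expressions, and every name in it is hypothetical. It compares the numeric test 1 <= code point <= 126 with the character class running from tab (U+0009) to "~", first over all code points below U+0100 and then over only those that are legal XML 1.0 characters.

```python
import re


def is_xml10_char(ch: str) -> bool:
    """True if ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def in_range_numeric(ch: str) -> bool:
    """The range as written in the grammar: #1 through #7e."""
    return 0x1 <= ord(ch) <= 0x7E


# Character class from tab (U+0009) to tilde (U+007E).
_TAB_TO_TILDE = re.compile(r"[\t-~]")


def in_range_regex(ch: str) -> bool:
    """The regex-based test, which starts at tab instead of U+0001."""
    return _TAB_TO_TILDE.fullmatch(ch) is not None


def disagreements(code_points):
    """Code points on which the two tests give different answers."""
    return [
        cp for cp in code_points
        if in_range_numeric(chr(cp)) != in_range_regex(chr(cp))
    ]


# Both tests reject everything above U+007E, so U+0000..U+00FF
# is enough to see every difference.
ALL_CODE_POINTS = range(0x100)
XML_LEGAL = [cp for cp in ALL_CODE_POINTS if is_xml10_char(chr(cp))]

differ_on_all = disagreements(ALL_CODE_POINTS)
differ_on_legal = disagreements(XML_LEGAL)
```

The two tests differ only on U+0001 through U+0008, none of which is an XML 1.0 character, so on XML-legal input they are indistinguishable; on arbitrary Unicode input they are not.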
Received on Monday, 3 January 2022 16:18:38 UTC