Re: non-XML characters (e.g. #1)

<grin/> I note your implementer's view, Michael!

Picking out the bits of interest to me.
Hope you can map those to your perspective.

On Mon, 3 Jan 2022 at 15:57, C. M. Sperberg-McQueen
<cmsmcq@blackmesatech.com> wrote:
>
>
>
> > On 3 Jan 2022, at 3:27 AM, Dave Pawson <dave.pawson@gmail.com> wrote:
> >
> > On Mon, 3 Jan 2022 at 10:23, Steven Pemberton <steven.pemberton@cwi.nl> wrote:
> >>

> Let me try phrasing it differently.
>
> As specified, ixml maps input into XML.  The details of the mapping
> are specified in the grammar, so the precise form of the output depends
> both on the input and the grammar.  If the input were different, the output
> might be different; if the grammar were different, the output might be
> different.  Output depends on input + grammar.  So far, that will I hope
> be non-controversial.

To me, there is an implicit 'what you gave me, I pass through to output' there,
but I may be wrong.

>
> It is a consequence of the way XML is specified that character U+0001,
> among others, cannot appear in any XML 1.0 or XML 1.1 document,
> and cannot be referred to in any XML 1.0 document.  That, in turn,
> means that any attempt to include that character in the XML output
> of an XML processor is doomed to failure.

Agreed.
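
For reference, a minimal sketch in Python of the check being discussed.
The ranges are those of the Char production in the XML 1.0
Recommendation; the names XML10_CHAR_RANGES and is_xml10_char are my
own, purely for illustration, and nothing here comes from the ixml
spec.

```python
# Code point ranges of the XML 1.0 "Char" production.
XML10_CHAR_RANGES = (
    (0x9, 0xA),          # tab, line feed
    (0xD, 0xD),          # carriage return
    (0x20, 0xD7FF),      # up to, but excluding, the surrogate block
    (0xE000, 0xFFFD),    # excludes U+FFFE and U+FFFF
    (0x10000, 0x10FFFF), # supplementary planes
)


def is_xml10_char(ch: str) -> bool:
    """Return True if the single character ch may appear in XML 1.0."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in XML10_CHAR_RANGES)
```

With this, is_xml10_char("\x01") is False, which is the case under
discussion, while tab, line feed and ordinary letters pass.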


>
> So the mapping from non-XML data to XML data cannot always
> succeed.  What happens when it doesn’t?  Is it OK?  Is it an error
> in the processor?  Is it an error in the grammar? Is it an error in the
> input?  It doesn’t fit the short description of ixml, because we had
> input and a grammar and we did not get XML out at the other end.
> The spec needs a story of some kind.  What should that story be?
>
> The position Steven is suggesting is (as I understand it):
>
> - Input is allowed to contain any Unicode character.

A position which I think is in error.


>
> - In order to describe the input, grammars may refer to (or contain)
> any Unicode character.

Again, I disagree and believe the spec should say so.


>
> - Steven’s remark “And assuring those characters don't get through to
> the output is the grammar author's responsibility” leads to a story in
> which an attempt to write out a non-XML character in ixml output is
> an error in the grammar.   Possibly, like other cases that have been
> brought up, it’s what I would call a “run-time error in the grammar” —
> that is, an error in the grammar that may be caught only for some
> inputs, and which a processor is not obligated to detect in other
> cases.

A workaround to a spec weakness?


>
> It might be nicer to require the processor to detect the error regardless
> of the input, but it might be very tricky to analyse a grammar and prove that
> no possible input would ever cause an attempt to write a non-XML
> character to the output.  I would not swear that there is not a theorem
> proving that it cannot be done, or that it’s equivalent to the Halting
> Problem.  All I know is that it doesn’t look easy.

Fair enough, I can accept that.
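
The run-time alternative, where the error is caught only when a
particular input actually triggers it, might look roughly like the
sketch below. It assumes XML 1.0 output; NonXMLCharacterError and
write_text are hypothetical names of mine, not anything defined by
ixml or by any existing processor.

```python
import re

# Matches any character outside the XML 1.0 Char production.
_NON_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


class NonXMLCharacterError(ValueError):
    """Hypothetical dynamic error: output would contain a non-XML character."""


def write_text(text: str) -> str:
    """Escape text for XML output, refusing characters XML cannot hold."""
    bad = _NON_XML10.search(text)
    if bad:
        raise NonXMLCharacterError(
            f"cannot serialize U+{ord(bad.group()):04X} at offset {bad.start()}"
        )
    # Only reached when every character is a legal XML 1.0 character.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
```

The error depends on the input: the same grammar writes "a<b" without
complaint and fails only when the text to be written contains
something like U+0001.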


>
> So:  Steven is not proposing that input containing U+0001 be
> illegal, nor that it be modified silently to change the character to
> something else.  He is observing that the grammar writer already
> has the responsibility of saying what parts of the input get written
> out to the XML output and is thus in a position to write a grammar
> that ensures that non-XML characters do not appear in the output.

Which IMHO leaves a hole needing a patch (in the spec).

>
> Those things could of course be proposed — you did propose,
> if I understood you correctly, that ixml just specify that all inputs
> have to be streams of XML characters, and I think that would make
> life simpler for me as an implementor.  No one that I know of has
> proposed that non-XML characters in the input be legal but
> silently changed to something else.

I would like to propose that such characters be defined as illegal as input.
Surely the simplest solution?


>
> I think the idea that a processor might modify the input may have
> come from my musings about what my XDM-based processor might
> do with a range like [#1 - #7e].  I could implement such a range
> by providing a function that turns the input character into an integer
> and compares that integer to the numbers 1 and 126, and signals
> a match if 1 <= character-number <= 126.  Or I could implement
> such a range by checking the input character against the XPath
> regular expression [&#x09;-&#x7E;], which on the face of it does
> not mean the same thing, but which is guaranteed to produce the
> same result on every test that can be presented to my code.  Since
> I am working on XML 1.0 strings, I know in advance that character
> U+0001 does not and cannot occur in my input, so I do not need to
> find a way to write an XPath regular expression that deals with
> that character; if I translate an ixml inclusion or exclusion into
> an XPath regular expression, the requirement is that the XPath
> regex have the correct behavior on all possible inputs.  It is not
> required that it have correct behavior on impossible inputs.

All of which sounds like a nasty workaround to me? And unnecessary?
Is this brought about by your use of XML documents as input data?
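
The equivalence Michael describes can at least be checked
mechanically. A rough sketch, using Python's re module as a stand-in
for XPath regular expressions (an assumption; the two dialects differ
in other respects) and function names of my own invention:

```python
import re

# Code point ranges of the XML 1.0 "Char" production.
XML10_CHAR_RANGES = (
    (0x9, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
)

# Stand-in for the XPath regular expression [&#x09;-&#x7E;].
_RANGE_RE = re.compile(r"[\x09-\x7E]")


def matches_by_number(ch: str) -> bool:
    """The ixml range [#1 - #7e], implemented as an integer comparison."""
    return 1 <= ord(ch) <= 0x7E


def matches_by_regex(ch: str) -> bool:
    """The same range, implemented with the narrower regular expression."""
    return _RANGE_RE.fullmatch(ch) is not None


def strategies_agree() -> bool:
    """Compare both strategies on every legal XML 1.0 character."""
    return all(
        matches_by_number(chr(cp)) == matches_by_regex(chr(cp))
        for lo, hi in XML10_CHAR_RANGES
        for cp in range(lo, hi + 1)
    )
```

The two strategies agree on every character that can occur in an
XML 1.0 string, and disagree only on inputs such as U+0001 that
cannot reach the second one in the first place.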

>
> I hope this helps.

Thanks, I think so.

regards



-- 
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.

Received on Monday, 3 January 2022 16:18:38 UTC