Re: non-XML characters (e.g. #1)

> On 3,Jan2022, at 3:27 AM, Dave Pawson <dave.pawson@gmail.com> wrote:
> 
> On Mon, 3 Jan 2022 at 10:23, Steven Pemberton <steven.pemberton@cwi.nl> wrote:
>> 
>> 
>> 
>> On Monday 03 January 2022 11:19:25 (+01:00), Dave Pawson wrote:
>> 
>>> On Mon, 3 Jan 2022 at 10:04, Steven Pemberton <steven.pemberton@cwi.nl>
>> wrote:
>>>> Output, on the other hand, is a different issue, because of the XML
>>>> misfeature of excluding most C0 characters from content. (If I were
>>>> redesigning XML, I would allow those characters, but only expressed in
>>>> encoded form.)
>>>> 
>>>> So, it is just fine to accept XML illegal characters in the input, as
>> long
>>>> as they don't end up in the output:
> 
> Then what are you saying above?
> I provide C0 char in, "it doesn't end up in the output"
> IMHO that is modifying my data as given to the application?

Let me try phrasing it differently.  

As specified, ixml maps input into XML.  The details of the mapping
are specified in the grammar, so the precise form of the output depends
both on the input and on the grammar.  If the input were different, the
output might be different; if the grammar were different, the output might
be different.  Output depends on input + grammar.  So far, I hope, that
will be non-controversial.

It is a consequence of the way XML is specified that character U+0001, 
among others, cannot appear in any XML 1.0 or XML 1.1 document,
and cannot be referred to in any XML 1.0 document.  That, in turn,
means that any attempt to include that character in the XML output 
of an ixml processor is doomed to failure. 
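
To make the restriction concrete, here is a minimal Python sketch of the
Char productions of XML 1.0 and XML 1.1.  The function names are
illustrative only; they are not part of ixml or of any processor
discussed in this thread.

```python
def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the Char production of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """True if code point cp matches the Char production of XML 1.1.

    Some of these characters (the RestrictedChar set, which includes
    U+0001) may appear in an XML 1.1 document only as character
    references, never literally.
    """
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )
```

Under these definitions U+0001 is rejected by is_xml10_char and accepted
by is_xml11_char, while U+0000 is rejected by both.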

So the mapping from non-XML data to XML data cannot always
succeed.  What happens when it doesn’t?  Is it OK?  Is it an error
in the processor?  Is it an error in the grammar? Is it an error in the
input?  It doesn’t fit the short description of ixml, because we had
input and a grammar and we did not get XML out at the other end.
The spec needs a story of some kind.  What should that story be?

The position Steven is suggesting is (as I understand it):

- Input is allowed to contain any Unicode character.

- In order to describe the input, grammars may refer to (or contain)
any Unicode character.

- In order to ensure that the output is in fact XML, the grammar must
see to it that any non-XML characters in the input do not get 
written out as data in an XML document.  The obvious way to do
this is to mark the relevant terminals as hidden, as in Steven’s example 

    -[#0 - #1F]

There may be other ways to write the grammar so as to ensure that 
a U+0001 in the input does not end up making it impossible for the
processor to produce XML output, although I cannot think of any offhand.
(If we had a way to replace a character with its hex code, I could 
write a grammar to write out a U+0001 character as \u0001 or 
&my-ncr-0001; or <?hex 0001?> or something similar, using a non-standard
method of escaping that character in an XML context (because there
is no standard way).  But we don’t have that in ixml now and no one
has suggested it.)

- Steven’s remark "And assuring those characters don't get through to 
the output is the grammar author's responsibility" leads to a story in
which an attempt to write out a non-XML character in ixml output is
an error in the grammar.   Possibly, like other cases that have been
brought up, it’s what I would call a “run-time error in the grammar” —
that is, an error in the grammar that may be caught only for some
inputs, and which a processor is not obligated to detect in other
cases.  
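
As a hedged illustration of the last two points, the Python sketch below
shows (a) a check made only when text is about to be written out, so
that the error surfaces only for inputs that actually trigger it, and
(b) the kind of non-standard hex escaping mentioned in the parenthesis
above.  Every name here is hypothetical: ixml defines neither an error
of this name nor any such escaping mechanism.

```python
class NonXMLCharacterError(ValueError):
    """Hypothetical error: the output would contain a non-XML character."""


def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the Char production of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def check_output_text(text: str) -> str:
    """Return text unchanged, or raise if it cannot be written as XML 1.0.

    The check runs per output string, so a grammar that lets U+0001
    through is reported only when some input actually contains U+0001.
    """
    for offset, ch in enumerate(text):
        if not is_xml10_char(ord(ch)):
            raise NonXMLCharacterError(
                f"U+{ord(ch):04X} at offset {offset} is not an XML 1.0 character"
            )
    return text


def escape_non_xml(text: str) -> str:
    """Non-standard escaping: rewrite each non-XML character as \\uXXXX."""
    return "".join(
        ch if is_xml10_char(ord(ch)) else f"\\u{ord(ch):04X}" for ch in text
    )
```

With these definitions, check_output_text("a\x01b") raises
NonXMLCharacterError, and escape_non_xml("a\x01b") returns the six
visible characters a\u0001 followed by b.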

It might be nicer to require the processor to detect the error regardless
of the input, but it might be very tricky to analyse a grammar and prove that
no possible input would ever cause an attempt to write a non-XML
character to the output.  I would not swear that there is not a theorem
proving that it cannot be done, or that it’s equivalent to the Halting
Problem.  All I know is that it doesn’t look easy.  

So:  Steven is not proposing that input containing U+0001 be
illegal, nor that it be modified silently to change the character to
something else.  He is observing that the grammar writer already
has the responsibility of saying what parts of the input get written
out to the XML output and is thus in a position to write a grammar
that ensures that non-XML characters do not appear in the output.

Those things could of course be proposed — you did propose,
if I understood you correctly, that ixml just specify that all inputs
have to be streams of XML characters, and I think that would make
life simpler for me as an implementor.  No one that I know of has
proposed that non-XML characters in the input be legal but 
silently changed to something else.

I think the idea that a processor might modify the input may have
come from my musings about what my XDM-based processor might
do with a range like [#1 - #7e].  I could implement such a range
by providing a function that turns the input character into an integer
and compares that integer to the numbers 1 and 126, and signals
a match if 1 <= character-number <= 126.  Or I could implement
such a range by checking the input character against the XPath 
regular expression [&#x09;-&#x7E;], which on the face of it does
not mean the same thing, but which is guaranteed to produce the
same result on every test that can be presented to my code.  Since
I am working on XML 1.0 strings, I know in advance that character
U+0001 does not and cannot occur in my input, so I do not need to
find a way to write an XPath regular expression that deals with 
that character; if I translate an ixml inclusion or exclusion into
an XPath regular expression, the requirement is that the XPath
regex have the correct behavior on all possible inputs.  It is not 
required that it have correct behavior on impossible inputs.
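
The equivalence claimed above can be checked by brute force.  The sketch
below is in Python, with Python's re module standing in for an XPath
regular expression engine (an assumption: the two regex dialects differ
in general, though not for a simple character class like this one).  All
names are illustrative.

```python
import re


def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the Char production of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def in_range_by_number(ch: str) -> bool:
    """The ixml range [#1 - #7e], tested by comparing code points."""
    return 0x1 <= ord(ch) <= 0x7E


# The narrower class [&#x09;-&#x7E;], written with Python escapes.
RANGE_RE = re.compile(r"[\x09-\x7E]")


def in_range_by_regex(ch: str) -> bool:
    """The same range, tested with the narrower regular expression."""
    return RANGE_RE.fullmatch(ch) is not None


def disagreements(code_points):
    """Code points on which the two implementations give different answers."""
    return [
        cp
        for cp in code_points
        if in_range_by_number(chr(cp)) != in_range_by_regex(chr(cp))
    ]


def xml10_code_points():
    """Every code point that can occur in an XML 1.0 string."""
    return (cp for cp in range(0x110000) if is_xml10_char(cp))
```

Here disagreements(range(0x110000)) is [1, 2, 3, 4, 5, 6, 7, 8], so the
two tests do differ on arbitrary Unicode input, while
disagreements(xml10_code_points()) is the empty list: the differences
fall entirely on characters that cannot occur in XML 1.0 strings.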

I hope this helps.

Michael

Received on Monday, 3 January 2022 15:57:29 UTC