Re: non-XML characters (e.g. #1) from Dave Pawson on 2022-01-04 (public-ixml@w3.org from January 2022)

From: Dave Pawson <dave.pawson@gmail.com>
Date: Tue, 4 Jan 2022 07:49:29 +0000
To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
Cc: Steven Pemberton <steven.pemberton@cwi.nl>, ixml <public-ixml@w3.org>
Message-ID: <CAEncD4fuSB0sjfykLTTvQLU=zPz36=Kuc4jUPOWhom2gOMYVig@mail.gmail.com>
On Mon, 3 Jan 2022 at 15:57, C. M. Sperberg-McQueen
<cmsmcq@blackmesatech.com> wrote:
> >>
> >> On Monday 03 January 2022 11:19:25 (+01:00), Dave Pawson wrote:
> >>
> >>> On Mon, 3 Jan 2022 at 10:04, Steven Pemberton <steven.pemberton@cwi.nl>
> >> wrote:
> >>>> Output, on the other hand, is a different issue, because of the XML
> >>>> misfeature of excluding most C0 characters from content. (If I were
> >>>> redesigning XML, I would allow those characters, but only expressed in
> >>>> encoded form.)
> >>>>
> >>>> So, it is just fine to accept XML illegal characters in the input, as
> >> long as they don't end up in the output:
> >
> > Then what are you saying above?
> > I provide C0 char in, "it doesn't end up in the output"
> > IMHO that is modifying my data as given to the application?

I still assert you (Michael) are saying the same thing?

>
> Let me try phrasing it differently.


> So the mapping from non-XML data to XML data cannot always
> succeed.  What happens when it doesn’t?  Is it OK?  Is it an error
> in the processor?  Is it an error in the grammar? Is it an error in the
> input?  It doesn’t fit the short description of ixml, because we had
> input and a grammar and we did not get XML out at the other end.
> The spec needs a story of some kind.  What should that story be?
>
> The position Steven is suggesting is (as I understand it):

...

>
> - In order to ensure that the output is in fact XML, the grammar must
> see to it that any non-XML characters in the input do not get
> written out as data in an XML document.  The obvious way to do
> this is to mark the relevant terminals as hidden, as in Steven’s example
>
>     -[#0 - #1F]

And if the user a) has such characters in the input and b)
has no such rule in the grammar?



>
> There may be other ways to write the grammar so as to ensure that
> a U+0001 in the input does not end up making it impossible for the
> processor to produce XML output, although I cannot think of any off hand.

Thanks Michael, at least I understand the mechanism (and that the onus
rests simply on the user!)


> - Steven’s remark "And assuring those characters don't get through to
> the output is the grammar author's responsibility” leads to a story in
> which an attempt to write out a non-XML character in ixml output is
> an error in the grammar.   Possibly, like other cases that have been
> brought up, it’s what I would call a “run-time error in the grammar” —
> that is, an error in the grammar that may be caught only for some
> inputs, and which a processor is not obligated to detect in other
> cases.

Either way, the user has made an error and surely must be
told of it.
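
To make "told of it" concrete, here is a minimal sketch of the kind of
run-time check I have in mind, assuming a processor that serializes text
nodes through one function. The names NonXMLCharacterError and
check_output_text are hypothetical, not taken from any ixml
implementation; the character ranges are the Char production of XML 1.0.

```python
class NonXMLCharacterError(ValueError):
    """Hypothetical error: the output would contain a non-XML character."""


def is_xml_char(ch: str) -> bool:
    # Char production of XML 1.0:
    # #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def check_output_text(text: str) -> str:
    """Return text unchanged, or raise if it cannot be written as XML."""
    for offset, ch in enumerate(text):
        if not is_xml_char(ch):
            raise NonXMLCharacterError(
                f"cannot serialize U+{ord(ch):04X} at offset {offset} as XML; "
                "hide the character in the grammar or fix the input"
            )
    return text
```

The point is only that the user gets an error naming the character,
rather than a file that no XML parser will accept.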


>
> It might be nicer to require the processor to detect the error regardless
> of the input, but it might be very tricky to analyse a grammar and prove that
> no possible input would ever cause an attempt to write a non-XML
> character to the output.  I would not swear that there is not a theorem
> proving that it cannot be done, or that it’s equivalent to the Halting
> Problem.  All I know is that it doesn’t look easy.

This bit I don't understand.
   The grammar can tell you to map C0 chars to something (nothing?
omitted from the output?), yet you couldn't spot them otherwise?
Is this what the para above says?


>
> So:  Steven is not proposing that input containing U+0001 be
> illegal, nor that it be modified silently to change the character to
> something else.  He is observing that the grammar writer already
> has the responsibility of saying what parts of the input get written
> out to the XML output and is thus in a position to write a grammar
> that ensures that non-XML characters do not appear in the output.

(Now) understood... I'd put a pound to a penny that many other
potential users will fall foul of this aspect of ixml.
 [Sorry Steven if I misinterpreted your initial comment]


>
> Those things could of course be proposed — you did propose,
> if I understood you correctly, that ixml just specify that all inputs
> have to be streams of XML characters, and I think that would make
> life simpler for me as an implementor.  No one that I know of has
> proposed that non-XML characters in the input be legal but
> silently changed to something else.

It was my misinterpretation.
  I'm wondering why Norm insists this is a good thing? What is
a user going to do with an ... invalid? non-well-formed? XML file
produced when he/she omits such a rule from the grammar and
the customer includes such a character in an input file? Chaos, Norm?
  How many XML parsers can point to character n as being
bad in this way?
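
As one data point (not a survey), Python's standard library parser, which
uses expat underneath, does report a line and column for the failure via
ParseError.position. A small sketch, where report_bad_xml is a
hypothetical helper name of my own:

```python
import xml.etree.ElementTree as ET


def report_bad_xml(doc: str):
    """Return (line, column, message) if doc fails to parse, else None."""
    try:
        ET.fromstring(doc)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple supplied by the parser.
        line, column = err.position
        return line, column, str(err)
    return None


# A U+0001 in element content makes the document non-well-formed.
print(report_bad_xml("<a>\x01</a>"))
```

Whether other parsers are as helpful I can't say; the user still has to
work back from that position to the original input.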


> I hope this helps.
>
> Michael

Yes, thanks Michael.

I still prefer the simplicity of constraining input to valid XML
characters (if that can be checked)
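
On the "if that can be checked" part: a minimal sketch, assuming the input
is already decoded to a string of Unicode characters. The regular
expression is the complement of the XML 1.0 Char production;
first_non_xml_char is a hypothetical helper name.

```python
import re

# Anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
NON_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def first_non_xml_char(text: str):
    """Return (offset, 'U+XXXX') for the first non-XML character, or None."""
    match = NON_XML_CHAR.search(text)
    if match is None:
        return None
    return match.start(), f"U+{ord(match.group()):04X}"


print(first_non_xml_char("hello\tworld\n"))  # None
print(first_non_xml_char("ab\x01c"))         # (2, 'U+0001')
```

So the check itself is a single pass over the input before parsing starts.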

regards


-- 
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.
Received on Tuesday, 4 January 2022 07:50:53 UTC