Re: non-XML characters (e.g. #1)

Dave Pawson <dave.pawson@gmail.com> writes:
> On Mon, 3 Jan 2022 at 10:23, Steven Pemberton <steven.pemberton@cwi.nl> wrote:
>> On Monday 03 January 2022 11:19:25 (+01:00), Dave Pawson wrote:
>>  > On Mon, 3 Jan 2022 at 10:04, Steven Pemberton <steven.pemberton@cwi.nl>
>>  > >
>>  > > So, it is just fine to accept XML illegal characters in the
>>  > > input, as long as they don't end up in the output:
>
> Then what are you saying above?
> I provide C0 char in, "it doesn't end up in the output"
> IMHO that is modifying my data as given to the application?

But modifying data is what ixml *is for*.

You write a grammar that translates some non-XML format into XML. Along
the way, you decide what items in the non-XML format get turned into
attributes, what items get turned into elements, what items get output
as characters, and what items get omitted.

All Steven is saying is that if you write a grammar that accepts input
that contains C0 control characters, you’d better make sure all the C0
control characters get omitted if you’re going to make XML at the end of
the day.

Consider this grammar for amounts of money in GBP (written on the fly
and untested, YMMV):

cost: "£"?, digit+, (".", digit+)? .
-digit: ["0"-"9"] .

If you parse “£1234.56” with that grammar, you get

<cost>£1234.56</cost>

Suppose for the sake of argument that “£” was not a valid XML character.
Then that XML output would be invalid. And that would be because *you*
wrote a grammar that generated something invalid!

You could instead have written the grammar like this:

cost: -"£"?, digit+, (".", digit+)? .
-digit: ["0"-"9"] .

And then you’d get

<cost>1234.56</cost>

The same logic applies to every character that really is not valid in XML.
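
To make "not valid in XML" concrete, here is a minimal Python sketch of
the check a grammar author might run over the characters a grammar lets
through to the output. It assumes XML 1.0 (the Char production); the
function names are made up for illustration and are not part of ixml or
of any ixml processor.

```python
def is_xml_char(ch: str) -> bool:
    """Return True if ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # most of the BMP, below the surrogates
        or 0xE000 <= cp <= 0xFFFD      # rest of the BMP, minus #FFFE and #FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def invalid_xml_chars(text: str) -> list[str]:
    """List, in ixml-style hex notation, the characters that XML 1.0 forbids."""
    return [f"#{ord(ch):X}" for ch in text if not is_xml_char(ch)]


def drop_invalid_xml_chars(text: str) -> str:
    """Hypothetical helper: what the output looks like once the grammar
    marks every forbidden character as deleted."""
    return "".join(ch for ch in text if is_xml_char(ch))
```

With this, "£" passes (so the first grammar above is in fact fine), while
an input such as "£12\x0134" reports #1 as the character the grammar
would have to delete.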

Does that help?

                                        Be seeing you,
                                          norm

--
Norm Tovey-Walsh
Saxonica

Received on Monday, 3 January 2022 16:31:26 UTC