Re: non-XML characters (e.g. #1)

On Sun, 2 Jan 2022 at 18:13, C. M. Sperberg-McQueen <
cmsmcq@blackmesatech.com> wrote:

>
>
> > On 2 Jan 2022, at 12:56 AM, Dave Pawson <dave.pawson@gmail.com> wrote:
> >
> > Another scope issue Michael?
> > Your point about "For processors which build their data structures
> > direct from the ixml form," raises a 'dent' in a simple scoping
> > statement (input must be UTF-8 within XML character constraints).
>
> At the moment, I don’t think we do have a rule requiring that input
> fall within XML constraints.  If we did, I could simply update the
> test case to require the result ’this is not a conforming grammar’.


No, I was suggesting such an addition.
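
To make the suggestion concrete, here is a minimal sketch in Python of
the check such a rule would imply. It is only my illustration, not
anything from the ixml spec: the function names are hypothetical, and I
am assuming hex references are written as '#' followed by hex digits,
as in the hex3 test. The ranges are those of the XML 1.0 Char
production.

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)          # the only C0 controls XML 1.0 allows
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD      # excludes surrogates, U+FFFE, U+FFFF
        or 0x10000 <= cp <= 0x10FFFF
    )


def check_hex_reference(ref: str) -> bool:
    """Check an ixml-style hex reference such as '#1' or '#7e'.

    Hypothetical helper: returns True when the referenced code point
    could be represented in an XML 1.0 document.
    """
    if not ref.startswith("#"):
        raise ValueError(f"not a hex reference: {ref!r}")
    return is_xml10_char(int(ref[1:], 16))
```

Under such a rule, check_hex_reference("#1") is False and
check_hex_reference("#7e") is True, so a range like [#1 - #7e] would be
rejected because of its lower bound.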


>
>
> > Define it as 'user error' (you asked for it, you got it)? I don't
> > like that.
> >
> > Report it as a 'warning' with a reason? I think this would be my
> > preference?
>
> I’m not sure what it would mean to call this or other things user
> errors.  I think we get to define rules for grammars and processors,
> not users.
>

A) The user asked for XML output and gave ‘bad’ input, hence IMHO a user error.
B) I would hate it if the processor modified my input without telling me.
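
A rough sketch of what I mean by (B), again in Python and purely
illustrative (the function names are mine, not from any ixml
processor): report the offending characters as a warning with a
reason, and hand the text back untouched rather than silently
altering it.

```python
import warnings


def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def non_xml_chars(text: str) -> list:
    """List (offset, 'U+XXXX') pairs for characters XML 1.0 cannot hold."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if not is_xml10_char(ord(ch))
    ]


def check_output(text: str) -> str:
    """Warn about non-XML characters; never modify the text itself."""
    bad = non_xml_chars(text)
    if bad:
        details = ", ".join(f"{name} at offset {i}" for i, name in bad)
        warnings.warn(f"output contains non-XML characters: {details}")
    return text  # returned unchanged, so the user sees what they supplied
```

Whether this should be a warning or a hard error is exactly the policy
question; the sketch only shows that telling the user costs very little.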

Regards

> Michael
>
>
> >
> > regards
> >
> >
> > On Sat, 1 Jan 2022 at 22:29, C. M. Sperberg-McQueen
> > <cmsmcq@blackmesatech.com> wrote:
> >>
> >> Working through Steven’s tests (and making more corrections
> >> in the expected results in tests-SP-MSM), I run across an
> >> interesting policy issue:  what should a processor do with a
> >> reference in a grammar to character #1?
> >>
> >> It’s not an XML 1.0 character (the only C0 control characters
> >> allowed in XML 1.0 are U+0009, U+000A, and U+000D),
> >> so it cannot be represented in the XML form of the grammar.
> >>
> >> For processors which build their data structures direct from
> >> the ixml form, and which have no trouble with character U+0001,
> >> a reference to #1 need cause no trouble, unless the user asks
> >> the parser to turn that grammar itself into XML.  (And even
> >> then, it may only matter in some contexts.)  At which point
> >> we are back to an issue raised already: what happens when the
> >> combination of input plus grammar produces non-well-formed
> >> output?
> >>
> >> And of course at least some processors which can handle #1
> >> will not be able to handle #0.
> >>
> >> What happens in my processor is that when I create the XML
> >> form of the grammar in test hex3, all is well and I get the XML
> >>
> >> <ixml>
> >>  <rule name="hex">:<alt>
> >>      <literal dstring="a"/>,<inclusion>[<range from="#1" to="#7e">-</range>]</inclusion>,<literal dstring="b"/>
> >>    </alt>.</rule>
> >> </ixml>
> >>
> >> (As you can see, I have not yet updated my internal copy of
> >> the ixml grammar, so colons and semicolons and such are
> >> appearing as literals.)
> >>
> >> When I compile the grammar,  the code naively attempts to
> >> turn #1 into a character, and compilation fails.
> >>
> >> If it’s a run-time error in the grammar, and the implicit claim is
> >> that an error-free ixml grammar will never produce ill-formed
> >> output on any input, then we have a run-time error in the
> >> grammar for ixml grammars, since it does not forbid hex
> >> references to non-XML (or indeed non-Unicode) characters.
> >>
> >> What do people think?
> >>
> >> What do we do about this?
> >>
> >> Is [#1 - #7e] a legal range?
> >>
> >> Michael
> >>
> >>
> --
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.

Received on Sunday, 2 January 2022 18:19:36 UTC