non-XML characters (e.g. #1)

Working through Steven’s tests (and making more corrections
to the expected results in tests-SP-MSM), I have run across an
interesting policy issue: what should a processor do with a
reference in a grammar to character #1?

It’s not an XML 1.0 character (the only C0 control characters
allowed in XML 1.0 are U+0009, U+000A, and U+000D),
so it cannot be represented in the XML form of the grammar.
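
For concreteness, here is a rough Python sketch of the XML 1.0
rule, following the Char production of that spec.  The function
name is invented for illustration and is not taken from any
existing processor.

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production.

    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
           | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    (Hypothetical helper name, for illustration only.)
    """
    return (
        cp in (0x9, 0xA, 0xD)          # the only C0 controls allowed
        or 0x20 <= cp <= 0xD7FF        # up to just before the surrogates
        or 0xE000 <= cp <= 0xFFFD      # excludes U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


if __name__ == "__main__":
    for cp in (0x0, 0x1, 0x9, 0x7E):
        print(f"#{cp:x}: {is_xml10_char(cp)}")
```

On this definition #1 and #0 are both rejected, while #9 and
#7e are accepted.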

For processors which build their data structures directly from
the ixml form, and which have no trouble with character U+0001,
a reference to #1 need cause no trouble, unless the user asks
the parser to turn that grammar itself into XML.  (And even
then, it may only matter in some contexts.)  At which point
we are back to an issue raised already: what happens when the
combination of input plus grammar produces non-well-formed
output?

And of course at least some processors which can handle #1
will not be able to handle #0.

What happens in my processor is that when I create the XML
form of the grammar in test hex3, all is well and I get the XML

<ixml>
  <rule name="hex">:<alt>
      <literal dstring="a"/>,<inclusion>[<range from="#1" to="#7e">-</range>]</inclusion>,<literal dstring="b"/>
    </alt>.</rule>
</ixml>

(As you can see, I have not yet updated my internal copy of
the ixml grammar, so colons and semicolons and such are
appearing as literals.)

When I compile the grammar, the code naively attempts to
turn #1 into a character, and compilation fails.
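
One way a compiler might sidestep this kind of failure is to
keep the hex reference as an integer code point rather than
converting it to a character straight away.  The Python below
is only a sketch under that assumption: the function name and
the choice to reject values above #10FFFF are mine, not
anything the ixml spec or my processor prescribes.

```python
import string

MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point


def decode_hex_ref(ref: str) -> int:
    """Decode a hex reference such as '#1' or '#7e' to an integer.

    Hypothetical helper: the value stays an integer, so no attempt
    is made to build a character the host environment may refuse.
    """
    digits = ref[1:]
    if not ref.startswith("#") or not digits:
        raise ValueError(f"not a hex reference: {ref!r}")
    if any(c not in string.hexdigits for c in digits):
        raise ValueError(f"non-hex digit in reference: {ref!r}")
    cp = int(digits, 16)
    # Assumed policy: values beyond the Unicode range are errors.
    if cp > MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {ref!r}")
    return cp


if __name__ == "__main__":
    lo, hi = decode_hex_ref("#1"), decode_hex_ref("#7e")
    # Range membership can be tested on integers alone.
    print(lo <= ord("a") <= hi)
```

Whether the range is then accepted, rejected, or accepted with
a warning remains the policy question; the sketch only moves
the decision out of the character-conversion step.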

If producing ill-formed output counts as a run-time error in the
grammar, and the implicit claim is that an error-free ixml
grammar will never produce ill-formed output on any input, then
the ixml grammar for ixml grammars itself has such an error,
since it does not forbid hex references to non-XML (or indeed
non-Unicode) characters.

What do people think?

What do we do about this?

Is [#1 - #7e] a legal range?

Michael

Received on Saturday, 1 January 2022 22:29:12 UTC