Re: non-XML characters (e.g. #1)

> On 1 Jan 2022, at 3:28 PM, C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com> wrote:
> ...
> Is [#1 - #7e] a legal range?
> 

Well, the current spec says 

> The digits are interpreted as a number in hexadecimal, and the character at that Unicode code-point is used [Unicode]. The number must be within the Unicode code-point range.

and does not say they are required to be XML characters, so
under the current spec, [#1 - #7e] is indeed a legal range.

I am not sure what a conforming processor should do if
asked to serialize, in XML, an instance of the grammar

  S = #1+.

which recognizes the language consisting of non-empty sequences
of the character U+0001, which cannot be represented in
XML 1.0.

I am also not sure what my processor should do when
trying to implement a range like [#1 - #7e].  I suppose 
I can (a) detect that the only XML characters that fall
into that range also fall into the range [#9 - #7e] and
(b) represent the two ranges the same way internally.

That is, if a non-XML character number is used as the
first character in a range, find the next larger XML character
and use it instead.  If it’s the final character in the range,
find the next smaller XML character and use that instead.
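
The endpoint adjustment just described might be sketched as follows.
This is only an illustration in Python, not code from any actual
processor; the names XML_CHAR_RANGES, next_xml_char, prev_xml_char
and normalize_range are hypothetical, and the ranges are those of
the Char production in XML 1.0.

```python
# XML 1.0 Char production, as inclusive code-point ranges (ascending).
XML_CHAR_RANGES = [(0x9, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
                   (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]

def next_xml_char(cp):
    """Smallest XML character >= cp, or None if there is none."""
    for lo, hi in XML_CHAR_RANGES:
        if cp <= hi:
            return max(cp, lo)
    return None

def prev_xml_char(cp):
    """Largest XML character <= cp, or None if there is none."""
    for lo, hi in reversed(XML_CHAR_RANGES):
        if cp >= lo:
            return min(cp, hi)
    return None

def normalize_range(first, last):
    """Move each endpoint inward to the nearest XML character.

    Returns (lo, hi), or None when no XML character lies in the range.
    """
    lo = next_xml_char(first)
    hi = prev_xml_char(last)
    if lo is None or hi is None or lo > hi:
        return None
    return (lo, hi)

# [#1 - #7e] and [#9 - #7e] end up with the same internal form.
assert normalize_range(0x1, 0x7E) == normalize_range(0x9, 0x7E) == (0x9, 0x7E)
```

The normalized range may still span non-XML code points in its
interior (#B, #C, #E through #1F here); under the assumption that
the input consists only of XML characters, those can never be
matched and need no special handling.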

And if such a non-XML character is used as an encoded literal, 
it can never be matched by any XML character, and thus not
by any character that can occur in my input (since I work
only on XML strings, either supplied as values or read from
files), and it is thus equivalent to [].
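
A small sketch of that equivalence, again in Python with
hypothetical names: an encoded literal is mapped to the set of
input characters it can match, which is empty for a non-XML
code point. It assumes, as above, that the input contains only
XML characters.

```python
# XML 1.0 Char production, as inclusive code-point ranges.
XML_CHAR_RANGES = [(0x9, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
                   (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]

def encoded_literal_set(code_point):
    """Set of input characters that an encoded literal such as #1 can match."""
    if any(lo <= code_point <= hi for lo, hi in XML_CHAR_RANGES):
        return frozenset({chr(code_point)})
    # A non-XML character never occurs in XML input: same as [].
    return frozenset()

assert encoded_literal_set(0x1) == frozenset()
assert encoded_literal_set(0x7E) == frozenset({"~"})
```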

So it should be possible for implementations like mine to 
live with the current rule.  I’m not sure I like it, but it’s
possible, assuming I have correctly identified the cases in
which I may see a hex reference to a non-XML character.

From my point of view, a simpler fix would be to change 
the second sentence of the quoted material to:

    The number must denote an XML character.

or

    The number must denote a character which matches
    the Char nonterminal in the XML specification.

In that case, processors which want to accept data with
non-XML characters will do so as an extension, just as
processors which identify grammatical prefixes of the input
or which return alternate representations of the parse
tree(s) do.
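
For comparison, here is a rough Python sketch of how the current
rule and the proposed rule might be checked when a grammar is read.
The function name decode_encoded_char and its require_xml_char flag
are hypothetical, the token syntax is simplified to "#" followed by
hex digits, and reporting the problem as a ValueError is an
assumption, not something the spec prescribes.

```python
import re

# XML 1.0 Char production, as inclusive code-point ranges.
XML_CHAR_RANGES = [(0x9, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
                   (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
MAX_CODE_POINT = 0x10FFFF  # upper end of the Unicode code-point range

def decode_encoded_char(token, require_xml_char=True):
    """Return the code point denoted by a token such as '#7e'.

    require_xml_char=False models the current rule (any Unicode
    code point); True models the proposed rule (must match Char).
    """
    m = re.fullmatch(r"#([0-9a-fA-F]+)", token)
    if m is None:
        raise ValueError(f"not an encoded character: {token!r}")
    cp = int(m.group(1), 16)
    if cp > MAX_CODE_POINT:
        raise ValueError(f"{token} is outside the Unicode code-point range")
    if require_xml_char and not any(lo <= cp <= hi for lo, hi in XML_CHAR_RANGES):
        raise ValueError(f"{token} does not denote an XML character")
    return cp

assert decode_encoded_char("#7e") == 0x7E
assert decode_encoded_char("#1", require_xml_char=False) == 0x1
```

A processor that accepts non-XML characters as an extension would
correspond to calling this with require_xml_char=False.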

What do others think?

Michael

Received on Sunday, 2 January 2022 16:47:08 UTC