Re: non-XML characters (e.g. #1) from C. M. Sperberg-McQueen on 2022-01-02 (public-ixml@w3.org from January 2022)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Sun, 2 Jan 2022 11:13:38 -0700
To: Dave Pawson <dave.pawson@gmail.com>
Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, ixml <public-ixml@w3.org>
Message-Id: <5AE7577E-2F25-4215-B7A6-ED849229DC55@blackmesatech.com>

> On 2 Jan 2022, at 12:56 AM, Dave Pawson <dave.pawson@gmail.com> wrote:
> 
> Another scope issue Michael?
> Your point about "For processors which build their data structures direct from
> the ixml form," puts a 'dent' in a simple scoping statement (input
> must be UTF-8 within XML character constraints).

At the moment, I don’t think we do have a rule requiring that input
fall within XML constraints.  If we did, I could simply update the
test case to require the result ‘this is not a conforming grammar’.
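
To make the distinction concrete, here is a minimal sketch (in Python, purely for illustration) of how a processor might classify a hex reference such as #1 against the XML 1.0 Char production. The function names is_xml10_char and classify_hex_ref are hypothetical, and nothing here is prescribed by the ixml spec; it only shows what such a rule would have to check.

```python
def is_xml10_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 'Char' production."""
    return (
        cp in (0x9, 0xA, 0xD)          # the only C0 controls XML 1.0 allows
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def classify_hex_ref(ref: str) -> str:
    """Classify an ixml-style hex reference such as '#1' or '#7e'.

    Hypothetical helper: the three categories are an assumption made
    for illustration, not categories defined by the ixml spec.
    """
    if not ref.startswith("#"):
        raise ValueError("hex reference must start with '#'")
    cp = int(ref[1:], 16)
    # Above U+10FFFF, or in the surrogate block: not a Unicode scalar value.
    if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return "not a Unicode scalar value"
    if not is_xml10_char(cp):
        return "Unicode, but not an XML 1.0 character"
    return "XML 1.0 character"


if __name__ == "__main__":
    for ref in ("#0", "#1", "#9", "#7e", "#D800", "#110000"):
        print(ref, "->", classify_hex_ref(ref))
```

Under this sketch, both ends of a range like [#1 - #7e] are Unicode characters, but only #7e can be written into the XML form of the grammar.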


>  Define it as 'user error' (you asked for it, you got it)? I don't like that.
> 
> Report it as a 'warning' with a reason? I think this would be my preference?

I’m not sure what it would mean to call this or other things user
errors.  I think we get to define rules for grammars and processors,
not users.

Michael


> 
> regards
> 
> 
> On Sat, 1 Jan 2022 at 22:29, C. M. Sperberg-McQueen
> <cmsmcq@blackmesatech.com> wrote:
>> 
>> Working through Steven’s tests (and making more corrections
>> in the expected results in tests-SP-MSM), I have run across an
>> interesting policy issue:  what should a processor do with a
>> reference in a grammar to character #1?
>> 
>> It’s not an XML 1.0 character (the only C0 control characters
>> allowed in XML 1.0 are U+0009, U+000A, and U+000D),
>> so it cannot be represented in the XML form of the grammar.
>> 
>> For processors which build their data structures direct from
>> the ixml form, and which have no trouble with character U+0001,
>> a reference to #1 need cause no trouble, unless the user asks
>> the parser to turn that grammar itself into XML.  (And even
>> then, it may only matter in some contexts.)  At which point
>> we are back to an issue raised already: what happens when the
>> combination of input plus grammar produces non-well-formed
>> output?
>> 
>> And of course at least some processors which can handle #1
>> will not be able to handle #0.
>> 
>> What happens in my processor is that when I create the XML
>> form of the grammar in test hex3, all is well and I get the XML
>> 
>> <ixml>
>>  <rule name="hex">:<alt>
>>      <literal dstring="a"/>,<inclusion>[<range from="#1" to="#7e">-</range>]</inclusion>,<literal dstring="b"/>
>>    </alt>.</rule>
>> </ixml>
>> 
>> (As you can see, I have not yet updated my internal copy of
>> the ixml grammar, so colons and semicolons and such are
>> appearing as literals.)
>> 
>> When I compile the grammar, the code naively attempts to
>> turn #1 into a character, and compilation fails.
>> 
>> If it’s a run-time error in the grammar, and the implicit claim is
>> that an error-free ixml grammar will never produce ill-formed
>> output on any input, then we have a run-time error in the
>> grammar for ixml grammars, since it does not forbid hex
>> references to non-XML (or indeed non-Unicode) characters.
>> 
>> What do people think?
>> 
>> What do we do about this?
>> 
>> Is [#1 - #7e] a legal range?
>> 
>> Michael
>> 
>> 
> 
> 
> -- 
> Dave Pawson
> XSLT XSL-FO FAQ.
> Docbook FAQ.
>

Received on Sunday, 2 January 2022 18:14:03 UTC