Re: Inaccessibility of [#80-#FF] from David Birnbaum on 2025-06-22 (public-ixml@w3.org from June 2025)

From: David Birnbaum <djbpitt@gmail.com>
Date: Sun, 22 Jun 2025 16:00:15 +0200
To: John Dziurlaj <john@turnout.rocks>
Cc: ixml <public-ixml@w3.org>
Message-ID: <CAP4v81rMASanQm2OF+o0pr+QUxVfvF29uAY37RxwaTc+WLGg3w@mail.gmail.com>

Dear John,

When I ran into that problem, specifying XML 1.1 let me match the
characters in question, and I could then suppress them and write something
acceptable into the output. See
https://github.com/djbpitt/ixml/tree/main/non-xml-characters for a toy
example. If I've understood correctly, U+0000 cannot be matched this way,
but other characters can.
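
As a rough illustration of where XML 1.0 and XML 1.1 differ (a sketch added
for clarity, not taken from the repository linked above), the Char
productions of the two specifications can be compared directly. The helper
names below are hypothetical; the ranges are those given in the two
specifications.

```python
def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.1 Char production."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Sample code points: NUL, two C0 controls, and two from the Latin-1 upper half.
for cp in (0x0, 0x1, 0xB, 0x80, 0xF6):
    print(f"U+{cp:04X}: XML 1.0 {is_xml10_char(cp)}, XML 1.1 {is_xml11_char(cp)}")
```

U+0000 is rejected by both productions, U+0001 and U+000B are accepted only
by XML 1.1, and U+0080 and U+00F6 are accepted by both.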

Cheers,

David

On Sun, Jun 22, 2025 at 3:56 PM John Dziurlaj <john@turnout.rocks> wrote:

> Suppose for some reason I am trying to parse a PDF using iXML. By
> convention, the second line of a PDF includes at least four binary
> characters, that is, characters whose codes are 128 or greater (even though
> much of the rest can be parsed as 7-bit ASCII). The first two lines of
> such a PDF are shown below (Latin-1 encoding):
>
>
>
> %PDF-1.7 (line 1)
>
> %öäüß
>
>
>
> A corresponding iXML grammar fragment intended to recognize these lines
> could be defined as follows:
>
>
>
> start: comment-line+.
>
> comment-line: "%",-char+, eol.
>
> char: [#0-#9]; [#b-#c]; [#e-#ff].
>
> -eol: [#d{carriage return};#a{linefeed}].
>
>
>
> However, testing with at least a couple of iXML processors has revealed an
> issue: when parsing a file containing the above example, the processor
> emits an error of the form:
>
>
>
> <fail xmlns:ixml='http://invisiblexml.org/NS'
> ixml:state='failed'><line>2</line><column>2</column><pos>11</pos><unexpected
> codepoint='#FFFD'>?</unexpected></fail>
>
> (NineML 3.2.9)
>
>
>
> This indicates that the parser has encountered the Unicode Replacement
> Character (U+FFFD) at the location of the second character on line 2. This
> suggests that the input stream was preprocessed as UTF-* before applying
> the iXML grammar. Consequently, characters that fall within the Latin-1
> upper half (0x80–0xFF) become inaccessible to rules that depend on the char
> definition above.
>
>
>
> For this use case, it is entirely acceptable—and arguably preferable—for
> an iXML implementation to treat the upper half of the 8-bit range
> (0x80–0xFF) as opaque binary values. It is sufficient that the
> input bytes be preserved as-is and surfaced into the resulting XML as
> Unicode code points U+0080 through U+00FF, respectively.
>
>
>
> <comment-line>%&#246;&#228;&#252;&#223;</comment-line>
>
>
>
> Regards,
>
>
>
> John Dziurłaj /d͡ʑurwaj/
>
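
The U+FFFD in the quoted error is what one would expect if Latin-1 bytes were
decoded as UTF-8. The following is a minimal Python sketch of that effect and
of the byte-to-code-point mapping John asks for. The variable names are
illustrative, and nothing here is meant to describe how any particular iXML
processor actually reads its input.

```python
# The two example lines as raw bytes, with line 2 in Latin-1 (ö ä ü ß).
raw = b"%PDF-1.7\n%\xf6\xe4\xfc\xdf\n"

# Decoding as UTF-8: the high bytes are not valid UTF-8 sequences here,
# so each is replaced with U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")

# Decoding as Latin-1: every byte 0xNN maps to code point U+00NN.
as_latin1 = raw.decode("latin-1")

line2_utf8 = as_utf8.split("\n")[1]
line2_latin1 = as_latin1.split("\n")[1]

print([f"U+{ord(c):04X}" for c in line2_utf8])
print([f"U+{ord(c):04X}" for c in line2_latin1])

# Serialize line 2 with numeric character references for non-ASCII characters.
escaped = "".join(c if ord(c) < 0x80 else f"&#{ord(c)};" for c in line2_latin1)
print(f"<comment-line>{escaped}</comment-line>")
```

With Latin-1 decoding, the last line printed matches the desired output shown
in the quoted message.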

Received on Sunday, 22 June 2025 14:00:32 UTC