- From: David Birnbaum <djbpitt@gmail.com>
- Date: Sun, 22 Jun 2025 16:00:15 +0200
- To: John Dziurlaj <john@turnout.rocks>
- Cc: ixml <public-ixml@w3.org>
- Message-ID: <CAP4v81rMASanQm2OF+o0pr+QUxVfvF29uAY37RxwaTc+WLGg3w@mail.gmail.com>
Dear John,

When I ran into that problem, specifying XML 1.1 let me match the characters in question, and I could then suppress them and write something acceptable into the output. See https://github.com/djbpitt/ixml/tree/main/non-xml-characters for a toy example. If I've understood correctly, U+0000 cannot be matched this way, but other characters can.

Cheers,

David

On Sun, Jun 22, 2025 at 3:56 PM John Dziurlaj <john@turnout.rocks> wrote:

> Suppose for some reason I am trying to parse a PDF using iXML. By
> convention, the second line of a PDF includes at least four binary
> characters, that is, characters whose codes are 128 or greater (even though
> much of the rest can be parsed as 7-bit ASCII). The following two lines of
> a PDF are given below (Latin-1 encoding)
>
> %PDF-1.7 (line 1)
> %öäüß
>
> A corresponding iXML grammar fragment intended to recognize these lines
> could be defined as follows:
>
> start: comment-line+.
> comment-line: "%",-char+, eol.
> char: [#0-#9]; [#b-#c]; [#e-#ff].
> -eol: [#d{carriage return};#a{linefeed}].
>
> However, testing with at least a couple of iXML processors has revealed an
> issue: when parsing a file containing the above example, the processor
> emits an error of the form:
>
> <fail xmlns:ixml='http://invisiblexml.org/NS' ixml:state='failed'><line>2</line><column>2</column><pos>11</pos><unexpected codepoint='#FFFD'>?</unexpected></fail>
> (NineML 3.2.9)
>
> This indicates that the parser has encountered the Unicode Replacement
> Character (U+FFFD) at the location of the second character on line 2. This
> suggests that the input stream was preprocessed as UTF-* before applying
> the iXML grammar. Consequently, characters that fall within the Latin-1
> upper half (0x80–0xFF) become inaccessible to rules that depend on the char
> definition above.
>
> For this use case, it is entirely acceptable, and arguably preferable, for
> an iXML implementation to treat the upper half of the 8-bit range
> (0x80–0xFF) as opaque binary values. Instead, it is sufficient that the
> input bytes be preserved as-is and surfaced into the resulting XML as
> Unicode code points U+0080 through U+00FF, respectively.
>
> <comment-line>%öäüß</comment-line>
>
> Regards,
>
> John Dziurłaj /d͡ʑurwaj/
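The decoding behaviour described in the quoted message can be reproduced outside any iXML processor. The following is a minimal Python sketch, not a description of how NineML or any other processor reads its input: it only shows that Latin-1 bytes at or above 0x80 turn into U+FFFD when decoded as UTF-8 with replacement, and that decoding them as Latin-1 maps each byte to the code point of the same value. The helper name latin1_to_utf8 is hypothetical, and transcoding the input before parsing is an assumption about one possible workaround, not something tested here.

```python
# Second line of the PDF example, as Latin-1 bytes: b'%\xf6\xe4\xfc\xdf'
raw = "%öäüß".encode("latin-1")

# Decoding those bytes as UTF-8 fails on every byte >= 0x80; with
# errors="replace" each failure is surfaced as U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")

# Decoding as Latin-1 maps each byte 0xNN to code point U+00NN.
as_latin1 = raw.decode("latin-1")


def latin1_to_utf8(data: bytes) -> bytes:
    """Hypothetical helper: transcode Latin-1 bytes to UTF-8 so that a
    UTF-8-reading tool sees U+0080..U+00FF instead of U+FFFD."""
    return data.decode("latin-1").encode("utf-8")


if __name__ == "__main__":
    print([f"U+{ord(c):04X}" for c in as_utf8])
    print([f"U+{ord(c):04X}" for c in as_latin1])
    print(latin1_to_utf8(raw))
```

With input transcoded this way, a character set such as [#e-#ff] in the grammar would at least have the intended code points to match against.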
Received on Sunday, 22 June 2025 14:00:32 UTC