- From: John Dziurlaj <john@turnout.rocks>
- Date: Sun, 22 Jun 2025 13:56:05 +0000
- To: ixml <public-ixml@w3.org>
- Message-ID: <DS7PR20MB39991F1CDA7C6F7ADA49F32FC27EA@DS7PR20MB3999.namprd20.prod.outlook.com>
Suppose for some reason I am trying to parse a PDF using iXML. By convention, the second line of a PDF includes at least four binary characters, that is, characters whose codes are 128 or greater (even though much of the rest can be parsed as 7-bit ASCII).

The first two lines of a PDF are given below (Latin-1 encoding):

    %PDF-1.7    (line 1)
    %öäüß       (line 2)

A corresponding iXML grammar fragment intended to recognize these lines could be defined as follows:

    start: comment-line+.
    comment-line: "%", -char+, eol.
    char: [#0-#9]; [#b-#c]; [#e-#ff].
    -eol: [#d{carriage return}; #a{linefeed}].

However, testing with at least a couple of iXML processors has revealed an issue: when parsing a file containing the above example, the processor emits an error of the form:

    <fail xmlns:ixml='http://invisiblexml.org/NS' ixml:state='failed'><line>2</line><column>2</column><pos>11</pos><unexpected codepoint='#FFFD'>?</unexpected></fail>

(NineML 3.2.9)

This indicates that the parser encountered the Unicode Replacement Character (U+FFFD) at the second character of line 2, which suggests that the input stream was decoded as UTF-* before the iXML grammar was applied. The bytes öäüß are not a valid UTF-8 sequence, so the decoder replaces them. Consequently, characters in the Latin-1 upper half (0x80–0xFF) become inaccessible to rules that depend on the char definition above.

For this use case, it is entirely acceptable, and arguably preferable, for an iXML implementation to treat the upper half of the 8-bit range (0x80–0xFF) as opaque binary values. The processor does not need to interpret them as encoded text; it is sufficient that the input bytes be preserved as-is and surfaced into the resulting XML as Unicode code points U+0080 through U+00FF, respectively:

    <comment-line>%öäüß</comment-line>

Regards,
John Dziurłaj /d͡ʑurwaj/
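
P.S. To illustrate the decoding behaviour I suspect, here is a minimal Python sketch. It is not taken from any iXML processor; the names RAW, decode_as_utf8, decode_as_latin1, and first_replacement_pos are made up for this example, and it assumes the processor decodes its input as UTF-8 and substitutes U+FFFD for invalid bytes.

```python
# The two example lines as raw bytes, Latin-1 encoded
# (0xF6 0xE4 0xFC 0xDF are the bytes for "öäüß").
RAW = b"%PDF-1.7\n%\xf6\xe4\xfc\xdf\n"


def decode_as_utf8(data: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for bytes that are not valid UTF-8."""
    return data.decode("utf-8", errors="replace")


def decode_as_latin1(data: bytes) -> str:
    """Decode as Latin-1: byte 0xNN always becomes code point U+00NN."""
    return data.decode("latin-1")


def first_replacement_pos(text: str) -> int:
    """Return the 1-based position of the first U+FFFD, or 0 if there is none."""
    return text.find("\ufffd") + 1


if __name__ == "__main__":
    # Assumed processor behaviour: the upper-half bytes are lost.
    print(first_replacement_pos(decode_as_utf8(RAW)))  # prints 11

    # Desired behaviour: the bytes survive as U+00F6, U+00E4, U+00FC, U+00DF.
    print(decode_as_latin1(RAW).splitlines()[1])  # prints %öäüß

    # Possible workaround (an assumption, not a processor feature):
    # transcode the file to UTF-8 before handing it to the processor.
    utf8_input = decode_as_latin1(RAW).encode("utf-8")
```

The position 11 is consistent with the pos value in the error above. The workaround on the last line changes the byte length of the input, so it only helps when the grammar cares about characters rather than byte offsets.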
Received on Sunday, 22 June 2025 13:56:11 UTC