RE: Inaccessibility of [#80-#FF] from John Dziurlaj on 2025-06-24 (public-ixml@w3.org from June 2025)

From: John Dziurlaj <john@turnout.rocks>
Date: Tue, 24 Jun 2025 10:48:27 +0000
To: Steven Pemberton <steven.pemberton@cwi.nl>, "public-ixml@w3.org" <public-ixml@w3.org>
Message-ID: <DS7PR20MB3999FBC3830A48587994BB5CC278A@DS7PR20MB3999.namprd20.prod.outlook.com>


comment-line: "%", -char+, eol.
char: ~[#a; #d].
eol: [#a; #d].

This works only because the Unicode replacement character is now included in the iXML character class. As a result, the actual byte content of the original comment-line is lost; the XML output does not preserve the original characters, only their substituted form.

<?xml version="1.0" encoding="utf-8"?><start><comment-line>%PDF-1.7<eol>
</eol></comment-line><comment-line>%����<eol>
</eol></comment-line></start>
(Markup Blitz 1.8)
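
The loss can be illustrated with a minimal Python sketch. It assumes the input is decoded as UTF-8 with invalid sequences replaced by U+FFFD, which is consistent with the output above but is not confirmed to be what Markup Blitz does internally. The byte values and the helper name decode_lossy are hypothetical, chosen only for illustration.

```python
# Hypothetical inputs: two different binary comment lines.
# The byte values are illustrative assumptions, not taken from the actual file.
line_a = b"%\xe2\xe3\xcf\xd3\n"
line_b = b"%\x80\x81\x82\x83\n"


def decode_lossy(data: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for invalid byte sequences."""
    return data.decode("utf-8", errors="replace")


text_a = decode_lossy(line_a)
text_b = decode_lossy(line_b)

# Both lines collapse to the same text, so the original bytes
# cannot be recovered from the decoded form.
print(text_a == text_b)
print([hex(ord(ch)) for ch in text_a])
```

Under that assumption, two distinct byte sequences become indistinguishable once decoded, which is the information loss described above.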

Bytes in the range 128-255 can also appear in ordinary comments, and they would need to be mapped to XML somehow.

E.g.:

% Author:  Leandra Yésica

In such cases, I would expect those bytes to be surfaced as valid Unicode code points in the corresponding XML representation (e.g., &#233; for U+00E9, the character 'é'), rather than being silently replaced.
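
One possible mapping, sketched in Python under the assumption that the input is read as ISO-8859-1 (Latin-1), where byte value N corresponds exactly to code point U+00NN. The function name bytes_to_xml_text is hypothetical, and this is only a sketch of the expected behaviour, not a description of any existing iXML processor. Control bytes below 0x20 that XML 1.0 disallows are not handled here.

```python
from xml.sax.saxutils import escape


def bytes_to_xml_text(data: bytes) -> str:
    """Map each byte to the code point of the same value and emit
    non-ASCII characters as numeric character references.

    Assumes Latin-1 input; that is an assumption of this sketch.
    """
    text = data.decode("latin-1")  # never fails: every byte has a code point
    text = escape(text)            # escape &, < and > for XML text content
    return "".join(ch if ord(ch) < 0x80 else f"&#{ord(ch)};" for ch in text)


comment = b"% Author:  Leandra Y\xe9sica"
print(bytes_to_xml_text(comment))

# The Latin-1 mapping is reversible, so the original bytes survive.
assert comment.decode("latin-1").encode("latin-1") == comment
```

Because the Latin-1 decoding is one-to-one for all 256 byte values, the original bytes can be recovered from the XML text, unlike the replacement-character output.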

John

Received on Tuesday, 24 June 2025 10:48:36 UTC