Re: XML documents with DTDs in zip files from Norm Tovey-Walsh on 2025-12-04 (xproc-dev@w3.org from December 2025)

From: Norm Tovey-Walsh <ndw@nwalsh.com>
Date: Thu, 04 Dec 2025 16:50:36 +0000
To: Wendell Piez <wapiez@wendellpiez.com>
Cc: XProc Dev <xproc-dev@w3.org>
Message-ID: <m25xamjin7.fsf@nwalsh.com>

Wendell Piez <wapiez@wendellpiez.com> writes:
> This happens because the parser is obliged to attempt to parse XML files it finds in the archive, and produce errors for files that are not found to be well-formed?

Yes.

> Is there a way to use p:unarchive without this behavior? (@override-content-types?)

Yes, you can provide an extension-to-content-type mapping, so you could load a .xml file as text/plain if you wanted to.
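
For illustration, a minimal, untested sketch of such a mapping on p:unarchive. The file name archive.zip is a placeholder, and the sketch assumes the option takes an array of [regular expression, content type] pairs as described in the XProc 3.0 step library:

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:output port="result" sequence="true"/>

  <!-- archive.zip is a hypothetical input, not a file from this thread -->
  <p:unarchive>
    <p:with-input href="archive.zip"/>
    <!-- Treat every entry whose name ends in .xml as plain text,
         so nothing in the archive is run through the XML parser -->
    <p:with-option name="override-content-types"
                   select="[ ['\.xml$', 'text/plain'] ]"/>
  </p:unarchive>
</p:declare-step>
```

With a mapping like this, the .xml entries should come out as text documents, well-formed or not.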

> Are any other steps implicated besides p:unarchive, and is this the same for them?

Any step that loads XML has the same parsing obligations. That’s at least p:load, p:document, p:unarchive, p:uncompress, p:http-request[*], p:cast-content-type, p:invisible-xml, and p:os-exec (with some options). (I just glanced through the list of steps; I might have missed some.)

[*] The p:http-request case is especially interesting because the obligations are on XProc if-and-only-if the server-asserted content type is an XML content type. That said, there’s override-content-type on p:http-request, so you can work around a broken server if you need to.
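
A hedged, untested sketch of that workaround. The URL is a placeholder, and the sketch assumes override-content-type is passed as an entry in the step's parameters map:

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:output port="result" sequence="true"/>

  <!-- https://example.com/broken.xml is a hypothetical resource -->
  <p:http-request href="https://example.com/broken.xml">
    <!-- Ignore the content type the server asserts and load
         the response body as plain text instead -->
    <p:with-option name="parameters"
                   select="map{'override-content-type': 'text/plain'}"/>
  </p:http-request>
</p:declare-step>
```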

> As a user, if I try unarchive on a file I don't actually want it to blow up when it can't parse the XML. I actually want to be able to pass around broken XML (even misnamed with an 'xml' suffix) if that's what I have. (Or to be more precise: when someone does this to me, I would rather be able to see the broken file, unzipped, before complaining.)

You can do that with override-content-types.

> In other words I'm not sure this is about finding DTDs as much as about the behavior of p:unarchive applied to zips with problematic contents.

I see your point, but I think it’s slightly different. Parsing the well-formed XML, ignoring the external subset, is much, much more useful most of the time, I would guess.

You might argue, of course, that it’s wrong. And if the document *contains* entity references, it is wrong. So maybe “do nothing” is the right answer.

Part of this is also tricky because you can control some aspects of the parser on the parser itself. If you’ve set up the XMLReader that you’re asking me to use, I get what the reader gives me.

> But a step that could read a zip file and report such errors on purported-XML contents might be quite useful! Now I am going to think about that.

Off the top of my head, load them all as text files so that you can get them out of the archive in one piece, then do a for-each loop over them and use cast-content-type inside a try/catch to filter the wheat from the chaff.
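
That suggestion might be sketched roughly as follows. This is untested and makes a few assumptions: archive.zip is a placeholder, the override regex is only an example, and entries that fail to parse are simply dropped rather than reported:

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:output port="result" sequence="true"/>

  <!-- Step 1: get everything out of the archive as text -->
  <p:unarchive>
    <p:with-input href="archive.zip"/>
    <p:with-option name="override-content-types"
                   select="[ ['\.xml$', 'text/plain'] ]"/>
  </p:unarchive>

  <!-- Step 2: try to parse each document individually -->
  <p:for-each>
    <p:try>
      <p:output port="result" sequence="true"/>
      <!-- Casting text to an XML content type parses it;
           a document that is not well-formed raises an error here -->
      <p:cast-content-type content-type="application/xml"/>
      <p:catch>
        <p:output port="result" sequence="true"/>
        <!-- Chaff: emit nothing for documents that failed to parse -->
        <p:identity>
          <p:with-input><p:empty/></p:with-input>
        </p:identity>
      </p:catch>
    </p:try>
  </p:for-each>
</p:declare-step>
```

To report the failures instead of discarding them, the p:catch branch could emit a small report document in place of the empty sequence. Entries that are not XML at all will also land in the catch branch unless they are filtered out first, for example with include-filter on p:unarchive.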

                                        Be seeing you,
                                          norm

--
Norm Tovey-Walsh <ndw@nwalsh.com>
https://norm.tovey-walsh.com/

> I think it's much more interesting to live not knowing than have
> answers which might be wrong.--Richard Feynman

Received on Thursday, 4 December 2025 16:50:43 UTC