- From: Erik Siegel <erik@xatapult.nl>
- Date: Wed, 9 Oct 2024 08:12:06 +0200
- To: "'Matthieu RICAUD-DUSSARGET'" <m.ricaud-dussarget@lefebvre-dalloz.fr>, <list.mu@c-moria.com>, <xproc-dev@w3.org>
- Message-ID: <002201db1a12$2b076500$81162f00$@xatapult.nl>
Hi Matthieu,

For what it's worth: I have published open-source, ready-to-run XProc 3 code that converts .xlsx and .docx files into something more manageable: https://xoffice.xtpxlib.org/

Maybe that helps (even if only as an example).

Erik Siegel

From: Matthieu RICAUD-DUSSARGET <m.ricaud-dussarget@lefebvre-dalloz.fr>
Sent: Tuesday, 8 October 2024 23:00
To: list.mu@c-moria.com; xproc-dev@w3.org
Subject: RE: Extract XML from docx file with xproc 3.0 p:unarchive

Hi all,

Thanks for your responses!

Christophe: yes, the only docx file I have in the directory for the moment is valid: I can open it with 7-Zip and get the XML document inside. It is actually a .doc that I converted with MS Word ("Save as docx").

Geert, thanks for all the details. Yes, I might be interested in your script (though I have about 4 million .docx files to convert!). I'll also have a look at Aspose.

About my pipeline: thanks to your help I found the problem, which was so dumb! I forgot the $ before the docx.uri variable reference in <p:load href="{docx.uri}" … />. The href was therefore empty, so I guess the fallback is to take the current XProc file as the default, which raises err:XC0085 "Cannot process document with media-type 'application/xproc+xml' as a ZIP archive".

Then, when I added explicit bindings to understand what was going on and specified (forced) the content-type, I finally got err:XC0085 "Error processing ZIP archive: zip END header not found", which confused me even more!

Sorry for that, guys!
As a reminder for later, here is my really simple XPL that works (it displays the XML inside the docx):

    <p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                    xmlns:c="http://www.w3.org/ns/xproc-step"
                    xmlns:xs="http://www.w3.org/2001/XMLSchema"
                    version="3.0">

      <p:input port="source" sequence="true"/>
      <p:output port="result" sequence="true"/>

      <!-- Directory to scan, resolved relative to this pipeline file -->
      <p:option name="input-dir"
                select="resolve-uri('../../test/input-word', static-base-uri())"
                as="xs:string"/>

      <!-- List only the .docx files in that directory -->
      <p:directory-list path="{$input-dir}" include-filter="\.docx$"/>

      <p:for-each>
        <p:with-input select="//c:file"/>
        <p:variable name="docx.uri" select="c:file/base-uri(.)"/>
        <p:identity message="Processing {$docx.uri}"/>
        <!-- Note the $ in the attribute value template -->
        <p:load href="{$docx.uri}"/>
        <!-- Keep only the main document part of the archive -->
        <p:unarchive format="zip" include-filter="word/document\.xml"/>
      </p:for-each>

    </p:declare-step>

Cheers,
Matthieu Ricaud
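
For readers who hit the same error, the one-character fix described in the message can be shown side by side. In an attribute value template the braces hold an XPath expression, so without the `$` the text `docx.uri` is read as a name test for a child element and not as a variable reference. This is a minimal sketch that reuses the variable name from the pipeline above and assumes the `p` prefix is bound to the XProc namespace. That the empty href then resolves to the pipeline file itself is the author's guess, consistent with the reported media type, and has not been verified here.

```xml
<!-- Broken: {docx.uri} is an XPath name test for a child element called
     "docx.uri". It selects nothing here, so href becomes the empty string. -->
<p:load href="{docx.uri}"/>

<!-- Fixed: {$docx.uri} references the variable declared with
     <p:variable name="docx.uri" .../>, so href holds the .docx URI. -->
<p:load href="{$docx.uri}"/>
```

Both forms are syntactically valid, which is why a processor may report nothing about the first one until a later step such as p:unarchive fails on the wrong document.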
Received on Wednesday, 9 October 2024 06:12:15 UTC