- From: Piez, Wendell A. (Fed) <wendell.piez@nist.gov>
- Date: Tue, 8 Oct 2024 21:55:51 +0000
- To: XProc Dev <xproc-dev@w3.org>
- Message-ID: <SA9PR09MB58241335DD7586DFFE34B510FF7E2@SA9PR09MB5824.namprd09.prod.outlook.com>
Bah!
From: Piez, Wendell A. (Fed)
Sent: Tuesday, October 8, 2024 5:55 PM
To: Matthieu RICAUD-DUSSARGET <m.ricaud-dussarget@lefebvre-dalloz.fr>
Subject: RE: Extract XML from docx file with xproc 3.0 p:unarchive
Matthieu -- oops!
I wrote also to suggest try/catch for you, but it appears the email went only to Geert. (Sorry Geert.)
Probably not the last time - and of course it wouldn't have solved this problem, only helped to mitigate similar problems caused by actual errors in inputs, not errors in the code.
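A minimal sketch of the try/catch idea, modelled on the pipeline quoted further down in this thread. The <failed> element and the message text are hypothetical markers chosen for illustration, not part of any XProc vocabulary. It would not have fixed the missing "$"; it only keeps one unreadable input from stopping the whole run.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                version="3.0">
  <p:output port="result" sequence="true"/>
  <p:option name="input-dir" as="xs:string"
            select="resolve-uri('../../test/input-word', static-base-uri())"/>

  <p:directory-list path="{$input-dir}" include-filter="\.docx$"/>
  <p:for-each>
    <p:with-input select="//c:file"/>
    <p:variable name="docx.uri" select="c:file/base-uri(.)"/>
    <p:try>
      <!-- Normal path: load the archive and extract the main document part -->
      <p:load href="{$docx.uri}"/>
      <p:unarchive format="zip" include-filter="word/document\.xml"/>
      <p:catch>
        <!-- Any dynamic error above ends up here; emit a placeholder
             (hypothetical element) so the loop continues with the next file -->
        <p:identity message="Could not unzip {$docx.uri}">
          <p:with-input>
            <failed href="{$docx.uri}"/>
          </p:with-input>
        </p:identity>
      </p:catch>
    </p:try>
  </p:for-each>
</p:declare-step>
```

With this shape a damaged .docx yields a <failed> placeholder on the result port instead of aborting the loop, so the failures can be collected and inspected afterwards.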
For that matter I have also been bitten by the fallback to read the XProc at path "".
One thing that occurred to me would be to prefix any href with a relative path:
<p:load href="./{expr}"/>
And this does error out (in Morgana) if 'expr' evaluates to the empty string.
But part of me says this is bad form (it feels strange and awkward), and I should just rely on runtime messaging to expose the values for debugging.
Comments?
Regards, Wendell
From: Matthieu RICAUD-DUSSARGET <m.ricaud-dussarget@lefebvre-dalloz.fr>
Sent: Tuesday, October 8, 2024 5:00 PM
To: list.mu@c-moria.com; xproc-dev@w3.org
Subject: RE: Extract XML from docx file with xproc 3.0 p:unarchive
Hi all,
Thanks for your responses!
Christophe: yes, the only docx file I have in the directory for the moment is valid: I can open it with 7-zip and get the XML document inside.
It's actually a .doc which I converted with my MS Word ("Save as" docx).
Geert, thanks for all the details. Yes, I might be interested in your script (though I have about 4 million .docx files to convert!)
I'll also have a look at Aspose.
About my pipeline: thanks to your help I found the problem, which was so dumb!
I forgot a $ before the docx.uri variable reference in <p:load href="{docx.uri}" ... />
=> The href was empty, so I guess the fallback is to load the current XProc file by default, which raises err:XC0085 "Cannot process document with media-type 'application/xproc+xml' as a ZIP archive"
Then, when I added an explicit binding to understand what was happening and specified (forced) the content-type ...
I finally got err:XC0085 "Error processing ZIP archive: zip END header not found", which confused me even more!
Sorry for that, guys!
As a reminder for later, here is my really simple XPL that works (it displays the XML inside the docx):
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                version="3.0">
  <p:input port="source" sequence="true"/>
  <p:output port="result" sequence="true"/>
  <p:option name="input-dir" select="resolve-uri('../../test/input-word', static-base-uri())" as="xs:string"/>
  <p:directory-list path="{$input-dir}" include-filter="\.docx$"/>
  <p:for-each>
    <p:with-input select="//c:file"/>
    <p:variable name="docx.uri" select="c:file/base-uri(.)"/>
    <p:identity message="Processing {$docx.uri}"/>
    <p:load href="{$docx.uri}"/>
    <p:unarchive format="zip" include-filter="word/document\.xml"/>
  </p:for-each>
</p:declare-step>
Cheers,
Matthieu Ricaud
Received on Tuesday, 8 October 2024 21:55:58 UTC