RE: Extract XML from docx file with xproc 3.0 p:unarchive from Matthieu RICAUD-DUSSARGET on 2024-10-08 (xproc-dev@w3.org from October 2024)

From: Matthieu RICAUD-DUSSARGET <m.ricaud-dussarget@lefebvre-dalloz.fr>
Date: Tue, 8 Oct 2024 20:59:37 +0000
To: "list.mu@c-moria.com" <list.mu@c-moria.com>, "xproc-dev@w3.org" <xproc-dev@w3.org>
Message-ID: <PR3PR03MB6475EB8AAFB44543FAE916CEE67E2@PR3PR03MB6475.eurprd03.prod.outlook.com>

Hi all,

Thanks for your responses!

Christophe: yes, the only docx file I currently have in the directory is valid: I can open it with 7-Zip and get the XML document inside.
It is actually a .doc that I converted with MS Word ("Save as docx").

Geert, thanks for all the details. Yes, I might be interested in your script (though I have about 4 million .docx files to convert!).
I'll also have a look at Aspose.

As for my pipeline: thanks to your help I found the problem, which was .. so dumb!
I forgot the $ before the docx.uri variable reference in <p:load href="{docx.uri}" … />
=> The href was empty, so I guess the fallback is to load the current XProc file by default, which raises err:XC0085 "Cannot process document with media-type 'application/xproc+xml' as a ZIP archive"
Then I added explicit bindings to understand what was going on, and I specified (forced) the content-type …
I finally got err:XC0085 "Error processing ZIP archive: zip END header not found", which confused me even more!
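
To make the mistake concrete, here is a minimal sketch (not taken from the real pipeline) that prints what each form of the attribute value template evaluates to. The URI file:///tmp/sample.docx and the <doc/> input are placeholders invented for the illustration. Without the $, docx.uri is read as an XPath name test for a child element called "docx.uri"; nothing matches, so the expression should yield an empty string, and an empty href then resolves against the base URI of the pipeline itself.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">

  <p:output port="result"/>

  <!-- Placeholder value, only for this illustration -->
  <p:variable name="docx.uri" select="'file:///tmp/sample.docx'"/>

  <!-- Dummy document, so that the next step has a context item -->
  <p:identity>
    <p:with-input><doc/></p:with-input>
  </p:identity>

  <!-- {docx.uri}  : path expression (child element named docx.uri), matches nothing
       {$docx.uri} : reference to the variable declared above -->
  <p:identity message="without $: '{docx.uri}' / with $: '{$docx.uri}'"/>

</p:declare-step>
```

The exact way the message is displayed depends on the XProc processor, but the first value should come out empty and the second should be the placeholder URI.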

Sorry about that, guys!

As a reminder for later, here is my really simple xpl that works (it displays the XML inside the docx):

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
  xmlns:c="http://www.w3.org/ns/xproc-step"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  version="3.0">

  <p:input port="source" sequence="true"/>
  <p:output port="result" sequence="true"/>

  <p:option name="input-dir" select="resolve-uri('../../test/input-word', static-base-uri())" as="xs:string"/>

  <p:directory-list path="{$input-dir}" include-filter="\.docx$"/>

  <p:for-each>
    <p:with-input select="//c:file"/>
    <p:variable name="docx.uri" select="c:file/base-uri(.)"/>
    <p:identity message="Processing {$docx.uri}"/>
    <p:load href="{$docx.uri}"/>
    <p:unarchive format="zip" include-filter="word/document\.xml"/>
  </p:for-each>

</p:declare-step>
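
When it is unclear whether p:unarchive is really receiving the docx, listing the archive entries first can help. The following is only a sketch under assumptions: the docx-uri option is a hypothetical name introduced for the example, and the content type is stated explicitly on p:load so that the document is treated as a ZIP. It uses the standard p:archive-manifest step, which returns a c:archive document with one c:entry per file in the archive.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  version="3.0">

  <p:output port="result"/>

  <!-- Hypothetical option: URI of a single docx file to inspect -->
  <p:option name="docx-uri" required="true" as="xs:string"/>

  <!-- Load the docx as a binary ZIP document -->
  <p:load href="{$docx-uri}" content-type="application/zip"/>

  <!-- List the entries of the ZIP instead of extracting them -->
  <p:archive-manifest format="zip"/>

</p:declare-step>
```

If word/document.xml shows up among the c:entry elements, the include-filter used with p:unarchive should match it.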

Cheers,
Matthieu Ricaud

Received on Tuesday, 8 October 2024 20:59:44 UTC