RE: Extract XML from docx file with xproc 3.0 p:unarchive from list.mu@c-moria.com on 2024-10-08 (xproc-dev@w3.org from October 2024)

From: <list.mu@c-moria.com>
Date: Tue, 8 Oct 2024 20:53:48 +0200
To: "'Matthieu RICAUD-DUSSARGET'" <m.ricaud-dussarget@lefebvre-dalloz.fr>, <xproc-dev@w3.org>
Message-ID: <014601db19b3$69c2b510$3d481f30$@c-moria.com>

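You don't need to connect the ports explicitly in XProc 3: by default, a
step's primary input is connected to the primary output of the preceding step.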

So your example could be as simple as:

<p:variable name="docx.uri" select="ancestor::c:directory/base-uri(.) || c:file/base-uri(.)"/>
<p:load href="{$docx.uri}"/>
<p:unarchive/>

 

Though, if all you need is the document.xml from the docx package, you could
name it in an include filter (the filter is a regular expression, hence the
escaped dot), so that only that one document comes out of the archive and
goes into the next step:

 

<p:variable name="docx.uri" select="ancestor::c:directory/base-uri(.) || c:file/base-uri(.)"/>
<p:load href="{$docx.uri}"/>
<p:unarchive format="zip">
    <p:with-option name="include-filter" select="('word/document\.xml')"/>
</p:unarchive>
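
For context, the fragment above could sit in a complete pipeline along the following lines. This is only an untested sketch under a few assumptions: the docx.uri option stands in for the variable computed from the directory listing, docx2xml.xsl is a hypothetical name for the existing stylesheet, and the anchored form of the filter regex is an optional tightening.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">

  <!-- Assumption: the URI of one .docx file is passed in as an option -->
  <p:option name="docx.uri" required="true"/>

  <p:output port="result"/>

  <!-- Load the .docx package; it is an ordinary ZIP archive -->
  <p:load href="{$docx.uri}"/>

  <!-- Keep only word/document.xml; the primary input is connected
       implicitly to the result of p:load -->
  <p:unarchive format="zip">
    <p:with-option name="include-filter"
                   select="('^word/document\.xml$')"/>
  </p:unarchive>

  <!-- Hypothetical stylesheet name; replace it with the real XSLT -->
  <p:xslt>
    <p:with-input port="stylesheet" href="docx2xml.xsl"/>
  </p:xslt>

</p:declare-step>
```

To process a whole directory, the three steps could go inside a p:for-each over the c:file elements, with the variable from the fragment above taking the place of the option.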

 

Good luck

 

Geert

 

 

From: Matthieu RICAUD-DUSSARGET <m.ricaud-dussarget@lefebvre-dalloz.fr> 
Sent: Tuesday, 8 October 2024 19:31
To: xproc-dev@w3.org
Subject: Extract XML from docx file with xproc 3.0 p:unarchive

 

Hi, 

 

I have to convert a large number of docx files into a specific XML format.

I wrote the XSLT that converts myFile.docx!/word/document.xml after
extracting it manually.

 

I'd like to use XProc to loop over a full directory of docx files, extract
each document.xml, apply the XSLT and validate the result.

 

After looping over each file of the directory, I do:

 

<p:variable name="docx.uri" select="ancestor::c:directory/base-uri(.) || c:file/base-uri(.)"/>
<p:load href="{docx.uri}" name="load"
        content-type="application/vnd.openxmlformats-officedocument.wordprocessingml.document"/>
<p:unarchive>
    <p:with-input>
      <p:pipe step="load" port="result"/>
    </p:with-input>
</p:unarchive>

 

At this point (p:unarchive) I get an XC0085 error: Error processing ZIP
archive: zip END header not found

I tried different content types, like application/zip, but I still get the
same error.

 

Does that mean it's not possible to extract a .docx archive just like a zip
archive?

I was confident XProc could do that.

Or did I miss something here?

 

I'm using MorganaXProc-III 1.2.3

 

By the way, most of the files I have are .doc, not .docx, so if there is a
way to extract from docx, I'll first have to convert them to docx (there
seems to be a Python script for that; I guess I can't do it from XProc?)

 

Thanks in advance for your help,

Cheers,

 

Matthieu Ricaud-Dussarget

Received on Tuesday, 8 October 2024 18:53:56 UTC