Re: Extract XML from docx file with xproc 3.0 p:unarchive from Christophe Marchand on 2024-10-08 (xproc-dev@w3.org from October 2024)

From: Christophe Marchand <cmarchand@clever-age.com>
Date: Tue, 8 Oct 2024 19:59:09 +0200
To: xproc-dev@w3.org
Message-ID: <99f7dc93-d140-44dc-877d-d0e306da3b02@clever-age.com>

I suppose your .docx file is a valid zip file that you can extract 
with the unzip command?
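
If unzip is not at hand, the same sanity check can be sketched in a few 
lines of Python, outside XProc. This is only a rough sketch using the 
standard zipfile module: the function name check_docx and the "input" 
directory are hypothetical, and it only tells you whether the file is a 
readable ZIP archive that contains word/document.xml, not why 
p:unarchive fails.

```python
import zipfile
from pathlib import Path


def check_docx(path):
    """Return (ok, message) saying whether `path` looks like a usable .docx.

    Hypothetical helper for illustration; not part of any XProc tooling.
    """
    path = Path(path)
    # A .docx is expected to be a ZIP archive; this looks for the ZIP end record.
    if not zipfile.is_zipfile(path):
        return False, f"{path.name}: not a ZIP archive"
    with zipfile.ZipFile(path) as archive:
        # testzip() reads every member and returns the first corrupt one, if any.
        bad_member = archive.testzip()
        if bad_member is not None:
            return False, f"{path.name}: corrupt member {bad_member}"
        # The main document part that the XSLT is meant to transform.
        if "word/document.xml" not in archive.namelist():
            return False, f"{path.name}: no word/document.xml"
    return True, f"{path.name}: OK"


if __name__ == "__main__":
    # "input" is an assumed directory name; point it at the real docx folder.
    for docx in sorted(Path("input").glob("*.docx")):
        print(check_docx(docx)[1])
```

If every file is reported as OK here, the archives themselves are 
probably fine and the problem is more likely in what the pipeline 
actually hands to p:unarchive.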

Christophe

On 08/10/2024 at 19:30, Matthieu RICAUD-DUSSARGET wrote:
>
> Hi,
>
> I have to convert a large number of docx files into a specific XML format.
>
> I wrote the XSLT that converts myFile.docx!/word/document.xml after 
> extracting it manually.
>
> I’d like to use XProc to loop over a full directory of docx files, 
> extract each document.xml, apply the XSLT and validate the result.
>
> After looping over each file in the directory, I do:
>
> <p:variable name="docx.uri" select="ancestor::c:directory/base-uri(.) || c:file/base-uri(.)"/>
>
> <p:load href="{docx.uri}" name="load" content-type="application/vnd.openxmlformats-officedocument.wordprocessingml.document"/>
>
> <p:unarchive>
>
>     <p:with-input>
>
>       <p:pipe step="load" port="result"/>
>
>     </p:with-input>
>
> </p:unarchive>
>
> At this point (p:unarchive) I get an XC0085 error: Error processing 
> ZIP archive: zip END header not found
>
> I tried other content types, such as application/zip, but I still get 
> the same error.
>
> Does that mean it’s not possible to extract a .docx archive just like a 
> zip archive?
>
> I was confident XProc could do that.
>
> Or did I miss something here?
>
> I’m using MorganaXProc-III 1.2.3
>
> By the way, most of the files I have are .doc, not .docx, so if there 
> is a way to extract from docx, I’ll first have to convert them to docx 
> (it seems there’s a Python script for that; I guess I can’t do it from 
> XProc?)
>
> Thanks in advance for your help,
>
> Cheers,
>
> Matthieu Ricaud-Dussarget
>

Received on Tuesday, 8 October 2024 17:59:23 UTC