RE: Extract XML from docx file with xproc 3.0 p:unarchive from list.mu@c-moria.com on 2024-10-08 (xproc-dev@w3.org from October 2024)

From: <list.mu@c-moria.com>
Date: Tue, 8 Oct 2024 20:48:09 +0200
To: "'Matthieu RICAUD-DUSSARGET'" <m.ricaud-dussarget@lefebvre-dalloz.fr>, <xproc-dev@w3.org>
Message-ID: <013901db19b2$9f8358e0$de8a0aa0$@c-moria.com>

Hi Matthieu,

 

A docx archive is just like a zip archive.

 

I have a few pipelines running that do all sorts of things to docx archive
files.

I have seen the "XC0085 error : Error processing ZIP archive: zip END header
not found" a few times, and on every occasion it was because the "docx" was
not a real docx but a "doc" that had been renamed or poorly converted
(or poorly generated by a process other than Microsoft Word).

 

A good test is to rename the docx causing the error to ".zip" and check
whether WinZip (or similar) can open it, before you start digging into the
code.
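
The same check can be scripted when there are many files. Below is a minimal
sketch in Python, not part of my pipelines: the function name
classify_word_file and its return labels are made up for illustration. It
assumes a legacy .doc starts with the OLE2 compound file signature and that a
real docx is a zip containing word/document.xml.

```python
import zipfile
from pathlib import Path

# Signature of the OLE2 compound file format used by legacy .doc files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def classify_word_file(path):
    """Classify a file by its content, ignoring the extension.

    Returns one of the (hypothetical) labels:
    'docx', 'doc', 'zip-other' or 'unknown'.
    """
    path = Path(path)
    with path.open("rb") as fh:
        head = fh.read(8)
    if head == OLE2_MAGIC:
        # A binary .doc, possibly just renamed to .docx.
        return "doc"
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            # A real docx carries its main part at word/document.xml.
            if "word/document.xml" in zf.namelist():
                return "docx"
        return "zip-other"
    return "unknown"
```

Anything that does not come back as 'docx' is likely to fail in p:unarchive
too, so it can be set aside for conversion before the pipeline runs.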

 

If you are in a Windows environment and you have Word installed, you
could set up a batch job that uses winword.exe to open each file and save it
as docx with a macro

(I can dig up the code for you if you want)

I used to do it like this in a pre-production phase

 

I am using Aspose.Words these days (nothing can beat that library in my
opinion)

https://products.aspose.com/words/

It does a good transformation from doc to docx, but it also does some
normalisation on the generated docx

 

By the way, I don't set the content type on the load

And I set the format explicitly on the unarchive (but I believe that is the
default)

 

<p:load href="{concat($input-path, $fname)}"/>

<p:unarchive format="zip">
    <p:with-option name="include-filter" select="."/>
</p:unarchive>

 

I hope this helps somewhat

 

Best regards

 

Geert Bormans

 

From: Matthieu RICAUD-DUSSARGET <m.ricaud-dussarget@lefebvre-dalloz.fr> 
Sent: Tuesday, 8 October 2024 19:31
To: xproc-dev@w3.org
Subject: Extract XML from docx file with xproc 3.0 p:unarchive

 

Hi, 

 

I have to convert a large number of docx files into a specific XML format.

I wrote the XSLT that converts myFile.docx!/word/document.xml after
extracting it manually.

I'd like to use XProc to loop over a full directory of docx files, extract
each document.xml, apply the XSLT and validate the result.

 

After looping over each file in the directory, I do:

 

<p:variable name="docx.uri" select="ancestor::c:directory/base-uri(.) || c:file/base-uri(.)"/>

<p:load href="{docx.uri}" name="load" content-type="application/vnd.openxmlformats-officedocument.wordprocessingml.document"/>

<p:unarchive>
    <p:with-input>
      <p:pipe step="load" port="result"/>
    </p:with-input>
</p:unarchive>

 

At this point (p:unarchive) I get an XC0085 error: Error processing ZIP
archive: zip END header not found

I tried different content types, such as application/zip, but I still get
the same error.

 

Does that mean it's not possible to extract a .docx archive just like a zip
archive?

I was confident XProc could do that.

Or did I miss something here?

 

I'm using MorganaXProc-III 1.2.3

 

By the way, most of the files I have are .doc, not .docx, so if there is a
solution for extracting from docx, I'll first have to convert them to docx
(there seems to be a Python script for that; I guess I can't do it from
XProc?)

 

Thanks in advance for your help,

Cheers,

 

Matthieu Ricaud-Dussarget

Received on Tuesday, 8 October 2024 18:48:16 UTC