- From: <list.mu@c-moria.com>
- Date: Tue, 8 Oct 2024 20:48:09 +0200
- To: "'Matthieu RICAUD-DUSSARGET'" <m.ricaud-dussarget@lefebvre-dalloz.fr>, <xproc-dev@w3.org>
- Message-ID: <013901db19b2$9f8358e0$de8a0aa0$@c-moria.com>
Hi Matthieu,

A docx archive is just like a zip archive. I have a few pipelines running that do all sorts of things to docx archive files.

I have seen the "XC0085 error : Error processing ZIP archive: zip END header not found" a few times, and on every occasion it was because the "docx" was not a real "docx" but a "doc" that had been renamed or poorly transformed (or poorly generated by a process other than Microsoft Word).

A good test is to rename the docx causing the error to ".zip" and check whether WinZip (or similar) can open it, before you start digging into the code.

If you are on a Windows environment and you have a Word installation, you could set up a batch job that uses winword.exe to open each file and save it as docx with a macro (I can dig up the code for you if you want). I used to do it like this in a pre-production phase.

I am using Aspose Word these days (nothing can beat that library in my opinion):
https://products.aspose.com/words/
It does a good transformation from doc to docx, but it also does some normalisation on the generated docx.

I don't set the content type in the load, by the way. And I set the format explicitly on the unarchive (but I believe that is the default):

    <p:load href="{concat($input-path, $fname)}"></p:load>
    <p:unarchive format="zip">
      <p:with-option name="include-filter" select="."/>
    </p:unarchive>

I hope this helps somewhat.

Best regards
Geert Bormans

From: Matthieu RICAUD-DUSSARGET <m.ricaud-dussarget@lefebvre-dalloz.fr>
Sent: Tuesday, 8 October 2024 19:31
To: xproc-dev@w3.org
Subject: Extract XML from docx file with xproc 3.0 p:unarchive

Hi,

I have to convert a large number of docx files into a specific XML format. I wrote the XSLT that converts myFile.docx!/word/document.xml after extracting it manually.

I'd like to use XProc to loop over a full directory of docx files, extract each document.xml, apply the XSLT and validate the result.

After looping over each file of the directory I do:

    <p:variable name="docx.uri" select="ancestor::c:directory/base-uri(.) || c:file/base-uri(.)"/>
    <p:load href="{docx.uri}" name="load" content-type="application/vnd.openxmlformats-officedocument.wordprocessingml.document"/>
    <p:unarchive>
      <p:with-input>
        <p:pipe step="load" port="result"/>
      </p:with-input>
    </p:unarchive>

At this point (p:unarchive) I get an XC0085 error : Error processing ZIP archive: zip END header not found

I tried different content types such as application/zip, but I still get the same error.

Does that mean it's not possible to extract a .docx archive just like a zip archive? I was confident XProc could do that. Or did I miss something here?

I'm using MorganaXProc-III 1.2.3

By the way, most of the files I have are .doc, not .docx, so if there is a solution for extracting from docx, I'll first have to convert them to docx (it seems there's a Python script for it; I guess I can't do it from XProc?)

Thanks in advance for your help,
Cheers,
Matthieu Ricaud-Dussarget
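
The "rename to .zip and see whether it opens" test suggested in the reply can also be scripted when there are many files to check. The following is only a minimal sketch in Python, not something taken from the thread: the function name classify and its return labels are hypothetical, and it assumes that a legacy .doc starts with the usual OLE compound-file signature and that a real docx is a zip archive containing word/document.xml.

```python
import zipfile
from pathlib import Path

# Signature of an OLE compound file, the container used by legacy .doc files.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def classify(path):
    """Return 'docx', 'zip', 'doc' or 'unknown', judged by content, not extension.

    The function name and the labels are illustrative only.
    """
    path = Path(path)
    with path.open("rb") as fh:
        head = fh.read(len(OLE_MAGIC))
    if head == OLE_MAGIC:
        # Most likely a .doc that was renamed to .docx.
        return "doc"
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            # A WordprocessingML package keeps its main part here.
            if "word/document.xml" in zf.namelist():
                return "docx"
        return "zip"
    return "unknown"
```

Calling classify on each file of the directory before running the pipeline would separate the files that p:unarchive can be expected to open ('docx') from those that need converting first ('doc') or a closer look ('zip', 'unknown').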
Received on Tuesday, 8 October 2024 18:48:16 UTC