- From: <list.mu@c-moria.com>
- Date: Tue, 8 Oct 2024 22:46:26 +0200
- To: <xproc-dev@w3.org>
- Message-ID: <018301db19c3$25f08b40$71d1a1c0$@c-moria.com>
See below a message that Wendell just sent to me privately (I assume it was intended for the group).

From: Piez, Wendell A. (Fed) <wendell.piez@nist.gov>
Sent: Tuesday, 8 October 2024 21:54
To: list.mu@c-moria.com
Subject: RE: Extract XML from docx file with xproc 3.0 p:unarchive

Hello,

This also seems like a problem that XProc can't solve, but try/catch could mitigate considerably.

Cheers, Wendell

From: list.mu@c-moria.com <list.mu@c-moria.com>
Sent: Tuesday, October 8, 2024 2:48 PM
To: 'Matthieu RICAUD-DUSSARGET' <m.ricaud-dussarget@lefebvre-dalloz.fr>; xproc-dev@w3.org
Subject: RE: Extract XML from docx file with xproc 3.0 p:unarchive

Hi Matthieu,

A docx archive is just like a zip archive. I have a few pipelines running that do all sorts of things to docx archive files.

I have seen the "XC0085 error : Error processing ZIP archive: zip END header not found" a few times, and on every occasion it was because the "docx" was not a real "docx" but a "doc" that had been renamed or poorly transformed (or poorly generated by a process other than Microsoft Word).

A good test, before you start digging into the code, is to rename the docx causing the error to ".zip" and check whether WinZip (or a similar tool) can open it.

If you are in a Windows environment and you have a Word installation, you could set up a batch job that uses winword.exe to open each file and save it as docx with a macro (I can dig up the code for you if you want). I used to do it like this in a pre-production phase.

I am using Aspose Words these days (nothing can beat that library, in my opinion):
https://products.aspose.com/words/
It does a good transformation from doc to docx, but it also does some normalisation on the generated docx.

I don't set the content type in the load, by the way, and I set the format explicitly on the unarchive (but I believe that is the default):

    <p:load href="{concat($input-path, $fname)}"></p:load>
    <p:unarchive format="zip">
      <p:with-option name="include-filter" select="."/>
    </p:unarchive>

I hope this helps somewhat.

Best regards
Geert Bormans

From: Matthieu RICAUD-DUSSARGET <m.ricaud-dussarget@lefebvre-dalloz.fr>
Sent: Tuesday, 8 October 2024 19:31
To: xproc-dev@w3.org
Subject: Extract XML from docx file with xproc 3.0 p:unarchive

Hi,

I have to convert a large number of docx files into a specific XML format. I wrote the XSLT that converts myFile.docx!/word/document.xml after extracting it manually.

I'd like to use XProc to loop over a full directory of docx files, extract each document.xml, apply the XSLT and validate the result.

After looping over each file of the directory, I do:

    <p:variable name="docx.uri" select="ancestor::c:directory/base-uri(.) || c:file/base-uri(.)"/>
    <p:load href="{docx.uri}" name="load" content-type="application/vnd.openxmlformats-officedocument.wordprocessingml.document"/>
    <p:unarchive>
      <p:with-input>
        <p:pipe step="load" port="result"/>
      </p:with-input>
    </p:unarchive>

At this point (p:unarchive) I get an XC0085 error: Error processing ZIP archive: zip END header not found

I tried different content types, like application/zip, but I still get the same error.

Does that mean it's not possible to extract a .docx archive just like a zip archive? I was confident XProc could do that. Or did I miss something here?

I'm using MorganaXProc-III 1.2.3.

By the way, most of the files I have are .doc, not .docx, so if there is a solution for extracting from docx, I'll first have to convert them to docx (it seems there's a Python script for that; I guess I can't do it from XProc?).

Thanks in advance for your help,

Cheers,
Matthieu Ricaud-Dussarget
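
For reference, the try/catch mitigation mentioned in this thread might look roughly like the sketch below. It is only an illustration under stated assumptions, not code from any of the posters: the `docx-uri` option, the `skipped` element and the include-filter regular expression are hypothetical names chosen for the example, and the pipeline has not been run against MorganaXProc-III. It uses only the standard XProc 3.0 steps `p:load` and `p:unarchive` inside `p:try`, with a `p:catch` restricted to the `err:XC0085` code quoted above.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:err="http://www.w3.org/ns/xproc-error"
                version="3.0">

  <!-- Hypothetical option: absolute URI of one candidate .docx file -->
  <p:option name="docx-uri" required="true"/>

  <p:output port="result" sequence="true"/>

  <p:try>
    <!-- Load the file as a binary document, then try to open it as a zip -->
    <p:load href="{$docx-uri}" content-type="application/zip"/>
    <!-- Assumed filter: keep only the main document part of the archive -->
    <p:unarchive format="zip" include-filter="^word/document\.xml$"/>

    <!-- Only archive-processing failures are caught; other errors still propagate -->
    <p:catch code="err:XC0085">
      <p:identity>
        <p:with-input>
          <!-- Hypothetical marker element reporting the file that was skipped -->
          <skipped href="{$docx-uri}" reason="not a readable zip archive"/>
        </p:with-input>
      </p:identity>
    </p:catch>
  </p:try>

</p:declare-step>
```

Called once per file from a directory loop, a step like this would let the remaining files be processed while the marker elements list the inputs that are probably renamed or badly generated .doc files and need converting first.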
Received on Tuesday, 8 October 2024 20:46:33 UTC