RE: Extract XML from docx file with xproc 3.0 p:unarchive from Erik Siegel on 2024-10-09 (xproc-dev@w3.org from October 2024)

From: Erik Siegel <erik@xatapult.nl>
Date: Wed, 9 Oct 2024 15:03:11 +0200
To: "'Matthieu RICAUD-DUSSARGET'" <m.ricaud-dussarget@lefebvre-dalloz.fr>, <xproc-dev@w3.org>
Message-ID: <000601db1a4b$98ef4a90$cacddfb0$@xatapult.nl>
Be warned: the link you give below points to a very old, no longer maintained xtpxlib repo with *XProc 1.0* code.

 

From: Matthieu RICAUD-DUSSARGET <m.ricaud-dussarget@lefebvre-dalloz.fr> 
Sent: Wednesday, 9 October 2024 10:20
To: xproc-dev@w3.org
Subject: RE: Extract XML from docx file with xproc 3.0 p:unarchive

 

Hi Erik, 

 

Thanks a lot for pointing out your xtpxlib; it might be really useful for me and for others, I guess.

Following the xtpxlib home page, I actually found the code for processing docx and Excel documents at:

https://github.com/xatapult/xtpxlib/tree/master/ms-office

 

Thanks again for all the nice work on XProc, Erik!

 

Cheers, 

Matthieu Ricaud

From: Erik Siegel <erik@xatapult.nl>
Sent: Wednesday, 9 October 2024 08:12
To: Matthieu RICAUD-DUSSARGET <m.ricaud-dussarget@lefebvre-dalloz.fr>; list.mu@c-moria.com; xproc-dev@w3.org
Subject: RE: Extract XML from docx file with xproc 3.0 p:unarchive

 


Hi Matthieu,

 

For what it’s worth: I have published open source ready-to-run XProc 3 code that interprets a .xlsx and .docx file into something more manageable: https://xoffice.xtpxlib.org/

 

Maybe that helps (even if only as an example).

 

Erik Siegel

 

From: Matthieu RICAUD-DUSSARGET <m.ricaud-dussarget@lefebvre-dalloz.fr>
Sent: Tuesday, 8 October 2024 23:00
To: list.mu@c-moria.com; xproc-dev@w3.org
Subject: RE: Extract XML from docx file with xproc 3.0 p:unarchive

 

Hi all, 

 

Thanks for your responses !

 

Christophe: yes, the only docx file I have in the directory for the moment is valid: I can open it with 7-Zip and get the XML document inside.

It’s actually a .doc which I converted with MS Word ("Save as docx").

 

Geert, thanks for all the details. Yes, I might be interested in your script (though I have about 4 million .docx files to convert!).

I’ll also have a look at Aspose.

 

About my pipeline: thanks to your help I found the problem, which was ... so dumb!

I forgot a $ before the docx.uri variable reference in <p:load href="{docx.uri}" … />

=> Without the $, {docx.uri} is read as an XPath path to a child element named docx.uri, so the href was empty. I guess the fallback is then to take the current XProc file as the default, which raises err:XC0085 "Cannot process document with media-type 'application/xproc+xml' as a ZIP archive".

Then, when I added an explicit binding to understand what was going on, and then specified (and forced) the content-type …

I finally got err:XC0085 "Error processing ZIP archive: zip END header not found", which made me even more confused!
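
For illustration, the effect of the missing $ can be reproduced in a small stand-alone pipeline. This is only a sketch and not part of the original pipeline: the sample URI file:///tmp/sample.docx, the inline c:file document and the attribute names without-dollar and with-dollar are all made up for the example. Without the $, {docx.uri} is an XPath path expression looking for a child element named docx.uri; nothing matches, so the attribute value template expands to an empty string.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
  xmlns:c="http://www.w3.org/ns/xproc-step"
  version="3.0">

  <p:output port="result"/>

  <!-- Hypothetical value, standing in for the URI of a real .docx file -->
  <p:variable name="docx.uri" select="'file:///tmp/sample.docx'"/>

  <!-- Made-up context document, similar to one entry of a directory listing -->
  <p:identity>
    <p:with-input>
      <c:file name="sample.docx"/>
    </p:with-input>
  </p:identity>

  <!-- No $: selects child elements named docx.uri (there are none), so the value is empty -->
  <p:add-attribute attribute-name="without-dollar" attribute-value="{docx.uri}"/>

  <!-- With $: the variable reference expands to the URI string -->
  <p:add-attribute attribute-name="with-dollar" attribute-value="{$docx.uri}"/>

</p:declare-step>
```

Run with an XProc 3.0 processor, this should give a c:file element whose without-dollar attribute is empty and whose with-dollar attribute holds the URI. An empty href on p:load is a relative reference that resolves to the base URI, which would explain why the pipeline file itself ended up being loaded.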

 

Sorry about that, guys!

 

As a reminder for later, here is my really simple XPL that works (it displays the XML inside the docx):

 

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
  xmlns:c="http://www.w3.org/ns/xproc-step"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  version="3.0">

  <p:input port="source" sequence="true"/>
  <p:output port="result" sequence="true"/>

  <p:option name="input-dir" select="resolve-uri('../../test/input-word', static-base-uri())" as="xs:string"/>

  <p:directory-list path="{$input-dir}" include-filter="\.docx$"/>

  <p:for-each>
    <p:with-input select="//c:file"/>
    <p:variable name="docx.uri" select="c:file/base-uri(.)"/>
    <p:identity message="Processing {$docx.uri}"/>
    <p:load href="{$docx.uri}"/>
    <p:unarchive format="zip" include-filter="word/document\.xml"/>
  </p:for-each>

</p:declare-step>

 

Cheers, 

Matthieu Ricaud
Received on Wednesday, 9 October 2024 13:04:20 UTC