Re: Extract XML from docx file with xproc 3.0 p:unarchive from Andrew Sales on 2024-10-09 (xproc-dev@w3.org from October 2024)

From: Andrew Sales <andrew@andrewsales.com>
Date: Wed, 9 Oct 2024 16:47:33 +0100
To: "Piez, Wendell A. (Fed)" <wendell.piez@nist.gov>
Cc: XProc Dev <xproc-dev@w3.org>
Message-ID: <CAGD-QPzh4Bt1uWDxCicdS59CcBi6EmXCVooYh=sxO7n=MjscnQ@mail.gmail.com>
Hello,

Great stuff - anything that helps with converting OOXML is a boon.

As a Schematronist with a re-kindled interest in XProc (wannabe
XProcker??), I've taken the liberty of adding links to Matthieu and
Wendell's excellent work to the Awesome Schematron repository[1].
Just let me know if the way they are referred to there should be changed
at all.

Thanks,
Andrew

[1]
https://github.com/Schematron/awesome-schematron?tab=readme-ov-file#applications

On Wed, 9 Oct 2024 at 15:31, Piez, Wendell A. (Fed) <wendell.piez@nist.gov>
wrote:

> Matthieu:
>
>
>
> You make an excellent point about running Schematron over the XProc. I am
> doing the same in my project at https://github.com/usnistgov/oscal-xproc3
> - saving me many, many hours debugging.
>
>
>
> The Schematrons are also applied to the XProc files under CI/CD, i.e.
> whenever they are pushed into the repository, using Morgana under GitHub
> Actions. Another pipeline under CI/CD runs XSpec test suites in XProc 3.0.
> Everything is public domain / open source.
>
>
>
> So yes! There are lots of good ideas out there … I’ve lifted most of mine
> from elsewhere. 😊
>
>
>
> Although I’m not sure I’ll be using the “./” trick except *in extremis*….
>
>
>
> Can I have a Schematron to warn me when I am sending email to you or Geert
> when I mean to write to the list? (Don’t look now, the AIs are coming.)
>
>
>
> Regards, Wendell
>
>
>
> *From:* Matthieu RICAUD-DUSSARGET <m.ricaud-dussarget@lefebvre-dalloz.fr>
> *Sent:* Wednesday, October 9, 2024 3:28 AM
> *To:* Piez, Wendell A. (Fed) <wendell.piez@nist.gov>
> *Subject:* RE: Extract XML from docx file with xproc 3.0 p:unarchive
>
>
>
> Hi Wendell,
>
>
>
> Thanks for your response. Yes, I guess I’ll add a try/catch around the whole
> process, especially the XSLT, which might crash depending on the Word
> content.
>
>
>
> Using href="./{expr}" looks a bit strange, but I would have spotted my
> mistake sooner ;)
>
> While coding in XSLT within Oxygen, I get a warning when I use a variable
> name without $; this is a Schematron check.
>
> I think a good IDE for developing XProc might help avoid such typos.
>
> I did develop a Schematron to check XSLT quality (
> https://github.com/mricaud/xslt-quality); maybe doing the same for XProc
> might help!
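
A rough sketch of what such a Schematron check for XProc could look like, assuming an XSLT 2.0 query binding. The pattern id and the message wording are made up for illustration, and the test is only a heuristic: it flags any href that contains a value template but no "$" at all, so a legitimate expression without variables (a bare function call, say) would be flagged too. That is why it reports a warning rather than an error.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron"
  queryBinding="xslt2">

  <sch:ns prefix="p" uri="http://www.w3.org/ns/xproc"/>

  <!-- Hypothetical pattern: catch href="{docx.uri}" typed instead of
       href="{$docx.uri}" -->
  <sch:pattern id="avt-missing-dollar">
    <!-- Any XProc element whose href contains a value template -->
    <sch:rule context="p:*[contains(@href, '{')]">
      <!-- Heuristic only: no "$" anywhere in the attribute value -->
      <sch:report role="warning" test="not(contains(@href, '$'))">
        The href "<sch:value-of select="@href"/>" contains a value template
        but no "$". Is a variable reference missing its "$"?
      </sch:report>
    </sch:rule>
  </sch:pattern>

</sch:schema>
```

The same rule could be widened to other attributes that take value templates (message, path and so on) if this one proves useful.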
>
>
>
> Thanks for your feedback and good ideas!
>
>
>
>
>
> *Cheers*
>
> *Matthieu Ricaud*
>
> *From:* Piez, Wendell A. (Fed) <wendell.piez@nist.gov>
> *Sent:* Tuesday, October 8, 2024 11:55 PM
> *To:* Matthieu RICAUD-DUSSARGET <m.ricaud-dussarget@lefebvre-dalloz.fr>
> *Subject:* RE: Extract XML from docx file with xproc 3.0 p:unarchive
>
>
>
>
> Matthieu -- oops!
>
>
>
> I wrote also to suggest try/catch for you, but it appears the email went
> only to Geert. (Sorry Geert.)
>
>
>
> Probably not the last time – and of course it wouldn’t have solved this
> problem, only helped to mitigate similar problems caused by actual errors
> in inputs, not errors in the code.
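
For illustration, a minimal sketch of wrapping the load and unarchive steps in p:try, so that one unreadable file is reported and skipped rather than stopping the whole run. The input-dir default and the skipped and failed report elements are hypothetical. err:XC0085 is the code quoted further down this thread for ZIP failures; other processors or other failure modes may raise different codes, which the second p:catch is meant to pick up.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
  xmlns:c="http://www.w3.org/ns/xproc-step"
  xmlns:err="http://www.w3.org/ns/xproc-error"
  version="3.0">

  <p:output port="result" sequence="true"/>

  <!-- Hypothetical default: a directory of .docx files next to this pipeline -->
  <p:option name="input-dir"
    select="resolve-uri('input-word/', static-base-uri())"/>

  <p:directory-list path="{$input-dir}" include-filter="\.docx$"/>

  <p:for-each>
    <p:with-input select="//c:file"/>
    <p:variable name="docx.uri" select="c:file/base-uri(.)"/>

    <p:try>
      <!-- The steps that may fail on a damaged or mislabeled input -->
      <p:load href="{$docx.uri}"/>
      <p:unarchive format="zip" include-filter="word/document\.xml"/>

      <!-- The archive could not be read as ZIP: emit a marker and carry on -->
      <p:catch code="err:XC0085">
        <p:identity>
          <p:with-input>
            <skipped href="{$docx.uri}" reason="not a readable ZIP archive"/>
          </p:with-input>
        </p:identity>
      </p:catch>

      <!-- Any other error, for example the file could not be loaded -->
      <p:catch>
        <p:identity>
          <p:with-input>
            <failed href="{$docx.uri}"/>
          </p:with-input>
        </p:identity>
      </p:catch>
    </p:try>
  </p:for-each>

</p:declare-step>
```

As noted above, this would not have caught the missing "$" itself; it only keeps a batch going when individual inputs are bad.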
>
>
>
> For that matter I have also been bitten by the fallback to read the XProc
> at path “”.
>
>
>
> One thing that occurred to me would be to prefix any href that takes a
> relative path with "./":
>
>
>
> <p:load href="./{expr}"/>
>
>
>
> And this does error out (in Morgana) if ‘expr’ evaluates to the empty
> string.
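
To make the trick concrete, here is a small sketch under stated assumptions: the doc-path option and its default value are hypothetical, and the path is taken to be relative to the pipeline file. If the "$" were dropped, the value template would yield "./" instead of the empty string, so the href would name the pipeline's directory rather than silently resolving to the pipeline document itself.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">

  <p:output port="result"/>

  <!-- Hypothetical option: a path relative to this pipeline file -->
  <p:option name="doc-path" select="'data/sample.xml'"/>

  <!-- Intended: resolves to data/sample.xml beside the pipeline.
       With the typo href="./{doc-path}" the value would be "./",
       a directory, which p:load should not be able to read as a
       document. -->
  <p:load href="./{$doc-path}"/>

</p:declare-step>
```

One caveat: the prefix only suits relative paths. An absolute URI, such as the value returned by base-uri() on a c:file element, would no longer resolve correctly once "./" is put in front of it.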
>
>
>
> But part of me says this is bad form (it feels strange and awkward), and I
> should just rely on runtime messaging to expose the values for debugging.
>
>
>
> Comments?
>
>
>
> Regards, Wendell
>
>
>
> *From:* Matthieu RICAUD-DUSSARGET <m.ricaud-dussarget@lefebvre-dalloz.fr>
> *Sent:* Tuesday, October 8, 2024 5:00 PM
> *To:* list.mu@c-moria.com; xproc-dev@w3.org
> *Subject:* RE: Extract XML from docx file with xproc 3.0 p:unarchive
>
>
>
> Hi all,
>
>
>
> Thanks for your responses!
>
>
>
> Christophe: yes, the only docx file I have in the directory for the moment
> is valid: I can open it with 7-Zip and get the XML document inside.
>
> It’s actually a .doc which I converted with MS Word ("save as
> docx").
>
>
>
> Geert, thanks for all the details. Yes, I might be interested in your script
> (though I have about 4 million .docx files to convert!)
>
> I’ll also have a look at Aspose.
>
>
>
> About my pipeline: thanks to your help I found the problem, which was ... so
> silly!
>
> I forgot a $ before the docx.uri variable reference in <p:load
> href="{docx.uri}" … />
>
> => The href was empty, so I guess the fallback is to take the current
> XProc file as the default, which raises err:XC0085 "Cannot process document
> with media-type 'application/xproc+xml' as a ZIP archive"
>
> Then, when I added explicit bindings to understand what was going on, and
> specified (forced) the content-type …
>
> I finally got the err:XC0085 "Error processing ZIP archive: zip END
> header not found", which confused me even more!
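
A small sketch of why the href came out empty, assuming the processor displays step messages. Without the "$", docx.uri is read as an XPath step selecting a child element named docx.uri of the context item, and as no such element exists the template yields an empty string. The variable value and the inline doc element here are hypothetical.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">

  <!-- A default input so that the value templates have a context item -->
  <p:input port="source">
    <doc/>
  </p:input>
  <p:output port="result"/>

  <!-- Hypothetical value standing in for the real file URI -->
  <p:variable name="docx.uri" select="'sample.docx'"/>

  <!-- First template: a path expression looking for a docx.uri element.
       Second template: the variable reference that was intended. -->
  <p:identity message="without $: [{docx.uri}] with $: [{$docx.uri}]"/>

</p:declare-step>
```

An empty href is a relative reference that resolves to the base URI, here the pipeline file, which fits the 'application/xproc+xml' media type in the first error message.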
>
>
>
> Sorry about that, guys!
>
>
>
> As a reminder for later, here is my really simple XPL that works (it
> displays the XML inside the docx):
>
>
>
> <p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
>   xmlns:c="http://www.w3.org/ns/xproc-step"
>   xmlns:xs="http://www.w3.org/2001/XMLSchema"
>   version="3.0">
>
>   <p:input port="source" sequence="true"/>
>   <p:output port="result" sequence="true"/>
>
>   <p:option name="input-dir" select="resolve-uri('../../test/input-word',
>     static-base-uri())" as="xs:string"/>
>
>   <p:directory-list path="{$input-dir}" include-filter="\.docx$"/>
>
>   <p:for-each>
>     <p:with-input select="//c:file"/>
>     <p:variable name="docx.uri" select="c:file/base-uri(.)"/>
>     <p:identity message="Processing {$docx.uri}"/>
>     <p:load href="{$docx.uri}"/>
>     <p:unarchive format="zip" include-filter="word/document\.xml"/>
>   </p:for-each>
>
> </p:declare-step>
>
>
>
> *Cheers,*
>
> *Matthieu Ricaud*
>
>
>
Received on Thursday, 10 October 2024 10:21:27 UTC