- From: Piez, Wendell A. (Fed) <wendell.piez@nist.gov>
- Date: Wed, 9 Oct 2024 14:30:38 +0000
- To: XProc Dev <xproc-dev@w3.org>
- Message-ID: <SA9PR09MB5824905F51E747F435199771FF7F2@SA9PR09MB5824.namprd09.prod.outlook.com>
Matthieu:

You make an excellent point about running Schematron over the XProc. I am doing the same in my project at https://github.com/usnistgov/oscal-xproc3, and it is saving me many, many hours of debugging. The Schematrons are also applied to the XProc files under CI/CD, i.e. whenever they are pushed into the repository, using Morgana under GitHub Actions. Another pipeline under CI/CD runs XSpec test suites in XProc 3.0. Everything is public domain / open source.

So yes! There are lots of good ideas out there … I've lifted most of mine from elsewhere. 😊 Although I'm not sure I'll be using the "./" trick except *in extremis*….

Can I have a Schematron to warn me when I am sending email to you or Geert when I mean to write to the list? (Don't look now, the AIs are coming.)

Regards, Wendell

From: Matthieu RICAUD-DUSSARGET <m.ricaud-dussarget@lefebvre-dalloz.fr>
Sent: Wednesday, October 9, 2024 3:28 AM
To: Piez, Wendell A. (Fed) <wendell.piez@nist.gov>
Subject: RE: Extract XML from docx file with xproc 3.0 p:unarchive

Hi Wendell,

Thanks for your response. Yes, I guess I'll add a try/catch around the whole process, especially the XSLT, which might crash depending on the Word content.

Using href="./{expr}" looks a bit strange, but with it I would have spotted my mistake sooner ;)

While coding in XSLT within Oxygen, I get a warning when I use a variable name without $; this comes from a Schematron check. I think a good IDE for developing XProc might help avoid such typos. I did develop a Schematron to check XSLT quality (https://github.com/mricaud/xslt-quality); maybe doing the same for XProc would help!

Thanks for your feedback and good ideas!

Cheers
Matthieu Ricaud

From: Piez, Wendell A. (Fed) <wendell.piez@nist.gov>
Sent: Tuesday, October 8, 2024 23:55
To: Matthieu RICAUD-DUSSARGET <m.ricaud-dussarget@lefebvre-dalloz.fr>
Subject: RE: Extract XML from docx file with xproc 3.0 p:unarchive

Matthieu: oops! I also wrote to suggest try/catch for you, but it appears the email went only to Geert. (Sorry Geert.) Probably not the last time, and of course it wouldn't have solved this problem, only helped to mitigate similar problems caused by actual errors in inputs, not errors in the code.

For that matter, I have also been bitten by the fallback to read the XProc itself when the path is "" (empty). One thing that occurred to me would be to prepend "./" to any href so that it is always a relative path:

    <p:load href="./{expr}"/>

And this does error out (in Morgana) if 'expr' evaluates to the empty string.

But part of me says this is bad form (it feels strange and awkward), and I should just rely on runtime messaging to expose the values for debugging.

Comments?

Regards, Wendell

From: Matthieu RICAUD-DUSSARGET <m.ricaud-dussarget@lefebvre-dalloz.fr>
Sent: Tuesday, October 8, 2024 5:00 PM
To: list.mu@c-moria.com; xproc-dev@w3.org
Subject: RE: Extract XML from docx file with xproc 3.0 p:unarchive

Hi all,

Thanks for your responses!

Christophe: yes, the only docx file I have in the directory for the moment is valid: I can open it with 7-Zip and get the XML document inside. It's actually a .doc which I converted with my MS Word ("Save as docx").

Geert, thanks for all the details. Yes, I might be interested in your script (though I have about 4 million .docx files to convert!). I'll also have a look at Aspose.

About my pipeline: thanks to your help I found the problem, which was so silly! I forgot a $ before the docx.uri variable reference in <p:load href="{docx.uri}" … />. The href was therefore empty, so I guess the fallback is to take the current XProc file as the default, which raises err:XC0085 "Cannot process document with media-type 'application/xproc+xml' as a ZIP archive".

Then, when I added an explicit binding to understand what was going on, and specified (forced) the content-type, I finally got err:XC0085 "Error processing ZIP archive: zip END header not found", which confused me even more!

Sorry for that, guys!

As a reminder for later, here is my really simple xpl that works (it displays the XML inside the docx):

    <p:declare-step
        xmlns:p="http://www.w3.org/ns/xproc"
        xmlns:c="http://www.w3.org/ns/xproc-step"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        version="3.0">

      <p:input port="source" sequence="true"/>
      <p:output port="result" sequence="true"/>

      <p:option name="input-dir" select="resolve-uri('../../test/input-word', static-base-uri())" as="xs:string"/>

      <p:directory-list path="{$input-dir}" include-filter="\.docx$"/>

      <p:for-each>
        <p:with-input select="//c:file"/>
        <p:variable name="docx.uri" select="c:file/base-uri(.)"/>
        <p:identity message="Processing {$docx.uri}"/>
        <p:load href="{$docx.uri}"/>
        <p:unarchive format="zip" include-filter="word/document\.xml"/>
      </p:for-each>

    </p:declare-step>

Cheers,
Matthieu Ricaud
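
The try/catch idea raised in this thread could look roughly like the fragment below. It is only a sketch, not anyone's actual pipeline: it wraps the p:load and p:unarchive steps of the working pipeline above in p:try, and the c:error message in the p:catch branch is an invented placeholder. It assumes the fragment sits inside the p:for-each, where $docx.uri is in scope and the p and c prefixes are bound as in the p:declare-step.

```xml
<!-- Sketch only: would replace the p:load / p:unarchive pair inside the p:for-each. -->
<p:try>
  <p:load href="{$docx.uri}"/>
  <p:unarchive format="zip" include-filter="word/document\.xml"/>
  <p:catch>
    <!-- Hypothetical fallback: emit a placeholder document for this file
         instead of letting one bad archive stop the whole run. -->
    <p:identity>
      <p:with-input>
        <c:error>Could not extract word/document.xml from {$docx.uri}</c:error>
      </p:with-input>
    </p:identity>
  </p:catch>
</p:try>
```

As noted in the thread, a guard like this mitigates failures caused by bad input files; it does not help find a typo such as a missing $ in the pipeline itself.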
Received on Wednesday, 9 October 2024 14:30:50 UTC