- From: Andy Bunce <bunce.andy@gmail.com>
- Date: Thu, 23 Jan 2025 22:51:37 +0000
- To: "andy.carver@yahoo.com" <andy.carver@yahoo.com>
- Cc: "xproc-dev@w3.org" <xproc-dev@w3.org>
- Message-ID: <CAPH12r3E4XvDPTRsM-n___uCaND0Cm6WnTynM3d+T5qTizG1sA@mail.gmail.com>
Hi Andy,

Sympathy is good, and XProc on its own is not going to help with your problem.

I have sometimes wanted to save HTML pages and ended up with only "the cake stand", so I searched for a solution to this problem. I found this https://stackoverflow.com/a/75659796/3210344 and it works for me. So, if like Wendell you can sometimes live with a manual step, then this could give you some full html local files to use as test data for XProc.

If it is worth automating this step then I would look at something like https://www.npmjs.com/package/puppeteer

/Andy

On Thu, 23 Jan 2025 at 19:08, andy.carver@yahoo.com <andy.carver@yahoo.com> wrote:

> Thanks, guys. I appreciate the sympathy -- and the "cake stand" analogy was perfect!
>
> Cheers
> Andy
>
> On Thursday, January 23, 2025 at 09:51:46 AM MST, Piez, Wendell A. (Fed) <wendell.piez@nist.gov> wrote:
>
> Hello Andy,
>
> Your questions are good ones. When asking for a cake, we do not want a cake stand with a card on it saying “cake goes here”.
>
> On the other hand, the step you are asking for will have to include an Internet-connected HTML rendering platform with a Javascript engine and one that moreover reproduces the quirks of whatever browser you expect it to emulate. In other words, a headless browser that produces what … a serialization of a DOM? For the XProc engine.
>
> BTW, the problem also occurs with PDF. Even if you can save the PDF locally, seeing what’s in it is a different matter. I have a data conversion pipeline in which the first step must be done by hand, using a commercial tool to create an HTML export of the PDF source, which XProc can then work with. This gets me what I need but it is not scalable (fortunately not a requirement this time) due to the wide range of quirky (and broken) HTML/CSS one sees.
>
> In the case of web resources that are not delivered as ‘pages’, but instead as little (or big) computer programs, I don’t believe there is a general solution – or I should say, any ‘general’ will be only relatively general or somewhat general. But I’m also not sure it would be a good thing if there were.
>
> It’s certainly a problem and limitation to keep in mind. XProc’s p:load will be more like curl than it will be like a web browser.
>
> It’s an important observation: thanks for posting.
>
> Regards, Wendell
>
> *From:* andy.carver@yahoo.com <andy.carver@yahoo.com>
> *Sent:* Tuesday, January 21, 2025 6:46 PM
> *To:* xproc-dev@w3.org
> *Subject:* Two questions about grabbing "webpages" with p:load / p:document
>
> Hi folks,
>
> <snip>
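
Before using locally saved files as XProc test data, it may help to check whether a given file is the full page or only the "cake stand". What follows is a minimal Python sketch using only the standard library; it is not something from the thread. The names `visible_text` and `looks_like_cake_stand` and the 200-character threshold are assumptions chosen for illustration, and the heuristic is rough: it only measures how much text sits outside script, style, and similar elements.

```python
from html.parser import HTMLParser


class VisibleTextCollector(HTMLParser):
    """Collect text that is not inside elements a browser would not display."""

    # Elements whose text content is not rendered as page content (assumed list).
    SKIPPED = {"script", "style", "noscript", "template", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside the skipped elements.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the text of the document outside skipped elements, space-joined."""
    collector = VisibleTextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.chunks)


def looks_like_cake_stand(html: str, min_chars: int = 200) -> bool:
    """Guess whether the HTML is an empty shell awaiting JavaScript.

    The threshold is an arbitrary illustrative value, not a standard.
    """
    return len(visible_text(html)) < min_chars
```

A file that passes this check is not guaranteed to be complete, and a short but legitimate page would be flagged, so the threshold would need tuning against real samples.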
Received on Thursday, 23 January 2025 22:51:52 UTC