Re: Two questions about grabbing "webpages" with p:load / p:document from Andy Bunce on 2025-01-23 (xproc-dev@w3.org from January 2025)

From: Andy Bunce <bunce.andy@gmail.com>
Date: Thu, 23 Jan 2025 22:51:37 +0000
To: "andy.carver@yahoo.com" <andy.carver@yahoo.com>
Cc: "xproc-dev@w3.org" <xproc-dev@w3.org>
Message-ID: <CAPH12r3E4XvDPTRsM-n___uCaND0Cm6WnTynM3d+T5qTizG1sA@mail.gmail.com>

Hi Andy,

Sympathy is good, and XProc on its own is not going to help with your
problem.

I have sometimes wanted to save HTML pages and ended up with only "the cake
stand", so I searched for a solution to this problem.
I found this answer, https://stackoverflow.com/a/75659796/3210344, and it
works for me.

So if, like Wendell, you can sometimes live with a manual step, then this
could give you some complete local HTML files to use as test data for XProc.
If it is worth automating this step, then I would look at something like
https://www.npmjs.com/package/puppeteer
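
As a rough way to tell whether a saved file is a full page or only the
"cake stand", something like the sketch below might help. It is only an
illustration using Python's standard library: the names visible_text and
SHELL are made up for this example, the sample page is invented, and the
heuristic (collect the text that exists without running any script) is an
assumption of mine, not something taken from the Stack Overflow answer or
from puppeteer.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text found outside <script> and <style> elements.

    This approximates what a non-browser client (curl, or a plain load of
    the resource) gets to see: no JavaScript is executed.
    """

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the text a reader would find in the markup as delivered."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical "cake stand": the content only appears once a browser
# runs the script, so the delivered markup carries no text at all.
SHELL = """<!DOCTYPE html>
<html><body><div id="app"></div>
<script>document.getElementById("app").textContent = "Here is the cake";</script>
</body></html>"""
```

Calling visible_text(SHELL) gives an empty string, while a page saved after
rendering (with the text already in the markup) gives its text back. An
empty or near-empty result is a hint, not proof, that the file still needs
a browser to fill it in.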

/Andy


On Thu, 23 Jan 2025 at 19:08, andy.carver@yahoo.com <andy.carver@yahoo.com>
wrote:

> Thanks, guys. I appreciate the sympathy -- and the "cake stand" analogy
> was perfect!
>
> Cheers
> Andy
>
>
> On Thursday, January 23, 2025 at 09:51:46 AM MST, Piez, Wendell A. (Fed) <
> wendell.piez@nist.gov> wrote:
>
>
> Hello Andy,
>
>
>
> Your questions are good ones. When asking for a cake, we do not want a
> cake stand with a card on it saying “cake goes here”.
>
>
>
> On the other hand, the step you are asking for will have to include an
> Internet-connected HTML rendering platform with a Javascript engine and one
> that moreover reproduces the quirks of whatever browser you expect it to
> emulate. In other words, a headless browser that produces what … a
> serialization of a DOM? For the XProc engine.
>
>
>
> BTW, the problem also occurs with PDF. Even if you can save the PDF
> locally, seeing what’s in it is a different matter. I have a data
> conversion pipeline in which the first step must be done by hand, using a
> commercial tool to create an HTML export of the PDF source, which XProc can
> then work with. This gets me what I need but it is not scalable
> (fortunately not a requirement this time) due to the wide range of quirky
> (and broken) HTML/CSS one sees.
>
>
>
> In the case of web resources that are not delivered as ‘pages’, but
> instead as little (or big) computer programs, I don’t believe there is a
> general solution – or I should say, any ‘general’ will be only relatively
> general or somewhat general. But I’m also not sure it would be a good thing
> if there were.
>
>
>
> It’s certainly a problem and limitation to keep in mind. XProc’s p:load
> will be more like curl than it will be like a web browser.
>
>
>
> It’s an important observation: thanks for posting.
>
>
>
> Regards, Wendell
>
>
>
> *From:* andy.carver@yahoo.com <andy.carver@yahoo.com>
> *Sent:* Tuesday, January 21, 2025 6:46 PM
> *To:* xproc-dev@w3.org
> *Subject:* Two questions about grabbing "webpages" with p:load /
> p:document
>
>
>
> Hi folks,
>
>
> <snip>
>

Received on Thursday, 23 January 2025 22:51:52 UTC