RE: Two questions about grabbing "webpages" with p:load / p:document

Hello Andy,

Your questions are good ones. When asking for a cake, we do not want a cake stand with a card on it saying "cake goes here".

On the other hand, the step you are asking for would have to include an Internet-connected HTML rendering platform with a JavaScript engine, and one that moreover reproduces the quirks of whatever browser you expect it to emulate. In other words, a headless browser that produces, for the XProc engine, something like a serialization of the rendered DOM.
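If you want to see what such a chain might look like in practice, here is a sketch -- only a sketch, and full of assumptions: it presumes a local Chrome or Chromium install (whose documented --headless and --dump-dom switches print the DOM as serialized after scripts have run), and 'rendered.html' is just a name I made up:

    <!-- Step 0, run outside XProc (e.g. from a .bat file):
           chrome --headless --dump-dom "https://www.wyoleg.gov/Legislators/2025/H" > rendered.html
         This writes out the DOM as it stands *after* the page's scripts have run. -->
    <p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
      <!-- parse the browser's output as HTML; from here on it is ordinary XML -->
      <p:load href="rendered.html" content-type="text/html"/>
      <p:store href="rendered.xhtml"/>
    </p:declare-step>

Whether that counts as one pipeline is debatable, since the rendering happens outside XProc; but it does put the post-script DOM where p:load can reach it.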

BTW, the problem also occurs with PDF. Even if you can save the PDF locally, seeing what's in it is a different matter. I have a data conversion pipeline in which the first step must be done by hand, using a commercial tool to create an HTML export of the PDF source, which XProc can then work with. This gets me what I need, but it is not scalable (fortunately not a requirement this time), given the wide range of quirky (and broken) HTML/CSS one sees.

In the case of web resources that are delivered not as 'pages' but as little (or big) computer programs, I don't believe there is a general solution; or I should say, any 'general' solution will be only relatively or somewhat general. But I'm also not sure it would be a good thing if there were one.

It's certainly a problem and a limitation to keep in mind: XProc's p:load will be more like curl than like a web browser.
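To illustrate with the URL from your example (a minimal sketch; 'as-served.html' is just my file name):

    <p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
      <!-- fetch and parse the HTML exactly as the server hands it over;
           no script is executed anywhere in this pipeline -->
      <p:load href="https://www.wyoleg.gov/Legislators/2025/H"
              content-type="text/html"/>
      <p:store href="as-served.html"/>
    </p:declare-step>

Modulo HTML parsing and re-serialization, what lands in as-served.html is essentially what curl would have fetched: the near-empty shell you describe, with the legislator data still on the far side of those AJAX calls.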

It's an important observation: thanks for posting.

Regards, Wendell

From: andy.carver@yahoo.com <andy.carver@yahoo.com>
Sent: Tuesday, January 21, 2025 6:46 PM
To: xproc-dev@w3.org
Subject: Two questions about grabbing "webpages" with p:load / p:document

Hi folks,

I've got two questions about p:load's ability to capture "webpages" (p:document seems to behave the same way).

These questions are in the context of a divergence (in the general case) between what actually results from the step in the pipeline (viz., the raw HTML as delivered by the web server) and what one sees in the browser window (i.e., the completed, resulting DOM tree). That is to say, this is a divergence I currently see in both MorganaXProc-III and XML Calabash 3.x.

(I imply no disparagement at all by pointing to this divergence; I'm really quite impressed that this step handles even HTTPS -- getting and decrypting the HTML document.)

So then,

1. Is this (bare-bones HTML) output from the p:load step actually all the spec (or other XProc documentation) has in mind when speaking of the ability of p:load (or p:document) to retrieve documents (i.e., "webpages") from the Web?

2. Assuming that's all the spec has in mind -- but also that what the user might desire (or even expect) is a whole lot more, when visions of getting-and-loading "webpages" fill one's head -- is there, so to speak, some other link one can add to the chain (even, say, some NPM package one calls from the command line), some other step in one's pipeline perhaps, that will perform, or get a browser to perform, the DOM completion (including any AJAX calls) and return the actual HTML that is built within and displayed by a web browser?

A simple example:

Say one wants to process, as XML, some data about Wyoming state legislators, as seen at https://www.wyoleg.gov/Legislators/2025/H . The naive newbie (such as myself?) might expect all this lovely data to be captured by p:load -- for lovely XML processing -- and be chagrined to discover that the HTML captured is so elementary that, when opened in a web browser, it shows a blank, white screen. For the lovely data is not in the HTML as served -- not, that is, until some AJAX query (or queries) retrieves it from the server and adds it to the DOM.

I will mention that I'm a Windows user, so I apologize if the answer to 2 is kindergarten stuff to Linux gurus. :D In any case, I'm hoping for a solution that will (eventually) work on Windows.

Many thanks,

Andy

Received on Thursday, 23 January 2025 16:51:50 UTC