Re: Two questions about grabbing "webpages" with p:load / p:document from andy.carver@yahoo.com on 2025-01-23 (xproc-dev@w3.org from January 2025)

From: <andy.carver@yahoo.com>
Date: Thu, 23 Jan 2025 19:07:55 +0000 (UTC)
To: "xproc-dev@w3.org" <xproc-dev@w3.org>
Message-ID: <1293586977.1985606.1737659275620@mail.yahoo.com>

 Thanks, guys. I appreciate the sympathy -- and the "cake stand" analogy was perfect!
Cheers,
Andy

    On Thursday, January 23, 2025 at 09:51:46 AM MST, Piez, Wendell A. (Fed) <wendell.piez@nist.gov> wrote:  
 
 
Hello Andy,
 
  
 
Your questions are good ones. When asking for a cake, we do not want a cake stand with a card on it saying “cake goes here”.
 
  
 
On the other hand, the step you are asking for would have to include an Internet-connected HTML rendering platform with a JavaScript engine, and one that moreover reproduces the quirks of whatever browser you expect it to emulate. In other words, a headless browser that produces what … a serialization of a DOM? For the XProc engine.
 
  
 
BTW, the problem also occurs with PDF. Even if you can save the PDF locally, seeing what’s in it is a different matter. I have a data conversion pipeline in which the first step must be done by hand, using a commercial tool to create an HTML export of the PDF source, which XProc can then work with. This gets me what I need but it is not scalable (fortunately not a requirement this time) due to the wide range of quirky (and broken) HTML/CSS one sees.
 
  
 
In the case of web resources that are not delivered as ‘pages’, but instead as little (or big) computer programs, I don’t believe there is a general solution – or I should say, any ‘general’ solution will be only relatively or somewhat general. But I’m also not sure it would be a good thing if there were.
 
  
 
It’s certainly a problem and limitation to keep in mind. XProc’s p:load will be more like curl than it will be like a web browser.
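
To make the curl comparison concrete, here is a minimal Python sketch of what a fetch-only client has to work with. Everything in it is an assumption for illustration: the sample markup, the VisibleText class and the visible_text helper are hypothetical, and nothing here is XProc or how any XProc processor implements p:load. It only shows that when a "page" fills itself in by script, the text a reader would see in a browser is not in the markup that comes back over the wire.

```python
from html.parser import HTMLParser

# Hypothetical response body for a script-driven "page" (invented markup).
# This is all a fetch-only client receives; no script is ever executed.
RAW_RESPONSE = """<html><body>
<div id="app"></div>
<script>document.getElementById("app").textContent = "cake";</script>
</body></html>"""


class VisibleText(HTMLParser):
    """Collect text nodes that sit outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside script/style
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is not script or style source.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(markup: str) -> str:
    """Return the text content present in the markup as delivered."""
    parser = VisibleText()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    # The word "cake" appears only inside the script source, so the
    # delivered document has a stand (the empty div) but no cake.
    print(repr(visible_text(RAW_RESPONSE)))
```

A browser would run the script and show the text; a client that only fetches and parses gets the empty div, which is the cake stand without the cake.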
 
  
 
It’s an important observation: thanks for posting.
 
  
 
Regards, Wendell
 
  
 
From: andy.carver@yahoo.com <andy.carver@yahoo.com> 
Sent: Tuesday, January 21, 2025 6:46 PM
To: xproc-dev@w3.org
Subject: Two questions about grabbing "webpages" with p:load / p:document
 
  
 
Hi folks,
 
  
 <snip>

Received on Thursday, 23 January 2025 19:08:01 UTC