- From: <andy.carver@yahoo.com>
- Date: Thu, 23 Jan 2025 19:07:55 +0000 (UTC)
- To: "xproc-dev@w3.org" <xproc-dev@w3.org>
- Message-ID: <1293586977.1985606.1737659275620@mail.yahoo.com>
Thanks, guys. I appreciate the sympathy -- and the "cake stand" analogy was perfect!

Cheers
Andy

On Thursday, January 23, 2025 at 09:51:46 AM MST, Piez, Wendell A. (Fed) <wendell.piez@nist.gov> wrote:

Hello Andy,

Your questions are good ones. When asking for a cake, we do not want a cake stand with a card on it saying “cake goes here”.

On the other hand, the step you are asking for will have to include an Internet-connected HTML rendering platform with a Javascript engine and one that moreover reproduces the quirks of whatever browser you expect it to emulate. In other words, a headless browser that produces what … a serialization of a DOM? For the XProc engine.

BTW, the problem also occurs with PDF. Even if you can save the PDF locally, seeing what’s in it is a different matter. I have a data conversion pipeline in which the first step must be done by hand, using a commercial tool to create an HTML export of the PDF source, which XProc can then work with. This gets me what I need but it is not scalable (fortunately not a requirement this time) due to the wide range of quirky (and broken) HTML/CSS one sees.

In the case of web resources that are not delivered as ‘pages’, but instead as little (or big) computer programs, I don’t believe there is a general solution – or I should say, any ‘general’ will be only relatively general or somewhat general. But I’m also not sure it would be a good thing if there were.

It’s certainly a problem and limitation to keep in mind. XProc’s p:load will be more like curl than it will be like a web browser.

It’s an important observation: thanks for posting.

Regards,
Wendell

From: andy.carver@yahoo.com <andy.carver@yahoo.com>
Sent: Tuesday, January 21, 2025 6:46 PM
To: xproc-dev@w3.org
Subject: Two questions about grabbing "webpages" with p:load / p:document

Hi folks,

<snip>
Received on Thursday, 23 January 2025 19:08:01 UTC