- From: Norman Walsh <ndw@nwalsh.com>
- Date: Fri, 01 May 2009 09:05:34 -0400
- To: XProc Dev <xproc-dev@w3.org>
- Message-ID: <m28wlhdlcx.fsf@nwalsh.com>
Philip Fennell <Philip.Fennell@bbc.co.uk> writes:
> That's exactly what I'm not trying to do. I'm wanting to invoke Tidy on
> an HTML document that is not well-formed XML so that I can do further
> processing on it.

I understand that use case, but I don't understand what you plan to
pass *to* your tidy:html step. The things that appear on p:input ports
*must* be well-formed XML.

> Therefore I need to use p:data to get hold of a
> non-XML document. My problem is that p:data, and p:document for that
> matter, do not allow you to use p:with-option so that you can use an
> expression (XPath) instead of a string literal (URI).
>
> I can get around the p:document problem by using p:load, but I cannot
> see an equivalent for this particular use-case; and I imagine it will be
> a popular use-case if people want to use XProc to build legacy-content
> conversion pipelines where they may have large amounts of pages in
> HTML/SGML or do screen-scraping off of existing web sites.

The workaround is p:http-request, even though that's inelegant in some
ways. From 2.2.2 Non-XML Documents:

  It is not a goal of XProc that it should be a general-purpose
  pipeline language for manipulating arbitrary, non-XML resources.

  There are two standard ways that a non-XML document may enter a
  pipeline: directly through p:data or as the result of performing a
  p:http-request step. Loading non-XML data with a computed URI
  requires the p:http-request step. Implementors are encouraged to
  support the file: URI scheme so that users can load local data from
  computed URIs.

So, if you have the computed URI of a document in $uri, you can load
it with p:http-request:

  <p:http-request method="get">
    <p:with-option name="href" select="$uri"/>
  </p:http-request>

Implementors are encouraged to make that work for file: URIs as well as
http(s): URIs. XML Calabash supports it.

                                        Be seeing you,
                                          norm

P.S. What, you may ask, possessed the WG to use p:*HTTP*-request to
load documents from file: URIs? On the one hand, adding another step to
load from file: URIs would have largely reproduced the p:http-request
step, and on the other, there wasn't any obviously better name for
p:http-request.

-- 
Norman Walsh <ndw@nwalsh.com> | One stops being a child when one
http://nwalsh.com/            | realizes that telling one's trouble
                              | does not make it better.--Cesare Pavese
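
For readers who want to try this as a complete pipeline, here is a minimal sketch. It is not taken from the message above. It assumes the p:http-request signature as published in the XProc 1.0 Recommendation, where the request is described by a c:request document on the source port and not by options on the step. The option name "uri" is hypothetical, and whether file: URIs work depends on the implementation.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                version="1.0">

  <!-- Hypothetical option holding the computed URI of the non-XML document -->
  <p:option name="uri" required="true"/>

  <!-- With no explicit binding, this output carries the result of the last step -->
  <p:output port="result"/>

  <!-- Build the request document, setting href from the computed value -->
  <p:add-attribute match="/c:request" attribute-name="href">
    <p:input port="source">
      <p:inline>
        <c:request method="get"/>
      </p:inline>
    </p:input>
    <p:with-option name="attribute-value" select="$uri"/>
  </p:add-attribute>

  <!-- Perform the request; a non-XML response body comes back wrapped in c:body -->
  <p:http-request/>

</p:declare-step>
```

The wrapped c:body result is what makes the content legal on a port, so a later step (such as a tidy step) would have to read its input from that wrapper.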
Received on Friday, 1 May 2009 13:06:18 UTC