Re: Are there any open source tools that work with xproc that convert html to well formatted text?

From: Alex Muir <alex.g.muir@gmail.com>
Date: Fri, 18 Feb 2011 17:23:34 +0000
Message-ID: <AANLkTi=oL-gDO4T_GZTfc_rSVMFUg0HWuK+LiVkYc9hQ@mail.gmail.com>
To: "Henry S. Thompson" <ht@inf.ed.ac.uk>
Cc: vojtech.toman@emc.com, xproc-dev@w3.org
Well, as it turns out, the problem was not that w3m was failing inside the
exec. The viewport match was not matching the XPath I supplied: the previous
step, <p:unescape-markup>, puts the unescaped markup in the XHTML namespace,
so the un-namespaced XPath I had for the viewport match,
/Document/html/content/chunk/html, selected nothing, and I wasn't aware of it.
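
A minimal sketch of the same mismatch outside XProc, using Python's standard
xml.etree.ElementTree rather than Calabash. The sample document is an
assumption that mirrors the path above (with a lowercase root element), and
the prefix "h" is an arbitrary name chosen for illustration. It only shows
that an unprefixed name test does not select an element in the XHTML
namespace, while a name test bound to that namespace does.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample: the inner <html> carries the XHTML namespace, as it
# would after p:unescape-markup; the outer elements are in no namespace.
doc = ET.fromstring(
    "<document><html><content><chunk>"
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><div>foo</div></body></html>"
    "</chunk></content></html></document>"
)

# Unprefixed step: looks for an <html> in no namespace under <chunk>.
plain = doc.findall("./html/content/chunk/html")

# Prefixed step: "h" is bound to the XHTML namespace for this query only.
prefixed = doc.findall("./html/content/chunk/h:html", {"h": XHTML_NS})

print(len(plain), len(prefixed))
```

In XProc the equivalent of the second query would be to declare a prefix for
the XHTML namespace on the step and use it in the match expression.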

I'm not certain that the error described in the spec, "It is a dynamic error
<http://www.w3.org/TR/xproc/#dt-dynamic-error> (err:XD0010
<http://www.w3.org/TR/xproc/#err.D0010>) if the match expression on
p:viewport does not match an element or document," is being raised in
Calabash. I wasn't getting an error; the pipeline just kept on trucking.

The two errors in viewport.java are as follows:

        if (match == null || "".equals(match)) {
            error(node, "Match expression on p:viewport must be specified.",
                  XProcConstants.staticError(38));
            valid = false;
        }

        if (outputs.size() == 1) {
            error(node, "A viewport step must have a primary output",
                  XProcConstants.staticError(6));
        }


So this finally worked:

    <p:viewport match="/document/html/content/chunk/*[namespace-uri()='http://www.w3.org/1999/xhtml' and local-name()='html']">

      <p:exec name="exexHTML2Text" command="/usr/bin/w3m"
              source-is-xml="false" result-is-xml="false"
              wrap-result-lines="true" args="-T text/html"/>

    </p:viewport>


On Mon, Feb 7, 2011 at 9:08 PM, Henry S. Thompson <ht@inf.ed.ac.uk> wrote:

> Brief experiment with
>
>  > echo "<html><body><div>foo</div></body></html>"|w3m
>
> suggests you need -T text/html, not -dump.
>
> ht
> --
>       Henry S. Thompson, School of Informatics, University of Edinburgh
>      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
>                Fax: (44) 131 651-1426, e-mail: ht@inf.ed.ac.uk
>                       URL: http://www.ltg.ed.ac.uk/~ht/
>  [mail from me _always_ has a .sig like this -- mail without it is forged
> spam]
>



-- 
Alex
-----
Currently:
Freelance Software Engineer 6+ yrs exp
 <http://www.facebook.com/pages/Bafila/125611807494851>
Previously:
https://sites.google.com/a/utg.edu.gm/alex/


A Bafila, is two rivers flowing together as one:
http://www.facebook.com/pages/Bafila/125611807494851
Received on Friday, 18 February 2011 17:24:12 GMT
