
Re: Are there any open source tools that work with xproc that convert html to well formatted text?

From: Alex Muir <alex.g.muir@gmail.com>
Date: Thu, 3 Feb 2011 18:55:56 +0000
Message-ID: <AANLkTikFVG1fmEZeOaEy0nx1PZqo5nR7g2X6Nk8zkC5L@mail.gmail.com>
To: vojtech.toman@emc.com
Cc: xproc-dev@w3.org
I've tried using a p:exec call to w3m within my pipeline, but
without success.

The HTML text is first converted to XHTML to serve as input for w3m:

         <p:viewport match="/Document/html/content/chunk">
             <p:unescape-markup>
               <p:with-option name="content-type" select="'text/html'"/>
             </p:unescape-markup>
            </p:viewport>

resulting...
  <chunk>
           <html xmlns:html="http://www.w3.org/1999/xhtml">
                <head>
....

In theory, w3m would then convert the XHTML to text. However, at the moment
it does nothing: the HTML content is left unchanged. I don't get any error
messages; the process just keeps running. The step follows:

           <p:viewport match="/Document/html/content/chunk/html">
              <p:exec command="/usr/libexec/w3m"
                source-is-xml="true"
                result-is-xml="false"
                wrap-result-lines="false">
              </p:exec>
            </p:viewport>

I tinkered with paths and other options as well after reading
http://xprocbook.com/book/refentry-16.html, but the result is always the
same: no change to the content.

Anyway, I don't really know if this should work at all in the first place, but
it seems possible... Any ideas?
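
A quick check outside the pipeline can show whether w3m converts HTML that
arrives on standard input, which is how p:exec hands its source document to
the command. The following is only a minimal sketch, assuming w3m is installed
and on the PATH; the function name html_to_text and the file names in the
comment are hypothetical.

```shell
#!/bin/sh
# Convert HTML read from standard input to plain text on standard output.
# Assumption: w3m is installed and reachable through PATH.
#   -dump          write the rendered page to stdout instead of opening the pager
#   -T text/html   treat the piped input as HTML
html_to_text() {
    w3m -dump -T text/html
}

# Example call (hypothetical file names):
#   html_to_text < input.html > out.txt
```

If this works from the shell, the same arguments could be supplied to the step
through the args option of p:exec (for example args="-dump -T text/html"),
assuming the command option points at the w3m executable itself. It is also
worth checking the match pattern against the actual document, since a
p:viewport whose pattern matches nothing passes its input through unchanged.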

Thanks much
Alex

On Tue, Feb 1, 2011 at 12:13 PM, Alex Muir <alex.g.muir@gmail.com> wrote:

> Ah well, that led me to think about lynx, w3m...
>
> w3m input.html > out.txt
>
> w3m does the job well.
>
> Regards
> Alex
>
>
> On Tue, Feb 1, 2011 at 11:24 AM,  <vojtech.toman@emc.com> wrote:
> > If you are in an *nix environment, you can also try using p:exec in
> > combination with the lesspipe.sh script (a preprocessor filter for less
> > capable of basic HTML "rendering"). But maybe I have completely
> > misunderstood your requirement.
> >
> > Regards,
> > Vojtech
> >
> > --
> > Vojtech Toman
> > Consultant Software Engineer
> > EMC | Information Intelligence Group
> > vojtech.toman@emc.com
> > http://developer.emc.com/xmltech
> >
> >
> >> -----Original Message-----
> >> From: xproc-dev-request@w3.org [mailto:xproc-dev-request@w3.org] On
> >> Behalf Of Alex Muir
> >> Sent: Tuesday, February 01, 2011 11:12 AM
> >> To: mozer
> >> Cc: XProc Dev
> >> Subject: Re: Are there any open source tools that work with xproc that
> >> convert html to well formatted text?
> >>
> >> I'm working with some HTML documents styled much like a
> >> straightforward Word document. When I tried saving one as text
> >> from Firefox, the result looked a lot like the HTML version in terms of
> >> the spacing of the text content, except for some tables, which were
> >> garbage. So a subsection in the HTML was still easily identified as a
> >> subsection in the text, because the presentational formatting specified
> >> in the HTML was preserved in the text output.
> >>
> >> I've had more success so far identifying the different textual
> >> elements of a text document than of HTML, perhaps because HTML has so
> >> many possible layouts, whereas text is a pretty simple thing to parse
> >> to identify where a table is or where a section or subsection is...
> >>
> >> Does that clarify what I mean by well formatted?
> >>
> >> Alex
> >>
> >>
> >> On Tue, Feb 1, 2011 at 9:54 AM, mozer <xmlizer@gmail.com> wrote:
> >> > Oops, read too fast: I read "well formed"
> >> >
> >> > What do you mean by well formatted text representation?
> >> >
> >> > Xmlizer
> >> >
> >> > On Tue, Feb 1, 2011 at 10:53 AM, mozer <xmlizer@gmail.com> wrote:
> >> >> p:unescape-markup
> >> >> or
> >> >> p:http-request should do that
> >> >>
> >> >> Xmlizer
> >> >>
> >> >> On Tue, Feb 1, 2011 at 10:49 AM, Alex Muir <alex.g.muir@gmail.com> wrote:
> >> >>> Hi,
> >> >>>
> >> >>> I'm interested to have a step in a pipeline that converts HTML to a
> >> >>> well formatted text representation.
> >> >>>
> >> >>> Are there any open source tools that do that that fit into xproc?
> >> >>>
> >> >>> Thanks
> >> >>>
> >> >>> --
> >> >>> Alex
> >> >>> -----
> >> >>> Currently:
> >> >>> Freelance Software Engineer 6+ yrs exp
> >> >>>
> >> >>> Previously:
> >> >>> https://sites.google.com/a/utg.edu.gm/alex/
> >> >>>
> >> >>>
> >> >>> A Bafila, is two rivers flowing together as one:
> >> >>> http://www.facebook.com/pages/Bafila/125611807494851
> >> >>>
> >> >>>
> >> >>
> >> >
> >>
> >>
> >>
> >>
> >
> >
> >
>
>
>
>



Received on Thursday, 3 February 2011 18:56:29 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Thursday, 3 February 2011 18:56:30 GMT