- From: Alex Muir <alex.g.muir@gmail.com>
- Date: Thu, 3 Feb 2011 18:55:56 +0000
- To: vojtech.toman@emc.com
- Cc: xproc-dev@w3.org
- Message-ID: <AANLkTikFVG1fmEZeOaEy0nx1PZqo5nR7g2X6Nk8zkC5L@mail.gmail.com>
I've tried implementing this using a p:exec call to w3m within my pipeline, but
without success.
The HTML text is first converted to XHTML to serve as input for w3m:
<p:viewport match="/Document/html/content/chunk">
  <p:unescape-markup>
    <p:with-option name="content-type" select="'text/html'"/>
  </p:unescape-markup>
</p:viewport>
resulting in:
<chunk>
<html xmlns:html="http://www.w3.org/1999/xhtml">
<head>
....
In theory, w3m would then convert the XHTML to text. At the moment, however,
it does nothing: the HTML content is left unchanged. I don't get any error
messages; the process just keeps running. The step is as follows:
<p:viewport match="/Documen/html/content/chunk/html">
  <p:exec command="/usr/libexec/w3m"
          source-is-xml="true"
          result-is-xml="false"
          wrap-result-lines="false">
  </p:exec>
</p:viewport>
I tinkered with paths and other options as well after reading
http://xprocbook.com/book/refentry-16.html, but the result is always the same:
no change to the content.
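One way the step might be made to work, offered as an untested sketch rather
than a confirmed fix. It assumes the w3m executable lives at /usr/bin/w3m (on
some systems /usr/libexec/w3m may be a directory of helper programs rather
than the browser itself), and it passes w3m's -dump and -T text/html options
through the args option of p:exec, so that w3m treats the piped input as HTML
and writes the rendered text to standard output. It also spells the match
pattern /Document, as in the first viewport; the pattern quoted above reads
/Documen, and if that spelling is in the real pipeline, the viewport would
match nothing and leave the content unchanged.

```xml
<!-- Sketch only. The p: prefix is assumed to be bound to
     http://www.w3.org/ns/xproc by the enclosing pipeline. -->
<p:viewport match="/Document/html/content/chunk/html">
  <!-- command: assumed location of the w3m binary; adjust for your system.
       args: split on spaces by default, so this passes three arguments:
       -dump, -T and text/html. -->
  <p:exec command="/usr/bin/w3m"
          args="-dump -T text/html"
          source-is-xml="true"
          result-is-xml="false"
          wrap-result-lines="false"/>
</p:viewport>
```

With result-is-xml="false", p:exec returns the command's output as text
wrapped in a c:result element, so inside the viewport each matched html
element would be replaced by a c:result holding the plain-text rendering.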
Anyway, I don't really know whether this should work at all in the first place,
but it seems possible. Any ideas?
Thanks much
Alex
On Tue, Feb 1, 2011 at 12:13 PM, Alex Muir <alex.g.muir@gmail.com> wrote:
> Ah well, that led me to think about lynx, w3m...
>
> w3m input.html > out.txt
>
> w3m does the job well.
>
> Regards
> Alex
>
>
> On Tue, Feb 1, 2011 at 11:24 AM, <vojtech.toman@emc.com> wrote:
> > If you are in an *nix environment, you can also try using p:exec in
> combination with the lesspipe.sh script (a preprocessor filter for less
> capable of basic HTML "rendering"). But maybe I have completely
> misunderstood your requirement.
> >
> > Regards,
> > Vojtech
> >
> > --
> > Vojtech Toman
> > Consultant Software Engineer
> > EMC | Information Intelligence Group
> > vojtech.toman@emc.com
> > http://developer.emc.com/xmltech
> >
> >
> >> -----Original Message-----
> >> From: xproc-dev-request@w3.org [mailto:xproc-dev-request@w3.org] On
> >> Behalf Of Alex Muir
> >> Sent: Tuesday, February 01, 2011 11:12 AM
> >> To: mozer
> >> Cc: XProc Dev
> >> Subject: Re: Are there any open source tools that work with xproc that
> >> convert html to well formatted text?
> >>
> >> I'm working with some HTML documents styled much like a
> >> straightforward Word document. When I tried saving one as text from
> >> Firefox, the result looked a lot like the HTML version in terms of the
> >> spacing of the text content, except for some tables, which were garbage.
> >> So a subsection in the HTML was still easily identified as a
> >> subsection in the text, because the presentational formatting specified
> >> in the HTML was preserved in the text output.
> >>
> >> I've had more success so far identifying the different textual
> >> elements of a text document than of HTML, perhaps because HTML has so many
> >> possible layouts, whereas text is a pretty simple thing to parse
> >> to identify where a table is, or where a section or subsection is...
> >>
> >> Does that make clear what I mean by "well formatted"?
> >>
> >> Alex
> >>
> >>
> >> On Tue, Feb 1, 2011 at 9:54 AM, mozer <xmlizer@gmail.com> wrote:
> >> > Oops, I read too fast: I read "well formed".
> >> >
> >> > What do you mean by a well formatted text representation?
> >> >
> >> > Xmlizer
> >> >
> >> > On Tue, Feb 1, 2011 at 10:53 AM, mozer <xmlizer@gmail.com> wrote:
> >> >> p:unescape-markup
> >> >> or
> >> >> p:http-request should do that
> >> >>
> >> >> Xmlizer
> >> >>
> >> >> On Tue, Feb 1, 2011 at 10:49 AM, Alex Muir <alex.g.muir@gmail.com>
> >> wrote:
> >> >>> Hi,
> >> >>>
> >> >>> I'm interested to have a step in a pipeline that converts HTML to a
> >> >>> well formatted text representation.
> >> >>>
> >> >>> Are there any open source tools that do that that fit into xproc?
> >> >>>
> >> >>> Thanks
> >> >>>
> >> >>> --
> >> >>> Alex
> >> >>> -----
> >> >>> Currently:
> >> >>> Freelance Software Engineer 6+ yrs exp
> >> >>>
> >> >>> Previously:
> >> >>> https://sites.google.com/a/utg.edu.gm/alex/
> >> >>>
> >> >>>
> >> >>> A Bafila, is two rivers flowing together as one:
> >> >>> http://www.facebook.com/pages/Bafila/125611807494851
> >> >>>
> >> >>>
> >> >>
> >> >
> >>
> >
> >
> >
>
--
Alex
-----
Currently:
Freelance Software Engineer 6+ yrs exp
Previously:
https://sites.google.com/a/utg.edu.gm/alex/
A Bafila, is two rivers flowing together as one:
http://www.facebook.com/pages/Bafila/125611807494851
Received on Thursday, 3 February 2011 18:56:29 UTC