Re: Are there any open source tools that work with xproc that convert html to well formatted text?

From: Alex Muir <alex.g.muir@gmail.com>
Date: Mon, 7 Feb 2011 16:32:06 +0000
Message-ID: <AANLkTi=5xe-jwey=1WLGDmUA7kr3p9B=r2xx3VU=ZMw_@mail.gmail.com>
To: vojtech.toman@emc.com
Cc: xproc-dev@w3.org
Well, this is probably a more appropriate configuration, although it still
does nothing in this case.

     <p:viewport match="/Document/html/content/chunk/html">
       <p:exec command="/usr/bin/w3m"
               source-is-xml="true"
               result-is-xml="false"
               wrap-result-lines="true"
               args="'-dump'">
       </p:exec>
     </p:viewport>

However, when I change the match path to /Document/html/content/chunk, it
does return the HTML document wrapped, although not the text version I'm
looking for. I'm looking further into the source code to figure it out.


Alex



On Thu, Feb 3, 2011 at 6:55 PM, Alex Muir <alex.g.muir@gmail.com> wrote:

>
> I've tried using a p:exec call to w3m within my pipeline, but without
> success.
>
> The HTML text is first converted to XHTML to serve as input for w3m:
>
>          <p:viewport match="/Document/html/content/chunk">
>              <p:unescape-markup>
>                <p:with-option name="content-type" select="'text/html'"/>
>              </p:unescape-markup>
>             </p:viewport>
>
> resulting...
>   <chunk>
>            <html xmlns:html="http://www.w3.org/1999/xhtml">
>                 <head>
> ....
>
> Then, in theory, w3m would convert the XHTML to text. However, at the
> moment it does nothing: the HTML content is left unchanged. I don't get
> any error messages; the pipeline just runs through the following step:
>
>            <p:viewport match="/Documen/html/content/chunk/html">
>               <p:exec command="/usr/libexec/w3m"
>                 source-is-xml="true"
>                 result-is-xml="false"
>                 wrap-result-lines="false">
>               </p:exec>
>             </p:viewport>
>
> I tinkered with paths and other options as well after reading
> http://xprocbook.com/book/refentry-16.html, but the result is always the
> same: no change to the content.
>
> Anyway, I don't really know whether this should work at all in the first
> place, but it seems possible... Any ideas?
>
> Thanks much
> Alex
>
>
> On Tue, Feb 1, 2011 at 12:13 PM, Alex Muir <alex.g.muir@gmail.com> wrote:
>
>> Ah well, that led me to think about lynx, w3m...
>>
>> w3m input.html > out.txt
>>
>> w3m does the job well.
>>
>> Regards
>> Alex
>>
>>
>> On Tue, Feb 1, 2011 at 11:24 AM,  <vojtech.toman@emc.com> wrote:
>> > If you are in an *nix environment, you can also try using p:exec in
>> > combination with the lesspipe.sh script (a preprocessor filter for less
>> > capable of basic HTML "rendering"). But maybe I have completely
>> > misunderstood your requirement.
>> >
>> > Regards,
>> > Vojtech
>> >
>> > --
>> > Vojtech Toman
>> > Consultant Software Engineer
>> > EMC | Information Intelligence Group
>> > vojtech.toman@emc.com
>> > http://developer.emc.com/xmltech
>> >
>> >
>> >> -----Original Message-----
>> >> From: xproc-dev-request@w3.org [mailto:xproc-dev-request@w3.org] On
>> >> Behalf Of Alex Muir
>> >> Sent: Tuesday, February 01, 2011 11:12 AM
>> >> To: mozer
>> >> Cc: XProc Dev
>> >> Subject: Re: Are there any open source tools that work with xproc that
>> >> convert html to well formatted text?
>> >>
>> >> I'm working with some HTML documents whose style looks like, say, a
>> >> straightforward Word document. When I tried saving one as text from
>> >> Firefox, the result looked a lot like the HTML version in terms of the
>> >> spacing of the text content, except for some tables, which were garbage.
>> >> So a subsection in the HTML was still easily recognizable as a
>> >> subsection in the text, because the presentational formatting specified
>> >> in the HTML was preserved in the text output.
>> >>
>> >> I've had more success so far identifying the different textual
>> >> elements of a text document than of HTML, perhaps because HTML has so
>> >> many possible layouts, whereas text is a pretty simple thing to parse
>> >> to identify where a table, section, or subsection is...
>> >>
>> >> Does that clarify what I mean by "well formatted"?
>> >>
>> >> Alex
>> >>
>> >>
>> >> On Tue, Feb 1, 2011 at 9:54 AM, mozer <xmlizer@gmail.com> wrote:
>> >> > Oops, read too fast: I read "well formed".
>> >> >
>> >> > What do you mean by a well formatted text representation?
>> >> >
>> >> > Xmlizer
>> >> >
>> >> > On Tue, Feb 1, 2011 at 10:53 AM, mozer <xmlizer@gmail.com> wrote:
>> >> >> p:unescape-markup
>> >> >> or
>> >> >> p:http-request should do that
>> >> >>
>> >> >> Xmlizer
>> >> >>
>> >> >> On Tue, Feb 1, 2011 at 10:49 AM, Alex Muir <alex.g.muir@gmail.com> wrote:
>> >> >>> Hi,
>> >> >>>
>> >> >>> I'm interested in having a step in a pipeline that converts HTML to a
>> >> >>> well formatted text representation.
>> >> >>>
>> >> >>> Are there any open source tools that do this and fit into XProc?
>> >> >>>
>> >> >>> Thanks
>> >> >>>
>> >> >>> --
>> >> >>> Alex
>> >> >>> -----
>> >> >>> Currently:
>> >> >>> Freelance Software Engineer 6+ yrs exp
>> >> >>>
>> >> >>> Previously:
>> >> >>> https://sites.google.com/a/utg.edu.gm/alex/
>> >> >>>
>> >> >>>
>> >> >>> A Bafila, is two rivers flowing together as one:
>> >> >>> http://www.facebook.com/pages/Bafila/125611807494851
>> >> >>>
>> >> >>>
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >>
>> >
>> >
>> >
>>
>>
>>
>>
>
>
>
>
>
>
>


-- 
Alex
-----
Currently:
Freelance Software Engineer 6+ yrs exp
Previously:
https://sites.google.com/a/utg.edu.gm/alex/


A Bafila, is two rivers flowing together as one:
http://www.facebook.com/pages/Bafila/125611807494851
Received on Monday, 7 February 2011 16:32:38 GMT