Re: Are there any open source tools that work with xproc that convert html to well formatted text?

From: Alex Muir <alex.g.muir@gmail.com>
Date: Tue, 1 Feb 2011 12:13:37 +0000
Message-ID: <AANLkTimZ_shpDmg_GX5GgDqWN5jo8pu2UxNYW6azCwSO@mail.gmail.com>
To: vojtech.toman@emc.com
Cc: xproc-dev@w3.org
Ah well, that led me to think about lynx, w3m...

w3m input.html > out.txt

w3m does the job well.
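
For use in a script, the command above could be wrapped in a small shell function. This is only a sketch: the function name html_to_text and the default width of 80 columns are assumptions made for illustration, and it spells out w3m's -dump, -T and -cols options explicitly rather than relying on the redirect alone.

```shell
# Hypothetical helper (not from the thread): render an HTML file as plain text.
#   $1  path to the HTML file
#   $2  optional output width in columns (assumed default: 80)
html_to_text() {
    # -dump writes the formatted page to stdout instead of opening the pager,
    # -T forces the content type so the file extension does not matter,
    # -cols sets the line width used for the dumped text.
    w3m -dump -T text/html -cols "${2:-80}" "$1"
}
```

It would be called as html_to_text input.html > out.txt. From a pipeline, a function or script like this could presumably be run through p:exec, as Vojtech suggests for lesspipe.sh, though that has not been tried here.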

Regards
Alex


On Tue, Feb 1, 2011 at 11:24 AM,  <vojtech.toman@emc.com> wrote:
> If you are in a *nix environment, you can also try using p:exec in combination with the lesspipe.sh script (a preprocessor filter for less that is capable of basic HTML "rendering"). But maybe I have completely misunderstood your requirement.
>
> Regards,
> Vojtech
>
> --
> Vojtech Toman
> Consultant Software Engineer
> EMC | Information Intelligence Group
> vojtech.toman@emc.com
> http://developer.emc.com/xmltech
>
>
>> -----Original Message-----
>> From: xproc-dev-request@w3.org [mailto:xproc-dev-request@w3.org] On
>> Behalf Of Alex Muir
>> Sent: Tuesday, February 01, 2011 11:12 AM
>> To: mozer
>> Cc: XProc Dev
>> Subject: Re: Are there any open source tools that work with xproc that
>> convert html to well formatted text?
>>
>> I'm working with some HTML documents styled much like a straightforward
>> Word document. When I saved one as text from Firefox, the spacing of
>> the text content looked a lot like the HTML version, except for some
>> tables, which came out as garbage. So a subsection in the HTML was
>> still easily recognizable as a subsection in the text, because the
>> presentational formatting specified in the HTML was preserved in the
>> text output.
>>
>> So far I've had more success identifying the different textual
>> elements of a text document than of HTML, perhaps because HTML allows
>> so many possible layouts, whereas text is a pretty simple thing to
>> parse to identify where a table, section, or subsection is.
>>
>> Does that explain what I mean by "well formatted"?
>>
>> Alex
>>
>>
>> On Tue, Feb 1, 2011 at 9:54 AM, mozer <xmlizer@gmail.com> wrote:
>> > Oops, I read too fast: I read "well formed".
>> >
>> > What do you mean by a well formatted text representation?
>> >
>> > Xmlizer
>> >
>> > On Tue, Feb 1, 2011 at 10:53 AM, mozer <xmlizer@gmail.com> wrote:
>> >> p:unescape-markup
>> >> or
>> >> p:http-request should do that
>> >>
>> >> Xmlizer
>> >>
>> >> On Tue, Feb 1, 2011 at 10:49 AM, Alex Muir <alex.g.muir@gmail.com>
>> wrote:
>> >>> Hi,
>> >>>
>> >>> I'm interested to have a step in a pipeline that converts HTML to a
>> >>> well formatted text representation.
>> >>>
>> >>> Are there any open source tools that do that and fit into XProc?
>> >>>
>> >>> Thanks
>> >>>
>> >>> --
>> >>> Alex
>> >>> -----
>> >>> Currently:
>> >>> Freelance Software Engineer 6+ yrs exp
>> >>>
>> >>> Previously:
>> >>> https://sites.google.com/a/utg.edu.gm/alex/
>> >>>
>> >>>
>> >>> A Bafila, is two rivers flowing together as one:
>> >>> http://www.facebook.com/pages/Bafila/125611807494851
>> >>>
>> >>>
>> >>
>> >
>>
>>
>>
>>
>
>
>



Received on Tuesday, 1 February 2011 12:14:09 GMT