
RE: Are there any open source tools that work with xproc that convert html to well formatted text?

From: <vojtech.toman@emc.com>
Date: Tue, 1 Feb 2011 06:24:07 -0500
To: <xproc-dev@w3.org>
Message-ID: <3799D0FD120AD940B731A37E36DAF3FE32B14E77E3@MX20A.corp.emc.com>
If you are in a *nix environment, you can also try using p:exec in combination with the lesspipe.sh script (an input preprocessor for less that is capable of basic HTML "rendering"). But maybe I have completely misunderstood your requirement.
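
A minimal sketch of that suggestion, assuming an XProc 1.0 processor that implements the optional p:exec step, that lesspipe.sh is on the PATH, and that it has an HTML-capable helper available. The file name page.html and the step name html-to-text are hypothetical:

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                version="1.0" name="html-to-text">
  <!-- The result is a c:result element wrapping the command's text output. -->
  <p:output port="result"/>

  <!-- Run "lesspipe.sh page.html"; page.html is a placeholder path.
       result-is-xml="false" tells the processor to treat stdout as
       plain text instead of trying to parse it as XML. -->
  <p:exec command="lesspipe.sh" args="page.html" result-is-xml="false">
    <!-- Nothing is sent to the command's standard input. -->
    <p:input port="source">
      <p:empty/>
    </p:input>
  </p:exec>
</p:declare-step>
```

The args value is split on spaces by default, so a path containing spaces would need the arg-separator option. How well tables and headings survive depends entirely on whichever HTML helper lesspipe.sh finds on the system.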

Regards,
Vojtech

--
Vojtech Toman
Consultant Software Engineer
EMC | Information Intelligence Group
vojtech.toman@emc.com
http://developer.emc.com/xmltech


> -----Original Message-----
> From: xproc-dev-request@w3.org [mailto:xproc-dev-request@w3.org] On
> Behalf Of Alex Muir
> Sent: Tuesday, February 01, 2011 11:12 AM
> To: mozer
> Cc: XProc Dev
> Subject: Re: Are there any open source tools that work with xproc that
> convert html to well formatted text?
> 
> I'm working with some HTML documents styled much like a straightforward
> Word document. When I tried saving one as text from Firefox, the result
> looked a lot like the HTML version in terms of the spacing of the text
> content, except for some tables, which came out as garbage. So a
> subsection in the HTML was still easily recognizable as a subsection in
> the text, because the presentational formatting specified in the HTML
> was preserved in the text output.
> 
> So far I've had more success identifying the different textual
> elements of a text document than of an HTML one, perhaps because HTML
> allows so many possible layouts, whereas text is a pretty simple thing
> to parse to identify where a table, section, or subsection is.
> 
> Does that clarify what I mean by "well formatted"?
> 
> Alex
> 
> 
> On Tue, Feb 1, 2011 at 9:54 AM, mozer <xmlizer@gmail.com> wrote:
> > Oops, I read too fast: I read "well formed".
> >
> > What do you mean by a well formatted text representation?
> >
> > Xmlizer
> >
> > On Tue, Feb 1, 2011 at 10:53 AM, mozer <xmlizer@gmail.com> wrote:
> >> p:unescape-markup
> >> or
> >> p:http-request should do that
> >>
> >> Xmlizer
> >>
> >> On Tue, Feb 1, 2011 at 10:49 AM, Alex Muir <alex.g.muir@gmail.com> wrote:
> >>> Hi,
> >>>
> >>> I'd like to have a step in a pipeline that converts HTML to a
> >>> well formatted text representation.
> >>>
> >>> Are there any open source tools that do this and fit into XProc?
> >>>
> >>> Thanks
> >>>
> >>> --
> >>> Alex
> >>> -----
> >>> Currently:
> >>> Freelance Software Engineer 6+ yrs exp
> >>>
> >>> Previously:
> >>> https://sites.google.com/a/utg.edu.gm/alex/
> >>>
> >>>
> >>> A Bafila, is two rivers flowing together as one:
> >>> http://www.facebook.com/pages/Bafila/125611807494851
> >>>
> >>>
> >>
> >
> 
Received on Tuesday, 1 February 2011 11:24:59 GMT
