- From: Alex Muir <alex.g.muir@gmail.com>
- Date: Tue, 1 Feb 2011 12:13:37 +0000
- To: vojtech.toman@emc.com
- Cc: xproc-dev@w3.org
Ah well, that led me to think about lynx, w3m...

w3m input.html > out.txt

w3m does the job well.

Regards
Alex

On Tue, Feb 1, 2011 at 11:24 AM, <vojtech.toman@emc.com> wrote:
> If you are in an *nix environment, you can also try using p:exec in
> combination with the lesspipe.sh script (a preprocessor filter for less
> capable of basic HTML "rendering"). But maybe I have completely
> misunderstood your requirement.
>
> Regards,
> Vojtech
>
> --
> Vojtech Toman
> Consultant Software Engineer
> EMC | Information Intelligence Group
> vojtech.toman@emc.com
> http://developer.emc.com/xmltech
>
>
>> -----Original Message-----
>> From: xproc-dev-request@w3.org [mailto:xproc-dev-request@w3.org] On
>> Behalf Of Alex Muir
>> Sent: Tuesday, February 01, 2011 11:12 AM
>> To: mozer
>> Cc: XProc Dev
>> Subject: Re: Are there any open source tools that work with xproc that
>> convert html to well formatted text?
>>
>> I'm working with some HTML documents whose style looks like, say, a
>> straightforward Word document. When I tried saving one as text from
>> Firefox, the result looked a lot like the HTML version in terms of the
>> spacing of the text content, except for some tables, which were garbage.
>> So a subsection in the HTML was still easily determined to be a
>> subsection in the text, because the presentational formatting specified
>> in the HTML was preserved in the text output.
>>
>> I've found more success thus far identifying the different textual
>> elements of a text document than of HTML, perhaps because HTML has so
>> many possible layouts, whereas text is a pretty simple thing to parse
>> out and identify where a table is or where a section or subsection is...
>>
>> Does that make sense regarding the "well formatted"?
>>
>> Alex
>>
>>
>> On Tue, Feb 1, 2011 at 9:54 AM, mozer <xmlizer@gmail.com> wrote:
>> > Oops, read too fast: I read "well formed".
>> >
>> > What do you mean by well formatted text representation?
>> >
>> > Xmlizer
>> >
>> > On Tue, Feb 1, 2011 at 10:53 AM, mozer <xmlizer@gmail.com> wrote:
>> >> p:unescape-markup
>> >> or
>> >> p:http-request should do that
>> >>
>> >> Xmlizer
>> >>
>> >> On Tue, Feb 1, 2011 at 10:49 AM, Alex Muir <alex.g.muir@gmail.com>
>> wrote:
>> >>> Hi,
>> >>>
>> >>> I'm interested to have a step in a pipeline that converts HTML to a
>> >>> well formatted text representation.
>> >>>
>> >>> Are there any open source tools that do that that fit into xproc?
>> >>>
>> >>> Thanks
>> >>>
>> >>> --
>> >>> Alex
>> >>> -----
>> >>> Currently:
>> >>> Freelance Software Engineer 6+ yrs exp
>> >>>
>> >>> Previously:
>> >>> https://sites.google.com/a/utg.edu.gm/alex/
>> >>>
>> >>>
>> >>> A Bafila, is two rivers flowing together as one:
>> >>> http://www.facebook.com/pages/Bafila/125611807494851
>> >>>
>> >>>
>> >>
>> >
>>
>>
>>
>> --
>> Alex
>> -----
>> Currently:
>> Freelance Software Engineer 6+ yrs exp
>>
>> Previously:
>> https://sites.google.com/a/utg.edu.gm/alex/
>>
>>
>> A Bafila, is two rivers flowing together as one:
>> http://www.facebook.com/pages/Bafila/125611807494851
>
>
>

--
Alex
-----
Currently:
Freelance Software Engineer 6+ yrs exp

Previously:
https://sites.google.com/a/utg.edu.gm/alex/


A Bafila, is two rivers flowing together as one:
http://www.facebook.com/pages/Bafila/125611807494851
Received on Tuesday, 1 February 2011 12:14:09 UTC
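
For anyone wanting to script the w3m approach mentioned in this message rather than run it by hand, the following is a minimal sketch, assuming w3m is installed and on PATH. The function names build_w3m_command and html_to_text are hypothetical and do not come from the thread. It relies on w3m's -dump, -T and -cols options to render HTML read from standard input as plain text, which should be roughly equivalent to the "w3m input.html > out.txt" command above.

```python
import shutil
import subprocess


def build_w3m_command(columns=80):
    """Return the argv list for rendering HTML from stdin as plain text.

    Hypothetical helper, not part of the original thread.
    """
    # -dump: write the rendered page to stdout instead of opening the pager
    # -T text/html: treat the input on stdin as HTML
    # -cols N: wrap the rendered text at N columns
    return ["w3m", "-dump", "-T", "text/html", "-cols", str(columns)]


def html_to_text(html, columns=80):
    """Render an HTML string as formatted plain text by piping it through w3m."""
    if shutil.which("w3m") is None:
        # Assumption: w3m must be installed separately; fail with a clear message.
        raise RuntimeError("w3m not found on PATH")
    completed = subprocess.run(
        build_w3m_command(columns),
        input=html,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if w3m exits with an error
    )
    return completed.stdout
```

A call such as html_to_text(open("input.html").read()) would return the rendered text as a string, which could then be written to out.txt. The same command line could presumably be handed to p:exec, as suggested earlier in the thread, but that wiring is not shown here.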