- From: Alex Muir <alex.g.muir@gmail.com>
- Date: Mon, 7 Feb 2011 16:32:06 +0000
- To: vojtech.toman@emc.com
- Cc: xproc-dev@w3.org
- Message-ID: <AANLkTi=5xe-jwey=1WLGDmUA7kr3p9B=r2xx3VU=ZMw_@mail.gmail.com>
Well, this is probably a more appropriate configuration, although it still does nothing in this case.

<p:viewport match="/Document/html/content/chunk/html">
  <p:exec command="/usr/bin/w3m"
          source-is-xml="true"
          result-is-xml="false"
          wrap-result-lines="true"
          args="'-dump'">
  </p:exec>
</p:viewport>

However, when I change the path to /Document/html/content/chunk, it does return the HTML document wrapped, although not the text version I'm looking for.

I'm looking further into the source code to figure it out.

Alex

On Thu, Feb 3, 2011 at 6:55 PM, Alex Muir <alex.g.muir@gmail.com> wrote:

> I've tried implementing this using a p:exec call to w3m within my pipeline,
> but without success.
>
> The HTML text is first converted to XHTML to serve as input for w3m:
>
> <p:viewport match="/Document/html/content/chunk">
>   <p:unescape-markup>
>     <p:with-option name="content-type" select="'text/html'"/>
>   </p:unescape-markup>
> </p:viewport>
>
> resulting in...
>
> <chunk>
>   <html xmlns:html="http://www.w3.org/1999/xhtml">
>     <head>
>     ....
>
> Then, in theory, w3m would convert the XHTML to text; however, at the
> moment it does nothing, as the HTML content is left unchanged. I don't get
> any error messages; the process just keeps running with the following:
>
> <p:viewport match="/Documen/html/content/chunk/html">
>   <p:exec command="/usr/libexec/w3m"
>           source-is-xml="true"
>           result-is-xml="false"
>           wrap-result-lines="false">
>   </p:exec>
> </p:viewport>
>
> I tinkered with paths and other options as well after reading
> http://xprocbook.com/book/refentry-16.html, but the result is always the
> same: no change to the content.
>
> Anyway, I don't really know if this should work at all in the first place,
> but it seems possible... Any ideas?
>
> Thanks much
> Alex
>
>
> On Tue, Feb 1, 2011 at 12:13 PM, Alex Muir <alex.g.muir@gmail.com> wrote:
>
>> Ah well, that led me to think about lynx, w3m...
>>
>> w3m input.html > out.txt
>>
>> w3m does the job well.
>>
>> Regards
>> Alex
>>
>>
>> On Tue, Feb 1, 2011 at 11:24 AM, <vojtech.toman@emc.com> wrote:
>> > If you are in a *nix environment, you can also try using p:exec in
>> > combination with the lesspipe.sh script (a preprocessor filter for less
>> > capable of basic HTML "rendering"). But maybe I have completely
>> > misunderstood your requirement.
>> >
>> > Regards,
>> > Vojtech
>> >
>> > --
>> > Vojtech Toman
>> > Consultant Software Engineer
>> > EMC | Information Intelligence Group
>> > vojtech.toman@emc.com
>> > http://developer.emc.com/xmltech
>> >
>> >
>> >> -----Original Message-----
>> >> From: xproc-dev-request@w3.org [mailto:xproc-dev-request@w3.org] On
>> >> Behalf Of Alex Muir
>> >> Sent: Tuesday, February 01, 2011 11:12 AM
>> >> To: mozer
>> >> Cc: XProc Dev
>> >> Subject: Re: Are there any open source tools that work with xproc that
>> >> convert html to well formatted text?
>> >>
>> >> I'm working with some HTML documents whose style looks like, say, a
>> >> straightforward Word document. When I tried saving one as text from
>> >> Firefox, the result looked a lot like the HTML version in terms of the
>> >> spacing of the text content, except for some tables, which were garbage.
>> >> So a subsection in the HTML was still easily recognized as a
>> >> subsection in the text, because the presentational formatting specified
>> >> in the HTML was preserved in the text output.
>> >>
>> >> I've had more success so far identifying the different textual
>> >> elements of a text document than of HTML, perhaps because HTML has so
>> >> many possible layouts, whereas text is a pretty simple thing to parse
>> >> to identify where a table is or where a section or subsection is...
>> >>
>> >> Does that make sense regarding "well formatted"?
>> >>
>> >> Alex
>> >>
>> >>
>> >> On Tue, Feb 1, 2011 at 9:54 AM, mozer <xmlizer@gmail.com> wrote:
>> >> > Oops, read too fast: I read "well formed"
>> >> >
>> >> > What do you mean by well formatted text representation?
>> >> >
>> >> > Xmlizer
>> >> >
>> >> > On Tue, Feb 1, 2011 at 10:53 AM, mozer <xmlizer@gmail.com> wrote:
>> >> >> p:unescape-markup
>> >> >> or
>> >> >> p:http-request should do that
>> >> >>
>> >> >> Xmlizer
>> >> >>
>> >> >> On Tue, Feb 1, 2011 at 10:49 AM, Alex Muir <alex.g.muir@gmail.com> wrote:
>> >> >>> Hi,
>> >> >>>
>> >> >>> I'm interested to have a step in a pipeline that converts HTML to a
>> >> >>> well formatted text representation.
>> >> >>>
>> >> >>> Are there any open source tools that do that that fit into xproc?
>> >> >>>
>> >> >>> Thanks
>> >> >>>
>> >> >>> --
>> >> >>> Alex
>> >> >>> -----
>> >> >>> Currently:
>> >> >>> Freelance Software Engineer 6+ yrs exp
>> >> >>>
>> >> >>> Previously:
>> >> >>> https://sites.google.com/a/utg.edu.gm/alex/
>> >> >>>
>> >> >>> A Bafila, is two rivers flowing together as one:
>> >> >>> http://www.facebook.com/pages/Bafila/125611807494851

--
Alex
-----
Currently:
Freelance Software Engineer 6+ yrs exp
Previously:
https://sites.google.com/a/utg.edu.gm/alex/

A Bafila, is two rivers flowing together as one:
http://www.facebook.com/pages/Bafila/125611807494851
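
For reference, here is a minimal sketch of the p:exec step from the message above with the argument passed as a plain string. It rests on assumptions the thread does not confirm: that w3m lives at /usr/bin/w3m, that it receives the serialized document on standard input, and that `-T text/html` is needed so w3m treats that input as HTML. In XProc 1.0 an option written as an attribute is taken as a literal string, not an XPath expression, so `args="'-dump'"` would probably hand w3m the quote characters as well; the sketch drops them. It has not been run against this pipeline and may not be the whole fix.

```xml
<!-- Sketch only: wraps the viewport from the message in a complete step. -->
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <p:input port="source"/>
  <p:output port="result"/>

  <!-- Match path copied from the message; adjust it to the real document. -->
  <p:viewport match="/Document/html/content/chunk/html">
    <!-- Assumption: w3m reads HTML from stdin when given -T text/html.
         args is split on the default arg-separator, a single space. -->
    <p:exec command="/usr/bin/w3m"
            args="-dump -T text/html"
            source-is-xml="true"
            result-is-xml="false"
            wrap-result-lines="true"/>
  </p:viewport>
</p:declare-step>
```

If the step runs as intended, each matched html element should be replaced by a c:result element whose c:line children hold the lines of the text dump.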
Received on Monday, 7 February 2011 16:32:38 UTC