- From: Alex Muir <alex.g.muir@gmail.com>
- Date: Thu, 3 Feb 2011 18:55:56 +0000
- To: vojtech.toman@emc.com
- Cc: xproc-dev@w3.org
- Message-ID: <AANLkTikFVG1fmEZeOaEy0nx1PZqo5nR7g2X6Nk8zkC5L@mail.gmail.com>
I've tried implementing this using a p:exec call to w3m within my pipeline, but without success.

The HTML text is first converted to XHTML to serve as input for w3m:

    <p:viewport match="/Document/html/content/chunk">
      <p:unescape-markup>
        <p:with-option name="content-type" select="'text/html'"/>
      </p:unescape-markup>
    </p:viewport>

resulting in...

    <chunk>
      <html xmlns:html="http://www.w3.org/1999/xhtml">
        <head>
        ....

Then, in theory, w3m would convert the XHTML to text. At the moment, however, it does nothing: the HTML content is left unchanged. I don't get any error messages; the process just keeps running with the following:

    <p:viewport match="/Documen/html/content/chunk/html">
      <p:exec command="/usr/libexec/w3m" source-is-xml="true"
              result-is-xml="false" wrap-result-lines="false">
      </p:exec>
    </p:viewport>

I tinkered with paths and other options as well after reading

http://xprocbook.com/book/refentry-16.html

but the result is always the same: no change to the content.

Anyway, I don't really know whether this should work at all in the first place, but it seems possible...

Any ideas?

Thanks much
Alex

On Tue, Feb 1, 2011 at 12:13 PM, Alex Muir <alex.g.muir@gmail.com> wrote:
> Ah well that led me to think about lynx, w3m...
>
> w3m input.html > out.txt
>
> w3m does the job well.
>
> Regards
> Alex
>
>
> On Tue, Feb 1, 2011 at 11:24 AM, <vojtech.toman@emc.com> wrote:
> > If you are in a *nix environment, you can also try using p:exec in
> > combination with the lesspipe.sh script (a preprocessor filter for less
> > capable of basic HTML "rendering"). But maybe I have completely
> > misunderstood your requirement.
> >
> > Regards,
> > Vojtech
> >
> > --
> > Vojtech Toman
> > Consultant Software Engineer
> > EMC | Information Intelligence Group
> > vojtech.toman@emc.com
> > http://developer.emc.com/xmltech
> >
> >
> >> -----Original Message-----
> >> From: xproc-dev-request@w3.org [mailto:xproc-dev-request@w3.org] On
> >> Behalf Of Alex Muir
> >> Sent: Tuesday, February 01, 2011 11:12 AM
> >> To: mozer
> >> Cc: XProc Dev
> >> Subject: Re: Are there any open source tools that work with xproc that
> >> convert html to well formatted text?
> >>
> >> I'm working with some html documents the style of which looks like say
> >> a straight forward word document which when I tried saving as text
> >> from firefox looked a lot like the HTML version in terms of the
> >> spacing of the text content, except some tables which were garbage.
> >> So a subsection in the HTML was still easily determined to be a
> >> subsection in the text because the presentational formatting specified
> >> in the HTML was preserved in the text output.
> >>
> >> I've found more success thus far identifying the different textual
> >> elements of a text document than HTML perhaps because HTML has so many
> >> possibilities of layouts whereas text is pretty simple thing to parse
> >> out and identify where a table is or where a section, subsection is...
> >>
> >> Does that make sense regarding the well formatted?
> >>
> >> Alex
> >>
> >>
> >> On Tue, Feb 1, 2011 at 9:54 AM, mozer <xmlizer@gmail.com> wrote:
> >> > oups read too fast : I read "well formed"
> >> >
> >> > What do you mean by well formatted text representation ?
> >> >
> >> > Xmlizer
> >> >
> >> > On Tue, Feb 1, 2011 at 10:53 AM, mozer <xmlizer@gmail.com> wrote:
> >> >> p:unescape-markup
> >> >> or
> >> >> p:http-request should do that
> >> >>
> >> >> Xmlizer
> >> >>
> >> >> On Tue, Feb 1, 2011 at 10:49 AM, Alex Muir <alex.g.muir@gmail.com> wrote:
> >> >>> Hi,
> >> >>>
> >> >>> I'm interested to have a step in a pipeline that converts HTML to a
> >> >>> well formatted text representation.
> >> >>>
> >> >>> Are there any open source tools that do that that fit into xproc?
> >> >>>
> >> >>> Thanks
> >> >>>
> >> >>> --
> >> >>> Alex
> >> >>> -----
> >> >>> Currently:
> >> >>> Freelance Software Engineer 6+ yrs exp
> >> >>>
> >> >>> Previously:
> >> >>> https://sites.google.com/a/utg.edu.gm/alex/
> >> >>>
> >> >>>
> >> >>> A Bafila, is two rivers flowing together as one:
> >> >>> http://www.facebook.com/pages/Bafila/125611807494851
> >> >>>
> >> >>
> >> >
> >>
> >> --
> >> Alex
> >
>
> --
> Alex

--
Alex
Received on Thursday, 3 February 2011 18:56:29 UTC
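
For reference, a minimal sketch of how the p:exec step described in the message might be written so that w3m reads the markup from standard input and writes plain text to standard output. Everything marked as an assumption is not from the original message: the binary location /usr/bin/w3m, the h: prefix binding, and the choice of the w3m options -dump (print the rendered page and exit) and -T text/html (treat the input as HTML). Two things in the quoted pipeline are worth checking, though neither is confirmed as the cause here: the second match pattern begins with /Documen rather than /Document, and a viewport whose pattern selects nothing passes its input through unchanged; and if p:unescape-markup places the parsed elements in the XHTML namespace, an unprefixed html step in the pattern will not select them.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:h="http://www.w3.org/1999/xhtml"
                version="1.0">
  <p:input port="source"/>
  <p:output port="result"/>

  <!-- Assumption: the unescaped markup is in the XHTML namespace, so the
       last step of the pattern uses the h: prefix. Use a plain "html" step
       instead if the elements turn out to be in no namespace. -->
  <p:viewport match="/Document/html/content/chunk/h:html">
    <!-- Assumption: w3m is installed at /usr/bin/w3m. The args value is
         split on spaces by default; -dump and -T text/html are assumed to
         be the options wanted for non-interactive HTML-to-text output. -->
    <p:exec command="/usr/bin/w3m"
            args="-dump -T text/html"
            source-is-xml="true"
            result-is-xml="false"
            wrap-result-lines="false"/>
  </p:viewport>
</p:declare-step>
```

With result-is-xml="false", the text produced by the command arrives wrapped in a c:result element, so each matched html element would be replaced by a c:result holding the plain-text rendering rather than by bare text.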