- From: Alex Muir <alex.g.muir@gmail.com>
- Date: Tue, 1 Feb 2011 10:12:24 +0000
- To: mozer <xmlizer@gmail.com>
- Cc: XProc Dev <xproc-dev@w3.org>
I'm working with some HTML documents styled much like a straightforward Word document. When I saved one as text from Firefox, the spacing of the text content looked a lot like the HTML version, except for some tables, which came out as garbage. A subsection in the HTML was still easy to identify as a subsection in the text, because the presentational formatting specified in the HTML was preserved in the text output.

So far I've had more success identifying the different textual elements in a text document than in HTML, perhaps because HTML allows so many possible layouts, whereas text is pretty simple to parse to find where a table, section, or subsection is.

Does that clarify what I mean by "well formatted"?

Alex

On Tue, Feb 1, 2011 at 9:54 AM, mozer <xmlizer@gmail.com> wrote:
> Oops, read too fast: I read "well formed".
>
> What do you mean by well formatted text representation?
>
> Xmlizer
>
> On Tue, Feb 1, 2011 at 10:53 AM, mozer <xmlizer@gmail.com> wrote:
>> p:unescape-markup
>> or
>> p:http-request should do that
>>
>> Xmlizer
>>
>> On Tue, Feb 1, 2011 at 10:49 AM, Alex Muir <alex.g.muir@gmail.com> wrote:
>>> Hi,
>>>
>>> I'm interested to have a step in a pipeline that converts HTML to a
>>> well formatted text representation.
>>>
>>> Are there any open source tools that do that that fit into xproc?
>>>
>>> Thanks
>>>
>>> --
>>> Alex
>>> -----
>>> Currently:
>>> Freelance Software Engineer 6+ yrs exp
>>>
>>> Previously:
>>> https://sites.google.com/a/utg.edu.gm/alex/
>>>
>>> A Bafila, is two rivers flowing together as one:
>>> http://www.facebook.com/pages/Bafila/125611807494851

--
Alex
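
To make "well formatted text" concrete, here is a minimal sketch of the kind of conversion described above, using only Python's standard `html.parser` module. It is not an XProc step and not a tool named in this thread. The names `TextExtractor`, `html_to_text`, `BLOCK_TAGS`, `CELL_TAGS` and `SKIP_TAGS` are hypothetical, and the choice of which tags start a new line is an assumption. A script like this might be called from a pipeline through the optional `p:exec` step, where the processor supports it.

```python
from html.parser import HTMLParser

# Assumption: these tags mark a line break in the text output.
BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "h4", "h5", "h6",
              "li", "ul", "ol", "table", "tr"}
# Assumption: table cells are separated by a single space.
CELL_TAGS = {"td", "th"}
# Content of these elements is never shown to a reader.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects text content, inserting newlines around block elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")
        elif tag in CELL_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return one line of text per non-empty block, whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = []
    for raw in "".join(parser.parts).split("\n"):
        line = " ".join(raw.split())
        if line:
            lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "<h1>Title</h1><p>First  para</p><p>Second &amp; last</p>"
    print(html_to_text(sample))
```

This sketch drops blank lines and flattens each table row onto one line, so it loses the visual spacing that Firefox's "save as text" keeps. Preserving section spacing or column alignment would need extra rules on top of this.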
Received on Tuesday, 1 February 2011 10:12:56 UTC