Re: What tool is recommended to convert pdf to html

From: Alex Muir <alex.g.muir@gmail.com>
Date: Mon, 25 Jul 2011 15:38:52 +0000
Message-ID: <CAFtPEJYJJOLL7jY8u-=K3nFqM0gJMzGX4PSjteU-SB1Y60XZpg@mail.gmail.com>
To: Geert Josten <geert.josten@daidalos.nl>
Cc: XProc Dev <xproc-dev@w3.org>

Hi Geert,

Yeah, all those things other than images would be useful. Generally speaking,
it would be nice to have the text aligned the same way it's found in the PDF
file. Are any tools good at that?

Alex

On Mon, Jul 25, 2011 at 2:13 PM, Geert Josten <geert.josten@daidalos.nl> wrote:

> Hi Alex,
>
> Well, does formatting include tables? Images, or images of statistics? Footnotes,
> headers, footers? Margin notes? PDF is a rather crude data format. Unless
> well annotated, you will likely have to reconstruct things like that
> yourself. More advanced toolkits will try to help you, but not uncommonly
> things go from bad to worse.. ;-)
>
> Kind regards,
>
> Geert
>
> *From:* Alex Muir [mailto:alex.g.muir@gmail.com]
> *Sent:* Monday, 25 July 2011 16:03
> *To:* Geert Josten
> *Cc:* XProc Dev
> *Subject:* Re: What tool is recommended to convert pdf to html
>
> Hi Geert,
>
> Well it just has to preserve the formatting and text. Even if the
> formatting can be preserved in a text output that is also good. For what I'm
> doing the format of the content can be just as important as the content.
>
> I'll take a look at OCR, and I'm looking at pdfbox (thanks, James).
>
> Thanks
> Alex
>
> On Mon, Jul 25, 2011 at 1:55 PM, Geert Josten <geert.josten@daidalos.nl> wrote:
>
> Hi Alex,
>
> Bit off-topic, but what the heck.. How detailed does the conversion need to
> be? There are literally hundreds of tools, but they suit various purposes.
> You could look in the area of OCR and closely related tools to extract high
> detail, but there are also plenty of tools that do text extraction, just for
> searching purposes.
>
> Kind regards,
>
> Geert
>
> *From:* xproc-dev-request@w3.org [mailto:xproc-dev-request@w3.org] *On Behalf Of* Alex Muir
> *Sent:* Monday, 25 July 2011 15:45
> *To:* XProc Dev
> *Subject:* What tool is recommended to convert pdf to html
>
> Hi,
>
> I'm wondering what tool would be recommended to convert PDF to HTML or XML
> effectively, as part of a process to convert a whole bunch of PDFs.
>
> Regards
>
>
> --
>
> Alex Muir
> Instructor | Program Organizer - University Technology Student Work
> Experience Building
> University of the Gambia
> http://sites.utg.edu.gm/alex/<https://sites.google.com/a/utg.edu.gm/utsweb/>
>
> Low budget software development benefiting development in the Gambia, West
> Africa
> Experience of a lifetime, come to Gambia and Join UTSWEB -
> http://sites.utg.edu.gm/utsweb/<https://sites.google.com/a/utg.edu.gm/utsweb/>



-- 
Alex Muir
Instructor | Program Organizer - University Technology Student Work
Experience Building
University of the Gambia
http://sites.utg.edu.gm/alex/<https://sites.google.com/a/utg.edu.gm/utsweb/>

Low budget software development benefiting development in the Gambia, West
Africa
Experience of a lifetime, come to Gambia and Join UTSWEB -
http://sites.utg.edu.gm/utsweb/<https://sites.google.com/a/utg.edu.gm/utsweb/>
Received on Monday, 25 July 2011 15:39:19 GMT
