
Re: What tool is recommended to convert pdf to html

From: Conal Tuohy <conal.tuohy@versi.edu.au>
Date: Tue, 26 Jul 2011 10:13:38 +1000
Message-ID: <4E2E06B2.5010607@versi.edu.au>
To: Alex Muir <alex.g.muir@gmail.com>
CC: Geert Josten <geert.josten@daidalos.nl>, XProc Dev <xproc-dev@w3.org>
I've also used the pdf2html utility and I fully agree with Geert: it was 
the best tool I found, and for an XProc developer it operates at the 
right level: it does a basic conversion from PDF to XML, and anything 
else (such as guessing the text structure from the positional 
information) you have to do yourself.

In my case I was using it to generate TEI from PDF files (with the TEI 
"facsimile" markup to record the positional information), and from that 
I output HTML (using CSS absolute positioning to reproduce a 
faithful-looking rendition of the original PDF).
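
To make the positioning step concrete, here is a minimal Python sketch 
that goes straight from positional XML to absolutely positioned HTML, 
skipping the intermediate TEI. The input format is an assumption made for 
illustration (a "page" element containing "text" elements with "top" and 
"left" attributes in pixels); it is not necessarily what pdf2html emits, 
and the function name page_to_html is hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical input: element and attribute names are assumptions,
# not the documented output of any particular converter.
SAMPLE_XML = """<page number="1" width="600" height="800">
  <text top="50" left="72" font="0">Annual Report</text>
  <text top="90" left="72" font="1">Revenue grew &amp; costs fell.</text>
</page>"""


def page_to_html(page_xml: str) -> str:
    """Render one page of positional XML as absolutely positioned HTML."""
    page = ET.fromstring(page_xml)
    # The page container is the positioning context for its text runs.
    lines = [
        '<div class="page" style="position:relative;width:{}px;height:{}px">'.format(
            int(page.get("width")), int(page.get("height"))
        )
    ]
    for text in page.iter("text"):
        # Flatten any inline markup to plain text and re-escape it for HTML.
        content = escape("".join(text.itertext()))
        lines.append(
            '  <span style="position:absolute;top:{}px;left:{}px">{}</span>'.format(
                int(text.get("top")), int(text.get("left")), content
            )
        )
    lines.append("</div>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(page_to_html(SAMPLE_XML))
```

Each text run becomes a span placed at its recorded coordinates inside a 
relatively positioned page container, so the layout follows the PDF 
without any attempt to guess paragraphs or columns.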

NB something Geert didn't mention is that pdf2html also gives you font 
information, and this can also be used to infer semantic markup such as 
headings.
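
As a rough illustration of that kind of inference, the following Python 
sketch treats the font size that carries the most characters as body text 
and labels anything noticeably larger as a heading. Everything here is an 
assumption for the sake of the example: the "fontspec" and "text" element 
names, the "font" and "size" attributes, the classify_lines function, and 
the 1.3 size ratio are all hypothetical, and real documents will need 
rules tuned to their own content.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical input: font declarations plus text runs that refer to them.
SAMPLE_XML = """<page number="1" width="600" height="800">
  <fontspec id="0" size="18" family="Times"/>
  <fontspec id="1" size="10" family="Times"/>
  <text top="50" left="72" font="0">Introduction</text>
  <text top="90" left="72" font="1">PDF is a rather crude data format.</text>
  <text top="104" left="72" font="1">Structure has to be reconstructed.</text>
</page>"""


def classify_lines(page_xml: str, ratio: float = 1.3):
    """Label each text run 'heading' or 'para' based on its font size."""
    page = ET.fromstring(page_xml)
    # Map font id -> point size.
    sizes = {f.get("id"): float(f.get("size")) for f in page.iter("fontspec")}

    runs = []
    chars_per_size = Counter()
    for text in page.iter("text"):
        content = "".join(text.itertext()).strip()
        size = sizes[text.get("font")]
        chars_per_size[size] += len(content)
        runs.append((size, content))

    if not runs:
        return []

    # Assume the size used for the most characters is the body text size.
    body_size = chars_per_size.most_common(1)[0][0]
    return [
        ("heading" if size >= body_size * ratio else "para", content)
        for size, content in runs
    ]


if __name__ == "__main__":
    for label, content in classify_lines(SAMPLE_XML):
        print(label, content)
```

Counting characters rather than runs keeps a page with many short 
headings from being mistaken for one set entirely in the heading font.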

Con

On 26/07/11 02:15, Geert Josten wrote:
>
> Hi Alex,
>
> If you are asking for my personal experiences: I've seen several 
> toolkits (Big Faceless, PDFlib, CambridgeDocs, ABBYY, iText, ..), but 
> most of them did too much, in the sense that, say, 80% was 
> reasonably accurate but the remainder ridiculously wrong. Textual 
> accuracy is always high as long as you are not working with image PDFs 
> (scanned documents), but once you ask such tools to recognize, for 
> instance, tables, they will try to find tables everywhere, even within 
> diagrams. :-/
>
> On one particular occasion I therefore decided to do the hard work 
> myself. If you know your content well enough, you can be much more 
> accurate with much less. I used the pretty blunt tool 'pdf2html' (just 
> a very simple command-line tool) to extract tabular data, among other 
> things. I wrote a small blog article about it a few months ago: 
> http://grtjn.blogspot.com/2011/05/pdf-to-xml-conversion-with-xslt-20.html
>
> (sorry for the self-promotion ;-)
>
> Kind regards,
>
> Geert
>
> *From:* Alex Muir [mailto:alex.g.muir@gmail.com]
> *Sent:* Monday 25 July 2011 17:39
> *To:* Geert Josten
> *CC:* XProc Dev
> *Subject:* Re: What tool is recommended to convert pdf to html
>
> Hi Geert,
>
> Yeah all those things other than images would be useful. Generally 
> speaking it would be nice to have the text aligned the same way it's 
> found in the pdf file. Any tools good at that?
>
> Alex
>
> On Mon, Jul 25, 2011 at 2:13 PM, Geert Josten 
> <geert.josten@daidalos.nl> wrote:
>
> Hi Alex,
>
> Well, does formatting include tables? Images, or images of statistics? 
> Footnotes, headers, footers? Margin notes? PDF is a rather crude 
> data format. Unless it is well annotated you will likely have to 
> reconstruct things like that yourself. More advanced toolkits will try 
> to help you, but not uncommonly they take you from bad to worse.. ;-)
>
> Kind regards,
>
> Geert
>
> *From:* Alex Muir [mailto:alex.g.muir@gmail.com]
> *Sent:* Monday 25 July 2011 16:03
> *To:* Geert Josten
> *CC:* XProc Dev
> *Subject:* Re: What tool is recommended to convert pdf to html
>
> Hi Geert,
>
> Well it just has to preserve the formatting and text. Even if the 
> formatting can be preserved in a text output that is also good. For 
> what I'm doing the format of the content can be just as important as 
> the content.
>
> I'll take a look at OCR, and I'm looking at pdfbox (thanks James).
>
> Thanks
> Alex
>
> On Mon, Jul 25, 2011 at 1:55 PM, Geert Josten 
> <geert.josten@daidalos.nl> wrote:
>
> Hi Alex,
>
> Bit off-topic, but what the heck.. How detailed does the conversion 
> need to be? There are literally hundreds of tools, but they suit 
> various purposes. You could look in the area of OCR and 
> closely related tools to extract high detail, but there are also 
> plenty of tools that do text extraction just for searching purposes.
>
> Kind regards,
>
> Geert
>
> *From:* xproc-dev-request@w3.org 
> [mailto:xproc-dev-request@w3.org] 
> *On behalf of* Alex Muir
> *Sent:* Monday 25 July 2011 15:45
> *To:* XProc Dev
> *Subject:* What tool is recommended to convert pdf to html
>
> Hi,
>
> I'm wondering what tool would be recommended to convert PDF to HTML or 
> XML effectively, for a process that converts a whole bunch of PDFs.
>
> Regards
>
>
> -- 
>
> Alex Muir
> Instructor | Program Organizer - University Technology Student Work 
> Experience Building
> University of the Gambia
> http://sites.utg.edu.gm/alex/ 
> <https://sites.google.com/a/utg.edu.gm/utsweb/>
>
> Low budget software development benefiting development in the Gambia, 
> West Africa
> Experience of a lifetime, come to Gambia and Join UTSWEB - 
> http://sites.utg.edu.gm/utsweb/ 
> <https://sites.google.com/a/utg.edu.gm/utsweb/>


-- 
Conal Tuohy
eResearch Business Analyst
Victorian eResearch Strategic Initiative
+61-466324297
Received on Tuesday, 26 July 2011 00:14:17 GMT
