Re: Unix cmd line utility for Multibyte PDF -> Text

From: cstrobbe <Christophe.Strobbe@esat.kuleuven.be>
Date: Fri, 29 Sep 2006 07:04:52 +0200
Message-ID: <1159506292.451ca97431235@webmail2.kuleuven.be>
To: www-international@w3.org

Hi Michael,


Quoting Michael Monaghan <Michael.Monaghan@Sun.COM>:

> 
> Hi,
> 
> I need a pdf -> text command line utility for Unix/Solaris that
> won't corrupt non-ASCII characters.


A few years ago I used PDFBox (http://www.pdfbox.org/), a Java PDF
library, to extract text from PDF files. As far as I remember, it also
handled non-ASCII characters correctly.
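
Whichever extractor you end up with, one place where non-ASCII text
often gets damaged is the final write to disk, when the platform
default encoding is used instead of an explicit one. Below is a minimal
sketch of that step only, assuming the extracted text is already
available as a Java String. The class and method names are made up for
illustration, and the PDFBox call mentioned in the comment refers to the
later Apache PDFBox 2.x API, not necessarily the version I used back
then.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of PDFBox.
final class Utf8TextOutput {

    // Write the text as UTF-8 bytes, independent of the platform default encoding.
    static void writeUtf8(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back, decoding the bytes as UTF-8.
    static String readUtf8(Path source) throws IOException {
        return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for extracted text. With Apache PDFBox 2.x (an assumption
        // here) it would come from something like
        // new PDFTextStripper().getText(document).
        String extracted = "\u00dcn\u00efc\u00f6d\u00e9 \u65e5\u672c\u8a9e";

        Path out = Files.createTempFile("extracted", ".txt");
        writeUtf8(out, extracted);

        // Round-trip check: the text read back should equal what was written.
        System.out.println(extracted.equals(readUtf8(out)));
    }
}
```

The sample string uses Unicode escapes so that the source file itself
stays pure ASCII and compiles the same way regardless of the compiler's
default source encoding.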

Best regards,

Christophe

-- 
Christophe Strobbe
K.U.Leuven - Department of Electrical Engineering - Research Group on 
Document Architectures
Kasteelpark Arenberg 10 - 3001 Leuven-Heverlee - BELGIUM
tel: +32 16 32 85 51
http://www.docarch.be/ 

Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm
