RE: Creating a PDF file with UTF-8 encoding through Servlet

From: Bruno Girin <Bruno.Girin@cambista.com>
Date: Tue, 24 May 2005 17:07:56 +0100
Message-ID: <2D25E0735620544FB83ABA9B07D6AC73093FAC@phaal.uk.cambista.net>
To: "Addison Phillips" <addison.phillips@quest.com>, <www-international@w3.org>
Addison, you're absolutely right. My mistake. But that does not explain how you can include non-Latin characters in a PDF file. Somewhere along the line, you need to specify the encoding of the characters inside the PDF file. I suppose that the PDF format has a way to specify the encoding of the characters inside the file, similar to what HTML does. So the original question from Sourav boils down to:

1. How do you specify a character encoding in a PDF file when you create it using PDFlib?
2. How do you produce content encoded with the encoding specified in the PDF file? I believe you can do this with the Java OutputStreamWriter class, as explained below.

The content type is then secondary information that tells the browser to fire Acrobat or any other reader when loading the PDF and should be "application/pdf".
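
A minimal sketch of point 2, assuming the text is turned into bytes with an OutputStreamWriter before it is handed to the PDF library. The PDFlib call that would consume these bytes is not shown because it is not given in this thread, and the class and method names below are illustrative only. UTF-8 and UTF-16BE are used simply because every Java runtime is required to support them.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFlib or the servlet API.
class TextEncodingSketch {

    /** Encodes text into bytes by writing it through an OutputStreamWriter. */
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // The Writer converts characters to bytes using the given charset.
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        // Closing the writer above flushed all buffered bytes into 'out'.
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String japanese = "\u65e5\u672c\u8a9e"; // three Japanese characters
        System.out.println("UTF-8 length:    "
                + encode(japanese, StandardCharsets.UTF_8).length);
        System.out.println("UTF-16BE length: "
                + encode(japanese, StandardCharsets.UTF_16BE).length);
    }
}
```

The same three characters produce different byte sequences under each encoding, which is why the encoding used when writing the text has to agree with whatever the PDF declares internally.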

For information, the solution we use in my company is to produce XSL-FO with the proper XML encoding specification and use Jakarta FOP to produce the PDF. It works just fine to produce Russian (or English) output, as long as you configure FOP properly so that it can find fonts that contain glyphs for all the characters present in the file.
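
A rough sketch of the first half of that approach: building a small XSL-FO document whose XML declaration matches the bytes actually written. The class name, the page master name and the font family passed in are assumptions made for illustration, and the FOP invocation and font configuration are left out entirely.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; it only builds the XSL-FO input, it does not call FOP.
class FoDocumentSketch {

    /** Escapes the characters that are special in XML text and attributes. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Builds a one-block XSL-FO document declared as UTF-8. */
    static String buildFo(String text, String fontFamily) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
             + "<fo:root xmlns:fo=\"http://www.w3.org/1999/XSL/Format\">\n"
             + "  <fo:layout-master-set>\n"
             + "    <fo:simple-page-master master-name=\"page\">\n"
             + "      <fo:region-body/>\n"
             + "    </fo:simple-page-master>\n"
             + "  </fo:layout-master-set>\n"
             + "  <fo:page-sequence master-reference=\"page\">\n"
             + "    <fo:flow flow-name=\"xsl-region-body\">\n"
             // The font must contain glyphs for every character in the text.
             + "      <fo:block font-family=\"" + escape(fontFamily) + "\">"
             + escape(text) + "</fo:block>\n"
             + "    </fo:flow>\n"
             + "  </fo:page-sequence>\n"
             + "</fo:root>\n";
    }

    /** Serializes the document with the same encoding its declaration names. */
    static byte[] toUtf8Bytes(String fo) {
        return fo.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "ExampleFont" is a placeholder, not a real font name.
        String fo = buildFo("\u041f\u0440\u0438\u0432\u0435\u0442", "ExampleFont");
        System.out.println(toUtf8Bytes(fo).length + " bytes of XSL-FO");
    }
}
```

The point of the sketch is only that the declared XML encoding and the encoding used to serialize the document are the same; the formatter then takes care of the encoding inside the PDF.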

Bruno Girin
Chief Technical Architect
Cambista Technologies Ltd


-----Original Message-----
From: Addison Phillips [mailto:addison.phillips@quest.com]
Sent: Tue 5/24/2005 4:01 PM
To: Bruno Girin; www-international@w3.org
Subject: RE: Creating a PDF file with UTF-8 encoding through Servlet
 
No, that's not right. The PDF file is a binary file. The text *INSIDE* the file (i.e. the text being encoded by the PDF library) has an encoding. But PDF files themselves do not have or need a charset parameter. Putting a charset parameter on a Content-Type of "application/*" is just silly.

Your browser does not read the text in a PDF. It calls the Acrobat plug-in, which reads the Acrobat file.
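
To illustrate why a binary file should not be pushed through a character encoder, here is a small sketch that writes the same bytes once directly and once through a UTF-8 Writer. The sample bytes are an assumption: they imitate the start of a typical PDF (a header line followed by a comment containing four bytes above 0x7F), and the class and method names are made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; no PDF library or servlet API is involved.
class BinaryThroughWriterDemo {

    // "%PDF-1.4", newline, then "%" plus four bytes >= 0x80, newline.
    static final byte[] SAMPLE = {
        '%', 'P', 'D', 'F', '-', '1', '.', '4', '\n',
        '%', (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3, '\n'
    };

    /** Writes the bytes untouched, as a binary stream should be written. */
    static byte[] writeRaw(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(data, 0, data.length);
        return out.toByteArray();
    }

    /** Treats each byte as a character and re-encodes it as UTF-8. */
    static byte[] writeThroughUtf8Writer(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            // ISO-8859-1 maps every byte to the character with the same code.
            writer.write(new String(data, StandardCharsets.ISO_8859_1));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = writeRaw(SAMPLE);
        byte[] viaWriter = writeThroughUtf8Writer(SAMPLE);
        System.out.println("raw identical:        " + Arrays.equals(raw, SAMPLE));
        System.out.println("via Writer identical: " + Arrays.equals(viaWriter, SAMPLE));
    }
}
```

Each byte above 0x7F becomes two bytes once it passes through the UTF-8 encoder, so the stream written through the Writer no longer matches the original file.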

Addison

Addison P. Phillips
Globalization Architect, Quest Software
Chair, W3C Internationalization Core Working Group

Internationalization is not a feature.
It is an architecture. 

> -----Original Message-----
> From: www-international-request@w3.org [mailto:www-international-
> request@w3.org] On Behalf Of Bruno Girin
> Sent: May 24, 2005 4:34
> To: www-international@w3.org
> Subject: FW: Creating a PDF file with UTF-8 encoding through Servlet
> 
> Sorry, sent this message to Khurram only, not the list.
> 
> 
> -----Original Message-----
> From: Bruno Girin
> Sent: Tue 5/24/2005 11:39 AM
> To: Khurram Ilyas
> Subject: RE: Creating a PDF file with UTF-8 encoding through Servlet
> 
> Addison, that's the whole point of Sourav's question: a PDF file is a binary
> file that contains text data. As a consequence, you need to specify the
> encoding of the text data so that the computer that will read the PDF can
> properly read the binary stream and translate it into the correct
> characters to display.
> 
> To achieve this, you need 3 things:
> 1. the servlet needs to encode the binary stream using an encoding that is
> able to encode the totality of the character set used in the document. If
> it is Japanese, the best encoding is probably UTF-8.
> 2. the servlet needs to specify that same encoding in the content type
> 3. the PDF file presumably needs to contain encoding data so that the file
> can be re-read by a PDF viewer independently of the download
> 
> To do 1, you need to enclose the output stream into an OutputStreamWriter
> that specifies the encoding, such as:
> // out being the output stream obtained in Sourav's step 2
> Writer wout = new OutputStreamWriter(out, "UTF-8");
> then you call wout.write() and other Writer methods
> 
> To do 2, you just specify the encoding as part of the content type:
> response.setContentType("application/pdf; charset=utf-8");
> 
> 3 is dependent on the API you're using to create your PDF file. I don't
> know PDFlib so can't tell you what the call is.
> 
> Good luck with this.
> 
> Bruno Girin
> Chief Technical Architect
> Cambista Technologies Ltd
> 
> 
> -----Original Message-----
> From: www-international-request@w3.org on behalf of Khurram Ilyas
> Sent: Fri 5/20/2005 11:04 PM
> To: addison.phillips@quest.com; SOURAVM@infosys.com; www-
> international@w3.org
> Subject: RE: Creating a PDF file with UTF-8 encoding through Servlet
> 
> Instead of
> 
> response.setContentType("application/pdf");
> 
> try
> 
> response.setContentType("application/download");
> 
> Best Regards,
> Khurram Ilyas
> 
> 
> 
> 
> >From: "Addison Phillips" <addison.phillips@quest.com>
> >To: "souravm" <SOURAVM@infosys.com>,<www-international@w3.org>
> >Subject: RE: Creating a PDF file with UTF-8 encoding through Servlet
> >Date: Fri, 20 May 2005 09:14:10 -0700
> >
> >
> >PDF files are binary, not text, objects.
> >
> >Addison
> >
> >Addison P. Phillips
> >Globalization Architect, Quest Software
> >Chair, W3C Internationalization Core Working Group
> >
> >Internationalization is not a feature.
> >It is an architecture.
> >
> > > -----Original Message-----
> > > From: www-international-request@w3.org [mailto:www-international-
> > > request@w3.org] On Behalf Of souravm
> > > Sent: May 20, 2005 6:13
> > > To: www-international@w3.org
> > > Subject: Creating a PDF file with UTF-8 encoding through Servlet
> > >
> > >
> > > Hi All,
> > >
> > > I need to create and return a PDF file from a Servlet as a response to
> > > an HTTP request (typical download functionality).
> > >
> > > Now for this purpose I'm -
> > >
> > > 1. First setting the following fields in the response object -
> > > response.setContentType("application/pdf");
> > > response.setHeader("Pragma", "");
> > > response.setHeader("Cache-Control", "");
> > > response.setDateHeader("Expires", 0);
> > >
> > > 2. After that I'm creating an OutputStream object from the response
> > > object.
> > >
> > > 3. Using that OutputStream object I'm writing the content of the PDF
> > > file (using APIs of PDFlib), with PDFDocument.open(OutputStream) to
> > > create the document object.
> > >
> > > 4. After writing the content of the PDF I'm closing the PDF file
> > > (PDFDocument.close()).
> > >
> > > In this context, I'd like to know: don't I need to specify the
> > > encoding of the PDF document through the setContentType API? Say I'm
> > > creating a PDF file with Japanese content and I want the encoding of
> > > the file to be Shift_JIS.
> > >
> > > Any pointer/information on this would be highly appreciated.
> > >
> > > Regards,
> > > Sourav
> > >
> > >
> >
> >
> >
> 
> 
> 
> 
> 
> 
> _____________________________________________________________________
> This e-mail and attachments has been scanned for viruses. Please email
> virus@cambista.net if you have detected a virus in this mail.




Received on Tuesday, 24 May 2005 16:07:14 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Tuesday, 2 June 2009 19:17:05 GMT