Re: Removing PDFs and accessibility from David Woolley on 2012-03-27 (w3c-wai-ig@w3.org from January to March 2012)

From: David Woolley <forums@david-woolley.me.uk>
Date: Tue, 27 Mar 2012 07:44:49 +0100
To: "w3c-wai-ig@w3.org" <w3c-wai-ig@w3.org>
Message-ID: <4F7161E1.8080609@david-woolley.me.uk>

Ginger Claassen wrote:
> 
> Where we just talk about PDF files - maybe you can help me with some 
> small but annoying thing. Sometimes I have PDF files with pictures 
> inside which cannot be recognized by OmniPage since the picture format 
> is not recognized. Does anyone here have an idea what kind of pictures 
> OmniPage can or cannot recognize in PDF files?
> 

I'm assuming you are talking about image-only PDFs.  If someone did a 
DIY job, using free tools, they might have stored the page images with 
DCT (JPEG-like) compression.  That is a lossy format and not a sensible 
one for scans of text, so an OCR program might either not be prepared 
to handle it, or might have trouble with it.
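
As a rough way to check which case you have, the sketch below counts 
the image filter names that appear in the raw bytes of a PDF.  It is 
only a heuristic: the function name is my own, and a count of zero 
proves nothing, because newer PDFs can keep their object dictionaries 
inside compressed object streams, where a plain byte search will not 
see them.

```python
from pathlib import Path

# Standard PDF stream filter names that are typically used for images.
# /DCTDecode is the JPEG-like one; /CCITTFaxDecode and /JBIG2Decode are
# the bilevel formats normally used for scanned text.
IMAGE_FILTERS = (b"/DCTDecode", b"/CCITTFaxDecode", b"/JBIG2Decode", b"/JPXDecode")


def count_filter_markers(pdf_bytes: bytes) -> dict:
    """Count occurrences of each image filter name in raw PDF bytes.

    Heuristic only: names hidden inside compressed object streams are
    not found, so low counts do not rule a filter out.
    """
    return {name.decode("ascii"): pdf_bytes.count(name) for name in IMAGE_FILTERS}


# Hypothetical usage, with "scan.pdf" standing in for your own file:
#   print(count_filter_markers(Path("scan.pdf").read_bytes()))
```

If /DCTDecode dominates and the pages are scans of text, the 
compression is a plausible suspect, though not a certain one.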

For documents that are already in revisable form and have been authored 
with appropriate tools, PDFs should be using vector formats for diagrams 
and charts.  These are more semantic than bitmaps; in particular, text is 
stored as text, not as images.  However, an OCR program might not be 
able to process them.

They are underused these days because people don't seem to realise that 
they are an option, possibly because, until recently, no vector format 
was widely supported by web browsers.

If you really must put vector images through OCR, you can use 
Ghostscript to render them as bitmaps, but you will lose a lot of 
information in the process.
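
A minimal sketch of that rendering step, assuming Ghostscript is 
installed and on the path as "gs" (on Windows the console executable 
usually has a different name, such as gswin64c).  The function names, 
the output pattern and the 300 dpi default are my own choices for 
illustration; the switches themselves are standard Ghostscript ones.

```python
import subprocess


def ghostscript_render_command(pdf_path, out_pattern="page-%03d.png",
                               dpi=300, device="png16m"):
    """Build the argument list that rasterises each PDF page to a bitmap.

    device: a Ghostscript output device, e.g. "png16m" (24-bit colour),
            "pnggray" (greyscale) or "tiffg4" (bilevel fax-style TIFF).
    out_pattern: %03d is expanded by Ghostscript to the page number.
    """
    return [
        "gs",
        "-dSAFER",      # restrict file access while interpreting the PDF
        "-dBATCH",      # exit when the input has been processed
        "-dNOPAUSE",    # do not wait for a keypress between pages
        f"-sDEVICE={device}",
        f"-r{dpi}",     # resolution in dots per inch
        f"-sOutputFile={out_pattern}",
        str(pdf_path),
    ]


def render(pdf_path, **options):
    """Run Ghostscript; raises CalledProcessError if it fails."""
    subprocess.run(ghostscript_render_command(pdf_path, **options), check=True)


# Hypothetical usage, with "chart.pdf" standing in for your own file:
#   render("chart.pdf", device="pnggray", dpi=300)
```

The resulting page images can then be fed to the OCR program, bearing 
in mind that the text and structure of the vector original are gone 
from them.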

-- 
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.

Received on Tuesday, 27 March 2012 06:45:18 UTC