- From: David Woolley <forums@david-woolley.me.uk>
- Date: Tue, 27 Mar 2012 07:44:49 +0100
- To: "w3c-wai-ig@w3.org" <w3c-wai-ig@w3.org>

Ginger Claassen wrote:
>
> Where we just talk about PDF files - maybe you can help me with some
> small but annoying thing. Sometimes I have PDF files with pictures
> inside which cannot be recognized by OmniPage since the picture format
> is not recognized. Does anyone here has an idea what kind of pictures
> OmniPage can or cannot recognize in PDF files?
>

I'm assuming you are talking about image-only PDFs. If someone did a DIY job, using free tools, they might have stored the images with DCT (JPEG-like) compression. That is not a sensible format for scans of text, so an OCR program might either not be prepared to handle it, or might have trouble with it.

For documents that are already in revisable form and have been authored with appropriate tools, PDFs should be using vector formats for diagrams and charts. These are more semantic than bitmaps; in particular, text is stored as text, not as images. However, an OCR program might not like them. They are underused these days because people don't seem to understand that they are possible, possibly because, until recently, there have been no vector formats widely supported by web browsers.

If you really must put vector images through OCR, you can use Ghostscript to render them as bitmaps, but you will lose a lot of information in the process.

-- 
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam, that is no longer good advice, as archive address hiding may not work.
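
The Ghostscript suggestion in the message could be sketched roughly as follows. This is only an illustration, not something from the original post: the helper names `ghostscript_args` and `rasterize`, the 300 dpi resolution, the `tiffg4` (bilevel TIFF) output device and the output file pattern are all assumptions chosen for the example, and the executable is assumed to be installed as `gs` (on Windows it is usually named differently, for example `gswin64c`).

```python
import shutil
import subprocess


def ghostscript_args(pdf_path, out_pattern="page-%03d.tif", dpi=300, device="tiffg4"):
    """Build a Ghostscript command line that renders each PDF page to a bitmap.

    The defaults are illustrative assumptions: 300 dpi bilevel TIFF, one file
    per page, which is a common input for OCR but is not required by anything
    in the original message.
    """
    return [
        "gs",                            # assumed executable name
        "-dNOPAUSE", "-dBATCH",          # run non-interactively and exit when done
        "-dSAFER",                       # restrict file access while rendering
        f"-sDEVICE={device}",            # raster output device
        f"-r{dpi}",                      # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",   # %03d is replaced by the page number
        str(pdf_path),
    ]


def rasterize(pdf_path, **kwargs):
    """Run Ghostscript on pdf_path; raises if 'gs' is missing or the run fails."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript ('gs') was not found on PATH")
    subprocess.run(ghostscript_args(pdf_path, **kwargs), check=True)
```

Calling `rasterize("document.pdf")` would then write one TIFF per page into the current directory, ready to hand to the OCR program. As the message warns, the text and vector structure of the original are flattened to pixels in the process.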
Received on Tuesday, 27 March 2012 06:45:18 UTC