- From: Christophe Strobbe <christophe.strobbe@esat.kuleuven.be>
- Date: Mon, 12 Mar 2007 12:13:58 +0100
- To: w3c-wai-ig@w3.org
Hi David,

At 20:01 10/03/2007, David Woolley wrote:
>(...)
> > Option - TIFF Format". The PDF contains the text of the article in
> > the form of scanned images. There are no plain text or HTML-versions
>
>I believe the proper Adobe tools can produce an OCRed underlay for the
>scans. Can you confirm that none has been included. (Note that
>modern PDFs can be flagged as allowing access to the text for
>accessibility, but not for cut and paste.) Actually, most
>vaguely recently published journals are available as proper PDFs, so,
>if they are using scans, rather than PDF rendered to a bitmap, they
>may have very nobbled access to the originals.

I checked the document properties, which tell me that the "PDF producer" is not Adobe Acrobat but iText 1.3 (a free PDF library in Java; see <http://www.lowagie.com/iText/>). The security tab in the document properties says that printing, changing the document, content copying or extraction, and content extraction for accessibility are all allowed.

I ran two such PDF files through the accessibility checker in Adobe Acrobat Professional 7.0. For each page, it says: "1 image(s) with no alternate text". The accessibility report also says that the document is not tagged and that there are 7 text blocks with no language specified. Searching for terms in the text yields no results at all.

After performing OCR, it was possible to search the text and to select spans of text. The accessibility report still says: "1 image(s) with no alternate text". So I assume that there was no text "behind" the images.

I couldn't find anything on the JSTOR site that said they use images because journal publishers won't let them publish the articles as electronic text. At <http://www.jstor.org/about/images.html> they say that they use images because it is their goal to produce faithful replications.
Best regards,

Christophe Strobbe

--
Christophe Strobbe
K.U.Leuven - Departement of Electrical Engineering - Research Group on Document Architectures
Kasteelpark Arenberg 10 - 3001 Leuven-Heverlee - BELGIUM
tel: +32 16 32 85 51
http://www.docarch.be/

Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm
Received on Monday, 12 March 2007 11:14:47 UTC