- From: Bruce Bailey <bbailey@clark.net>
- Date: Thu, 03 Sep 1998 11:24:32 -0400
- To: "w3c-wai-ig@w3.org" <w3c-wai-ig@w3.org>
- CC: Paul Stauffer 301-827-5694 FAX 301-443-6385 <STAUFFERP@cder.fda.gov>, Robert Neff <rcn@fenix2.dol-esa.gov>, "T. V. Raman" <raman@Adobe.COM>
Many thanks to Robert Neff and Paul Stauffer. I'll share the good stuff (links) first and then proceed with my pontificating!

Skeptic that I am, I have to admit that Paul has pointed me to a mainstream public site where PDFs are, in my humble opinion, good and appropriate. Some examples (and what happens with the Access Adobe translation):

http://www.fda.gov/cder/guidance/1326fnl.pdf : Started life as a word processing document and is pretty small compared to the other samples referenced. The text comes through perfectly. Most formatting is lost, although bold and italics are retained. The graphic and emphasis from the title page are lost.

http://www.fda.gov/cder/guidance/1716dft.pdf : Also originally a word processing document, but this one contains formulas written using built-in tools (as opposed to being pasted in as a graphic). These equations are handled in a consistent fashion, but too much information is lost for them to be usable.

http://www.fda.gov/cder/foi/nda/97/020184ap.pdf : A good example of a composite document; it includes text of varying quality, graphical formulas, and attachments of typed-in forms. The whole thing has been run through a decent OCR process, but with predictable results. I would guess it is about 95% accurate. (Those who use OCR daily will tell you that this level of accuracy is not acceptable.) The formulas are total gibberish, and misspelled words and artifacts abound.

http://www.fda.gov/cder/guidance/old098fn.pdf : An older document that has not been run through OCR. The translation shows only the page numbers! No warning messages of any kind are generated.

It is interesting to note that all of the above look similar in an Acrobat Reader window! The casual browser would have no way of knowing which is an image and which has text. A word search returns only "not found", even when the document contains no searchable text at all!

It was interesting to me that Access Adobe returned translations to me faster than Acrobat Reader could display the originals.
This is not too surprising given that both Adobe and the FDA have T3 connections to the internet and I am using dial-up! Were the translations better, no doubt the service would be totally overwhelmed by home users. I postulate that it is in Adobe's interest to keep this free service mediocre! The curious might wish to try pasting the above links through their simple form (http://access.adobe.com/simple_form.html).

The Food and Drug Administration basically has to choose between providing copious PDFs or trickling out HTML. The documents are coming from a variety of sources (including paper).

This is the kind of devil-in-the-details choice disability rights advocates are regularly faced with. Do we wage the impossible war against the system (and thereby stay true to our ideals, but in the meantime accomplish little), or work from within the system to effect change (and in the meantime feel compromised, and probably give up the chance for radical improvements)?

Given the current (less than acceptable) state of the art with regard to PDF access, which (less than ideal) goal do we pursue:

1) Purge PDF from the web with the same vigor with which we fight missing ALT text. This would include removing Access Adobe, since it gives the mistaken impression that there are easy work-arounds for dealing with PDF. Never mind the mainstream opposition we will face, nor our own brethren we will anger when what poor tools there are go away. Organizations like the FDA will either break or be given more money to do this aspect of their job properly.

2) Accept the status quo. We should be grateful for what tools are handed to us, and we can plead for help on a case-by-case basis when they don't work. In the meantime, we can educate, much like we do with the majority of WAI issues. Hopefully, work on PDF translation will continue.
I would guess that the kind of optical character recognition that is needed involves the same kind of artificial intelligence that is needed for understanding language and for real voice recognition.

At a minimum, the Acrobat Reader (and Access Adobe) should warn the user when there is no text associated with the images displayed. I would like to see Access Adobe offering free state-of-the-art OCR for PDF documents.

Bruce
bbailey@clark.net
Received on Thursday, 3 September 1998 11:21:51 UTC