Re: Adobe And TRACE Launch Enhanced PDF Access Via Email from Bruce Bailey on 1998-09-03 (w3c-wai-ig@w3.org from July to September 1998)

From: Bruce Bailey <bbailey@clark.net>
Date: Thu, 03 Sep 1998 11:24:32 -0400
To: "w3c-wai-ig@w3.org" <w3c-wai-ig@w3.org>
CC: Paul Stauffer 301-827-5694 FAX 301-443-6385 <STAUFFERP@cder.fda.gov>, Robert Neff <rcn@fenix2.dol-esa.gov>, "T. V. Raman" <raman@Adobe.COM>
Message-ID: <35EEB4AF.78F78130@clark.net>

Many thanks to Robert Neff and Paul Stauffer.  I'll share the good stuff
(links) first and then proceed with my pontificating!

Skeptic that I am, I must admit that Paul has pointed me to a mainstream public
site where PDFs are, in my humble opinion, good and appropriate.  Some examples
(and what happens with the Access Adobe translation):
 http://www.fda.gov/cder/guidance/1326fnl.pdf :  Started life as a word
processing document and is pretty small compared to the other samples
referenced.  The text comes through perfectly.  Most formatting is lost,
although bold and italics are retained.  The graphic and emphasis from title
page are lost.
 http://www.fda.gov/cder/guidance/1716dft.pdf :  Also originally a word
processing document, but this one contains formulas written using built-in
tools (as opposed to being pasted in as a graphic).  These equations are
handled in a consistent fashion, but too much information is lost to be usable.

 http://www.fda.gov/cder/foi/nda/97/020184ap.pdf : Is a good example of a
composite document; it includes text of varying quality, graphical formulas,
and attachments of typed-in forms.  The whole thing has been run through a
decent OCR process, but with predictable results.  I would guess it is about
95% accurate.  (Those who use OCR daily will tell you that this level of
accuracy is not acceptable.)  The formulas are total gibberish, and misspelled
words and artifacts abound.
 http://www.fda.gov/cder/guidance/old098fn.pdf : Is an older document, and has
not been run through OCR.  The translation shows only the page numbers!  No
warning messages of any kind are generated.

It is interesting to note that all of the above look similar in an Acrobat
Reader window!  The casual browser would have no way of knowing which is an
image and which has text.  A word search reports only "not found", even when
the document contains no text to search at all!

It was interesting to me that Access Adobe returned translations to me faster
than Acrobat Reader could display the originals.  This is not too surprising
given that both Adobe and the FDA have T3 connections to the internet and I am
using dial-up!  Were the translations better, no doubt the service would be
totally overwhelmed by home users.  I postulate that it is in Adobe's interest
to keep this free service mediocre!  The curious might wish to try pasting the
above links into their simple form (http://access.adobe.com/simple_form.html).

The Food and Drug Administration basically has to choose between providing
copious PDFs or trickling out HTML.  The documents are coming from a variety of
sources (including paper).  This is the kind of devil-in-the-detail choice
disability rights advocates are regularly faced with.  Do we wage the
impossible war against the system (and thereby stay true to our ideals, but in
the meantime accomplish little) or work from within the system to effect change
(and in the meantime feel compromised, and probably give up the chance for
radical improvements)?  Given the current (less than acceptable) state of the
art with regard to PDF access, which (less than ideal) goal do we pursue:
1)  Purge PDF from the web with the same vigor we fight missing ALT text.  This
would include removing Access Adobe, since it gives the mistaken impression
that there are easy work-arounds to dealing with PDF.  Never mind the
mainstream opposition we will face, nor our own brethren we will anger when
what poor tools there are go away.  Organizations like the FDA will either
break or be given more money to do this aspect of their job properly.
2)  Accept the status quo.  We should be grateful for what tools are handed to
us and we can plead for help on a case-by-case basis when they don't work.  In
the meantime, we can educate, much like we do with the majority of WAI issues.

Hopefully, work on PDF translation will continue.  I would guess that the kind
of optical character recognition that is needed involves the same kind of
artificial intelligence that is needed for understanding language and real
voice recognition.

At a minimum, the Acrobat Reader (and Access Adobe) should warn the user when
there is no text associated with the images displayed.  I would like to see
Access Adobe offering free state-of-the-art OCR for PDF documents.
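
To make the kind of warning I am asking for concrete, here is a rough sketch in
Python.  It is only an illustration, not how Acrobat Reader or Access Adobe
actually work.  It assumes some earlier step (not shown) has already pulled the
text out of a PDF, one string per page; the function names and the wording of
the messages are made up for the example.

```python
import re

# Words that commonly appear in page-number footers (an assumption for this sketch).
PAGE_WORDS = {"page", "of"}


def classify_extracted_text(pages):
    """Classify text extracted from a document, given one string per page.

    Returns "no-text" if nothing at all was extracted, "page-numbers-only"
    if every token is a number, punctuation, or a page-footer word, and
    "has-text" otherwise.
    """
    words = []
    for page in pages:
        words.extend(page.split())

    if not words:
        return "no-text"

    for word in words:
        is_number_or_punct = re.fullmatch(r"[\d\W]+", word) is not None
        if not is_number_or_punct and word.lower() not in PAGE_WORDS:
            return "has-text"
    return "page-numbers-only"


def warning_for(pages):
    """Return a warning message for the user, or None if real text was found."""
    status = classify_extracted_text(pages)
    if status == "no-text":
        return "Warning: this document contains no text; it appears to be images only."
    if status == "page-numbers-only":
        return "Warning: only page numbers were found; the pages appear to be images."
    return None
```

With a check like this, a translation that comes back as nothing but "1", "2",
"3" would carry a warning instead of arriving silently.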

Bruce
bbailey@clark.net

Received on Thursday, 3 September 1998 11:21:51 UTC