RE: Removing PDFs and accessibility from Andrew Kirkpatrick on 2012-03-26 (w3c-wai-ig@w3.org from January to March 2012)

From: Andrew Kirkpatrick <akirkpat@adobe.com>
Date: Mon, 26 Mar 2012 10:37:51 -0700
To: "Ozi, Selim" <sozi1@mscd.edu>, "wed@csulb.edu" <wed@csulb.edu>, David Woolley <forums@david-woolley.me.uk>
CC: "w3c-wai-ig@w3.org" <w3c-wai-ig@w3.org>
Message-ID: <EE43A638A0C5E34E80AF78EFE940FC2C019BC138@nambx09.corp.adobe.com>
Hmm.  Acrobat does indicate that a file may be a scanned PDF, and can also tell if the file is "not scanned and not tagged".  Of course it can tell if a file is tagged, but human verification is still needed to determine whether the file is well tagged.

I'm not sure what type of external SDK interface is available for each of these results, but there may be some possibilities.  The place to check is the PDF Accessibility API reference found here: http://www.adobe.com/devnet/acrobat/interapplication_communication.html.  I'll see what I can figure out on my end.

AWK

-----Original Message-----
From: Ozi, Selim [mailto:sozi1@mscd.edu] 
Sent: Monday, March 26, 2012 1:25 PM
To: Andrew Kirkpatrick; wed@csulb.edu; David Woolley
Cc: w3c-wai-ig@w3.org
Subject: RE: Removing PDFs and accessibility

Andrew, thank you for your quick response.
My question/suggestion is a little unconventional :)

Could Adobe differentiate the saved format as image (PDF/I), OCR'ed (PDF/O), or tagged with reading order (PDF/T), so that users can save in the default formats only?
For example, if I am a faculty member who finds an HTML article online and saves it as a PDF to post on my LMS, the article should be saved by default as PDF/O (OCR'ed). Then, as a student using a screen reader, when JAWS detects PDF/O it will tell me that the file is text but has no tags, no touched-up reading order, and no alt text, so that I can skip the image or OCR file and go to the next alternative.

If I am photocopying or scanning an article without Acrobat Pro and creating image files, the result can be identified to the user as PDF/I, so an end user using AT will have the option to view it or not.

This option would allow software engineers like me to write simple scripts for Blackboard servers, web accessibility checkers, etc. to detect accessibility issues.
If my web content management system or LMS can detect that a PDF/I has been posted with no alternative (an image-only file posted by the content provider), this will give me time to repair it and be WCAG 2.0 AA compliant by providing PDF/T or structured HTML5.
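
A minimal sketch of the kind of detection script described above, in Python. The PDF/I, PDF/O, and PDF/T labels are the hypothetical categories from this proposal, not real PDF standards, and `classify_pdf_bytes` is an illustrative name. The check is a crude scan of the raw bytes: it assumes the `/StructTreeRoot` and `/Font` names are visible uncompressed in the file, which is often not the case when a PDF uses compressed object streams, so a real checker would need a proper PDF parser. It also says nothing about whether a tagged file is well tagged.

```python
import re

# Hypothetical labels from the proposal above; they are not real PDF standards.
IMAGE_ONLY = "PDF/I"      # no text layer found
TEXT_UNTAGGED = "PDF/O"   # text (for example from OCR) but no structure tree
TAGGED = "PDF/T"          # a structure tree is present


def classify_pdf_bytes(data: bytes) -> str:
    """Roughly classify a PDF by scanning its raw bytes for marker names."""
    if not data.startswith(b"%PDF-"):
        raise ValueError("not a PDF file")
    # Tagged PDF files reference a structure tree root from the catalog.
    if b"/StructTreeRoot" in data:
        return TAGGED
    # A font resource suggests that some page carries real or OCR'ed text.
    # The word boundary keeps /FontDescriptor and similar names from matching.
    if re.search(rb"/Font\b", data):
        return TEXT_UNTAGGED
    return IMAGE_ONLY
```

A script on an LMS server could read each uploaded file with `open(path, "rb").read()`, pass the bytes to `classify_pdf_bytes`, and flag anything reported as PDF/I that has no alternative posted alongside it.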

Thanks.
Selim Özi
Access Center.
Metropolitan State University of Denver.
www.mscd.edu/access


-----Original Message-----
From: Andrew Kirkpatrick [mailto:akirkpat@adobe.com]
Sent: Monday, March 26, 2012 10:58 AM
To: Ozi, Selim; wed@csulb.edu; David Woolley
Cc: w3c-wai-ig@w3.org
Subject: RE: Removing PDFs and accessibility

Selim,
We do have a number of possible options:
Authors using Acrobat can:
1) Save a PDF to HTML, Word, RTF, or Text
2) Perform OCR on a scanned document and deliver it as PDF, HTML, Word, RTF, or Text
3) Add tags to untagged documents, whether starting as a scanned document or not

End users with Reader can:
1) Read the document as PDF
2) Export to .txt from Reader directly
3) Upload a file, including scanned PDF files, to an additional service called Export PDF (https://www.acrobat.com/exportpdf/en/home.html) which can export the PDF document as Word, Excel, or RTF, and will perform OCR on scanned files first.  This is an additional service, which costs $19.99/year.

I'm not sure if I covered all of the questions you raised; let me know if not.
AWK

-----Original Message-----
From: Ozi, Selim [mailto:sozi1@mscd.edu]
Sent: Monday, March 26, 2012 12:36 PM
To: Andrew Kirkpatrick; wed@csulb.edu; David Woolley
Cc: w3c-wai-ig@w3.org
Subject: RE: Removing PDFs and accessibility

Great thread of information about the web, accessibility, and PDF.
My question is to Andrew:
Can Adobe allow the user to choose which PDF format to create, save, or view?
Something like below:
1- PDF/I  = image PDF
2- PDF/O  = OCR
3- PDF/TR = Tagged/touched-up reading order

This way, developers and content providers could give the end user the information needed to choose the medium: whether to view this format with a screen reader or to go to a structured HTML format instead.

Thanks.
Selim Özi
Accessible Technology Specialist
Access Center.
Metropolitan State University of Denver.
www.mscd.edu/access


-----Original Message-----
From: Andrew Kirkpatrick [mailto:akirkpat@adobe.com]
Sent: Monday, March 26, 2012 9:49 AM
To: wed@csulb.edu; David Woolley
Cc: w3c-wai-ig@w3.org
Subject: RE: Removing PDFs and accessibility

Unfortunately the original post doesn't allow comments.  My gripe with this post is that it makes many false claims and uses them as evidence to support a conclusion which may be true.  No actual data or scientific rigor is offered, which makes this interesting as anecdotal data, but nothing more.  I'd like to see more information on the study performed, and I offer the following questions to consider.

From the article, with comments:
Mark said major disadvantages of PDFs include:
*	not showing up in search results
PDF documents do show up in search results.  Google and Bing both index and include PDF documents in search results.

*	failing Australian Human Rights Commission requirements for being accessible to people with a disability, such as compatibility with screen readers
Differences do exist, to be sure, but NVDA, a free screen reader on Windows, provides nearly the same level of support as JAWS (support for headings is one of the main issues remaining, and I expect we'll see that addressed soon).  VoiceOver with PDF documents on the Mac is not as good as the Windows options, but the document content can be read and used.  The level of support is better than what is provided by a text-only or RTF document, which the AHRC does suggest is sufficient.
I realize that this department is in the state government, but it is worth noting that AGIMO in the federal government agrees that well-authored PDF documents can meet WCAG 2.0 and can be used within the government to comply with the National Transition Strategy:

(http://agimo.govspace.gov.au/2012/01/12/release-of-wcag-2-0-techniques-for-pdf/comment-page-1/#comment-5632) "As stated, the PDF Sufficient Techniques are now available, so technically an agency can rely on PDF by using the WCAG 2.0 PDF Sufficient Techniques and all applicable General Techniques, and will be considered to be complying with the NTS. This addresses one of the findings of our PDF study by ensuring the design of the PDF file is optimised for accessibility."

More on this in a bit...

*	penalising people who have slow internet connections
*	often extremely large document sizes.
These are really the same point, so I'll address them together.  Some PDF documents do get rather large, some outrageously so.  However, PDF documents can and should be authored to be as light as possible, so while it may be that a 300 page report is large no matter what an author does, PDF documents in general need not be bloated in size and authors who are tending to their work can easily avoid this.  Adobe Acrobat also offers a batch process which can watch a specific folder and when PDF documents are added there it can take the steps to reduce the file size automatically if desired.  Others have commented on the convenience of PDF documents for users also, so at a minimum offering a PDF document for some documents can be viewed as helping some users. 

Back to the main question:  Does replacing PDF documents with HTML documents increase web traffic?   I don't know the answer, but I am certain that the answer is not as simple as a quick look at the server log data.  There are complicated questions to be asked:

1)	were the PDF documents that were replaced built as tagged PDF documents to maximize their accessibility?
2)	How much of the additional traffic was bots?   Given a recent study on the amount of internet traffic that is non-human (http://www.itproportal.com/2012/03/14/51-internet-traffic-non-human/#ixzz1p7FFrR84) and the broad introduction of new pages and links, I wonder whether the bot percentage here is greater than the 51% cited in the Incapsula report, because spiders and other bots may be exploring the new pages.  (Disclaimer: I haven't read the Incapsula report in any depth and can't say whether it is accurate or whether there are reasons that it may not be similar in the Victoria DPI case.)
3)	What methodology for measuring the results was used?  If it is just hits on a page, it might make sense that going from 6000 pages and 9000 PDF files (15K URIs) to 22000 HTML pages would result in a larger number of hits.  Some quick "back of the envelope" math shows that there are now 1.47 times as many indexable pages, and the number of hits has risen by a factor of 1.38.
4)	Is it possible to review a collection of 10-20 representative PDF documents and the HTML analogs for them and see how the stats for those specific documents break down?  That would be interesting.
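
The arithmetic behind question 3 can be checked in a few lines of Python. The page counts and the 1.38 hit factor are the figures quoted above; the last line, hits per URI, is an extra step not stated in the message, and it assumes hits can be meaningfully averaged across URIs, which is only a rough approximation.

```python
# Figures quoted in question 3 above.
old_html_pages = 6000
old_pdf_files = 9000
new_html_pages = 22000
hit_growth = 1.38  # reported factor by which hits rose

old_uris = old_html_pages + old_pdf_files      # URIs before the change
uri_growth = new_html_pages / old_uris         # growth in indexable pages
hits_per_uri_change = hit_growth / uri_growth  # below 1 means fewer hits per URI

print(f"URIs before: {old_uris}")
print(f"URI growth: {uri_growth:.2f}x")
print(f"Hits per URI, after vs. before: {hits_per_uri_change:.2f}x")
```

Under that averaging assumption the ratio comes out slightly below 1, which is why a rise in total hits alone does not show that each page is being found more often.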

I'm sure that there are other interesting questions, but that's a start.
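
For question 2, one rough way to separate bot hits from human hits is to look at the User-Agent string recorded with each request. The sketch below is only illustrative: `BOT_MARKERS`, `is_probable_bot`, and `split_hits` are hypothetical names, the marker list is far from complete, and bots that present a browser-like User-Agent will be counted as human.

```python
from __future__ import annotations

# Hypothetical marker substrings; real crawler detection needs a maintained list.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")


def is_probable_bot(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a known crawler marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def split_hits(user_agents: list[str]) -> tuple[int, int]:
    """Return (probable human hits, probable bot hits) for the given User-Agents."""
    bots = sum(1 for ua in user_agents if is_probable_bot(ua))
    return len(user_agents) - bots, bots
```

Comparing the two counts before and after the switch to HTML would show how much of the change in hits comes from requests that identify themselves as crawlers.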

To the question of whether you should take this approach and replace your PDF documents with HTML files - maybe you should, but I'm not convinced that the hit count is a reason that you can depend on.  If you are hearing from your users that they prefer HTML files over PDF, then offer HTML.  If you are finding that maintenance is easier with another format, use that other format.  There are many reasons why you may want to offer HTML documents, but you should also recognize that there are valid reasons for using PDF documents, and if you find that these reasons make sense for you, use PDF.  But, when you do use PDF, follow best practices for making sure the PDF documents meet WCAG 2.0.

Thanks,
AWK

Andrew Kirkpatrick
Group Product Manager, Accessibility
Adobe Systems 

akirkpat@adobe.com
http://twitter.com/awkawk
http://blogs.adobe.com/accessibility


-----Original Message-----
From: Wayne Dick [mailto:wayneedick@gmail.com]
Sent: Sunday, March 25, 2012 2:54 PM
To: David Woolley
Cc: w3c-wai-ig@w3.org
Subject: Re: Removing PDFs and accessibility

Just making an attempt to move away from PDF as a system to view web content is a great move forward.  It recognizes the issue that PDF is a poor online reading medium for many people with visual impairments.
Thank you Cosmic Muffin.

The primary application will be in the area of content meant for reading.  When an article is written in PDF, it generally increases the workload for reading online significantly, especially for a person with low vision.  Since most sighted people just print PDF articles, this introduces a major inequality of work for people with full sight vs. people with partial sight.

Obtaining a high-quality translation will be the trick.  The tag spaces are not isomorphic, and tagged PDF enables meaningful text styling to be embedded in blocks of untagged data.  As such I do not see a programmatically determined method of translation existing.  However a good heuristic will probably suffice.

Thanks for the article, good luck Victoria.

Wayne Dick

On 3/25/12, David Woolley <forums@david-woolley.me.uk> wrote:
> David Woolley wrote:
>
>>
>> Incidentally, I have often sought out PDFs because they are not 
>> fragmented into pages,
>
> The big problem I often find with lots of small hyperlinked pages, on 
> sites (typically governmental, or software support) that should be 
> information rich, is that one ends up going round circles, never 
> actually getting to the detail you want.  I suspect that is often 
> because that level of detail just does not exist, but unless one maps 
> out the whole site and proves that you have seen all the pages, one 
> can never be sure of that.
>
> A single, linearised, document makes it much easier for the reader to 
> be sure that information is not present and makes it much harder for 
> the author to avoid answering difficult questions by just hyperlinking 
> you backwards and forwards.
>
>
>
Received on Monday, 26 March 2012 17:38:37 UTC