RE: Accessible PDF Repair

From: Ian Sharpe <themanxsharpy@gmail.com>
Date: Sat, 2 Mar 2013 20:18:54 -0000
To: "'David Woolley'" <forums@david-woolley.me.uk>, <w3c-wai-ig@w3.org>
Message-ID: <F354A7F4BEEF4169BBB2727B7D099783@BLACKBOX>
I'm no expert in PDF accessibility, tagging etc. But having worked on facial
image recognition software over 15 years ago now and loosely followed
progress in this area, I am really surprised that current OCR technology
couldn't make at least a decent stab at automating the tagging process of
scanned documents.

I do totally appreciate that there are going to be times when an automated
tagging approach might struggle, for example in providing alternative text
for images (although maybe even that is starting to become possible these
days), but surely it would be good enough to provide enough information to
significantly improve the accessibility of the untagged document?

Is it simply the case that nobody has chosen to use today's scanning and
analysis technology to produce a tagged document, or am I missing something?

Apart from images, the only problem I can think of off the top of my head is
how OCR technology could work out where a link points to, but maybe there
are other ways to obtain this information.

As I said though, I'm not an expert in this area and am just curious to
understand the problem.

Cheers
Ian





-----Original Message-----
From: David Woolley [mailto:forums@david-woolley.me.uk] 
Sent: 02 March 2013 09:47
To: w3c-wai-ig@w3.org
Subject: Re: Accessible PDF Repair

Lars Ballieu Christensen wrote:
> 
> You may want to consider the automated PDF conversion features of 
> RoboBraille. You can use the RoboBraille service to convert all types 
> of pdf files into more accessible formats, including tagged pdf.
> 

Although there are heuristics that will often successfully detect 
re-flowable text, and there are even reasonable heuristics for working 
out word spaces in micro-spaced documents that didn't use the PDF 
support for micro-spacing (most Windows-generated PDF contains no space 
characters and outputs each printable character separately, with a move 
between each one, rather than grouping them into words), I don't believe 
the state of AI is currently up to a level where it could properly tag a 
final-form document, unless it had a machine-readable definition of the 
style sheet and the document was properly authored to that style sheet.
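
A minimal sketch of the kind of word-space heuristic mentioned above, 
assuming each character on a line is available with its left edge and 
advance width. The Glyph type, the group_into_words function and the 
gap_ratio threshold are hypothetical names and values chosen for 
illustration, not taken from any particular PDF tool:

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph, in page units (assumed input)
    width: float  # advance width of the glyph (assumed input)


def group_into_words(glyphs, gap_ratio=0.25):
    """Join individually positioned glyphs on one line into words.

    A horizontal gap wider than gap_ratio times the median glyph width
    is treated as a word space. The 0.25 default is an arbitrary guess.
    """
    if not glyphs:
        return []
    ordered = sorted(glyphs, key=lambda g: g.x)
    threshold = gap_ratio * median(g.width for g in ordered)
    words = [ordered[0].char]
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > threshold:
            words.append(cur.char)   # large gap: start a new word
        else:
            words[-1] += cur.char    # small gap: same word
    return words


line = [
    Glyph("t", 0.0, 5.0), Glyph("o", 5.2, 5.0),
    Glyph("b", 13.0, 5.0), Glyph("e", 18.1, 5.0),
]
print(group_into_words(line))  # ['to', 'be']
```

A fixed ratio like this is likely to misfire on justified or heavily 
kerned text, where gaps vary from line to line, which is part of why 
such heuristics only succeed "often" rather than always.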

Note I don't mean a CSS style sheet; I mean a style sheet of the kind 
that would be given to a human author.  Although the SS in CSS comes 
from that concept, the way it is often used is not like the way that one 
would be used for a human author.

Even with a style sheet, one would not be able to distinguish between 
the standard renderings of citation and emphasis in Western languages, 
so one would have to tag them presentationally, as italics.  To do 
otherwise would require language understanding that goes beyond current 
internet machine translation capabilities.

I'd therefore take any claim to recover tagged PDF from pure final-form 
PDF with a pinch of salt.  Basically, only humans can tag documents 
with any reasonable level of reliability, which makes it expensive, and 
is why documents which were not tagged properly when first written are 
unlikely to get properly tagged thereafter.

Also, I haven't tried the tools, but if they work on PDFs marked as copy 
and paste disallowed, I would have concerns that they may violate the 
DMCA, and the equivalent UK, etc., copyright law provisions. 
Accessibility interfaces tend to get some dispensation from copy 
protection schemes on the understanding that they are only used to 
create transient versions for the end user, not to extract the text into 
a revisable form.


-- 
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.
Received on Saturday, 2 March 2013 20:19:26 GMT
