Re: Accessible PDF Repair

From: David Woolley <forums@david-woolley.me.uk>
Date: Sat, 02 Mar 2013 09:47:08 +0000
Message-ID: <5131CA9C.7060903@david-woolley.me.uk>
To: "w3c-wai-ig@w3.org" <w3c-wai-ig@w3.org>

Lars Ballieu Christensen wrote:
> 
> You may want to consider the automated PDF conversion features of 
> RoboBraille. You can use the RoboBraille service to convert all types of 
> pdf files into more accessible formats, including tagged pdf. 
> 

Heuristics will often successfully detect re-flowable text, and there 
are even reasonable heuristics for working out word spaces in 
micro-spaced documents that didn't use the PDF support for 
micro-spacing.  (Most Windows-generated PDF contains no spaces; it 
outputs printable characters without associating them into words, with 
a move between each character.)  Even so, I don't believe the state of 
AI is currently up to a level where it could properly tag a final-form 
document, unless it had a machine-readable definition of the style sheet 
and the document was properly authored to that style sheet.
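
To make the word-space heuristic concrete, here is a minimal sketch of 
one way it could work.  It is an illustration only, not what any 
particular tool does: the (char, x, width) glyph tuples, the function 
name and the gap_factor threshold are all assumptions of mine.

```python
from statistics import median

def group_into_words(glyphs, gap_factor=0.25):
    """Group individually positioned glyphs on one line into words.

    glyphs: list of (char, x, width) tuples in left-to-right order
            (a hypothetical representation of extracted PDF text).
    A horizontal gap wider than gap_factor times the median glyph
    width is treated as a word space; the factor is an assumed tuning
    value, not something taken from the PDF specification.
    """
    if not glyphs:
        return []
    threshold = gap_factor * median(w for _, _, w in glyphs)
    words = [glyphs[0][0]]
    prev_end = glyphs[0][1] + glyphs[0][2]
    for ch, x, w in glyphs[1:]:
        if x - prev_end > threshold:
            words.append(ch)      # large gap: start a new word
        else:
            words[-1] += ch       # small gap: same word, micro-spaced
        prev_end = x + w
    return words

# Four glyphs with no space characters, only position moves.
line = [("t", 0.0, 5.0), ("o", 5.2, 5.0), ("b", 13.0, 5.0), ("e", 18.1, 5.0)]
print(group_into_words(line))
```

Note that this only recovers word boundaries.  It says nothing about 
what role the words play in the document, which is the hard part.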

Note I don't mean a CSS style sheet; I mean a style sheet of the kind 
that would be given to a human author.  Although the SS in CSS comes 
from that concept, the way CSS is often used is not like the way a 
style sheet would be used for a human author.

Even with a style sheet, one would not be able to distinguish between 
the standard renderings of citation and emphasis in Western languages, 
so one would have to tag them presentationally, as italics.  To do 
otherwise would require language understanding that goes beyond current 
internet machine translation capabilities.
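
The ambiguity can be sketched as follows.  The style sheet contents, 
the rendering names and both function names are hypothetical, chosen 
only to show why an italic run cannot be mapped back to a single 
semantic role.

```python
# Hypothetical machine-readable style sheet: semantic role -> rendering.
STYLE_SHEET = {
    "citation": "italic",
    "emphasis": "italic",
    "heading": "bold-large",
}

def roles_for_rendering(rendering, style_sheet=STYLE_SHEET):
    """Return every semantic role the style sheet renders this way."""
    return sorted(role for role, r in style_sheet.items() if r == rendering)

def tag_for_rendering(rendering, style_sheet=STYLE_SHEET):
    """Pick a semantic tag only when the rendering is unambiguous;
    otherwise fall back to the presentational name itself."""
    roles = roles_for_rendering(rendering, style_sheet)
    if len(roles) == 1:
        return roles[0]
    return rendering

print(roles_for_rendering("italic"))
print(tag_for_rendering("italic"))
print(tag_for_rendering("bold-large"))
```

Choosing between the two candidate roles for an italic run would need 
an understanding of the text itself, which the style sheet cannot 
supply.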

I'd therefore take any claim to recover tagged PDF from pure final-form 
PDF with a pinch of salt.  Basically, only humans can tag documents 
with any reasonable level of reliability, which makes it expensive, and 
is why documents that were not tagged properly when first written are 
unlikely to get properly tagged thereafter.

Also, I haven't tried the tools, but if they work on PDFs marked as copy 
and paste disallowed, I would have concerns that they may violate the 
DMCA and the equivalent copyright law provisions in the UK and 
elsewhere.  Accessibility interfaces tend to get some dispensation from 
copy protection schemes on the understanding that they are only used to 
create transient versions for the end user, not to extract the text into 
a revisable form.


-- 
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.