- From: Ian Sharpe <themanxsharpy@gmail.com>
- Date: Sat, 2 Mar 2013 20:18:54 -0000
- To: "'David Woolley'" <forums@david-woolley.me.uk>, <w3c-wai-ig@w3.org>
I'm no expert in PDF accessibility, tagging etc. But having worked on facial image recognition software over 15 years ago and loosely followed progress in this area, I am really surprised that current OCR technology couldn't make at least a decent stab at automating the tagging of scanned documents.

I do totally appreciate that there are going to be times when an automated tagging approach might struggle, for example in providing alternative text for images (although maybe even that is starting to become possible these days), but surely it would provide enough information to significantly improve the accessibility of the untagged document? Is it simply the case that nobody has chosen to use today's scanning and analysis technology to produce a tagged document, or am I missing something?

Apart from images, the only problem I can think of off the top of my head is how OCR technology could work out where a link points, but maybe there are other ways to obtain this information.

As I said though, I'm not an expert in this area and am just curious to understand the problem.

Cheers
Ian

-----Original Message-----
From: David Woolley [mailto:forums@david-woolley.me.uk]
Sent: 02 March 2013 09:47
To: w3c-wai-ig@w3.org
Subject: Re: Accessible PDF Repair

Lars Ballieu Christensen wrote:
>
> You may want to consider the automated PDF conversion features of
> RoboBraille. You can use the RoboBraille service to convert all types
> of pdf files into more accessible formats, including tagged pdf.
>

Although there are heuristics that will often successfully detect re-flowable text, and there are even reasonable heuristics for working out word spaces in micro-spaced documents that didn't use the PDF support for micro-spacing (most Windows generated PDF contains no spaces and outputs printable characters without associating them into words and with a move between each character), I don't believe the state of AI is currently up to a level where it could properly tag a final form document, unless it had a machine readable definition of the style sheet and the document was properly authored to that style sheet.

Note I don't mean a CSS style sheet; I mean a style sheet of the kind that would be given to a human author. Although the SS in CSS comes from that concept, the way CSS is often used is not like the way a style sheet would be used for a human author.

Even with a style sheet, one would not be able to distinguish between the standard renderings of citation and emphasis in Western languages, so one would have to tag them presentationally, as italics. To do otherwise would require language understanding that goes beyond current internet machine translation capabilities.

I'd therefore take any claim to recover tagged PDF from pure final form PDF with a pinch of salt.

Basically, only humans can tag documents with any reasonable level of reliability, which makes it expensive, and is why documents which were not tagged properly when first written are unlikely to get properly tagged thereafter.

Also, I haven't tried the tools, but if they work on PDFs marked as copy and paste disallowed, I would have concerns that they may violate the DMCA and the equivalent copyright law provisions in the UK and elsewhere. Accessibility interfaces tend to get some dispensation from copy protection schemes on the understanding that they are only used to create transient versions for the end user, not to extract the text into a revisable form.
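To make the word-space heuristic mentioned above concrete, here is a minimal sketch in Python. It is not taken from any particular tool: the Glyph record, the group_into_words function and the gap_ratio threshold are all hypothetical, and it assumes the glyph positions and widths for a single line have already been extracted from the PDF.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Glyph:
    # Hypothetical record for one individually positioned character.
    char: str
    x: float      # left edge of the glyph on the baseline
    width: float  # advance width of the glyph


def group_into_words(glyphs, gap_ratio=0.25):
    """Rebuild words from separately placed glyphs on one line.

    A horizontal gap wider than gap_ratio times the median glyph width
    is treated as a word space. The ratio is an assumed tuning value.
    """
    if not glyphs:
        return []
    glyphs = sorted(glyphs, key=lambda g: g.x)
    threshold = gap_ratio * median(g.width for g in glyphs)
    words = [glyphs[0].char]
    for prev, cur in zip(glyphs, glyphs[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > threshold:
            words.append(cur.char)   # gap is large: start a new word
        else:
            words[-1] += cur.char    # gap is small: same word
    return words


line = [
    Glyph("t", 0.0, 5.0),
    Glyph("o", 5.2, 5.0),
    Glyph("b", 13.0, 5.0),
    Glyph("e", 18.1, 5.0),
]
print(group_into_words(line))
```

This only recovers word boundaries. It says nothing about whether a run of text is a heading, a citation or emphasis, which is the part that needs the style sheet or a human.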
--
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.
Received on Saturday, 2 March 2013 20:19:26 UTC