- From: David Woolley <forums@david-woolley.me.uk>
- Date: Sat, 02 Mar 2013 09:47:08 +0000
- To: "w3c-wai-ig@w3.org" <w3c-wai-ig@w3.org>
Lars Ballieu Christensen wrote:
> You may want to consider the automated PDF conversion features of
> RoboBraille. You can use the RoboBraille service to convert all types of
> pdf files into more accessible formats, including tagged pdf.

Although there are heuristics that will often successfully detect re-flowable text, and there are even reasonable heuristics for working out word spaces in micro-spaced documents that didn't use the PDF support for micro-spacing (most Windows-generated PDF contains no spaces; it outputs printable characters without associating them into words, with a move between each character), I don't believe the state of AI is currently up to a level where it could properly tag a final form document, unless it had a machine-readable definition of the style sheet and the document was properly authored to that style sheet.

Note I don't mean a CSS style sheet; I mean a style sheet of the kind that would be given to a human author. Although the SS in CSS comes from that concept, the way CSS is often used is not like the way a style sheet would be used for a human author.

Even with a style sheet, one would not be able to distinguish between the standard renderings of citation and emphasis in Western languages, so one would have to tag them presentationally, as italics. To do otherwise would require language understanding that goes beyond current internet machine translation capabilities.

I'd therefore take any claim to recover tagged PDF from pure final form PDF with a pinch of salt. Basically, only humans can tag documents with any reasonable level of reliability, which makes it expensive, and is why documents that were not tagged properly when first written are unlikely to get properly tagged thereafter.

Also, I haven't tried the tools, but if they work on PDFs marked as copy and paste disallowed, I would have concerns that they may violate the DMCA and the equivalent copyright law provisions in the UK and elsewhere.
Accessibility interfaces tend to get some dispensation from copy protection schemes on the understanding that they are only used to create transient versions for the end user, not to extract the text into a revisable form.

-- 
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.
Received on Saturday, 2 March 2013 09:47:34 UTC