
RE: Accessible PDF Repair

From: <accessys@smart.net>
Date: Sat, 2 Mar 2013 15:36:30 -0500 (EST)
To: Ian Sharpe <themanxsharpy@gmail.com>
cc: "'David Woolley'" <forums@david-woolley.me.uk>, w3c-wai-ig@w3.org
Message-ID: <Pine.LNX.4.60.1303021533110.6546@cygnus.smart.net>

I think you may have hit the nail on the head: no one considers the
community of people with disabilities worth spending the research and
money on.

what is available is
. cobbled together from some other use (PDF)
. very expensive (JAWS)
. or done by users (Emacspeak)
as examples

no one has really taken this on as a serious project; the economic
returns on investment are not "perceived" to be there


On Sat, 2 Mar 2013, Ian Sharpe wrote:

> Date: Sat, 2 Mar 2013 20:18:54 -0000
> From: Ian Sharpe <themanxsharpy@gmail.com>
> To: 'David Woolley' <forums@david-woolley.me.uk>, w3c-wai-ig@w3.org
> Subject: RE: Accessible PDF Repair
> Resent-Date: Sat, 02 Mar 2013 20:19:27 +0000
> Resent-From: w3c-wai-ig@w3.org
> I'm no expert in PDF accessibility, tagging etc. But having worked on facial
> image recognition software over 15 years ago now and loosely followed
> progress in this area, I am really surprised that current OCR technology
> couldn't make at least a decent stab at automating the tagging process of
> scanned documents.
> I do totally appreciate that there are going to be times when an automated
> tagging approach might struggle, providing say alternative text for images
> for example (although maybe even that is starting to become possible these
> days), but surely it would be good enough to provide enough information to
> significantly improve the accessibility of the untagged document?
> Is it simply the case that nobody has chosen to use today's scanning and
> analysis technology to produce a tagged document, or am I missing something?
> Apart from images, the only problem I can think of off the top of my head is
> how OCR technology could work out where a link references, but maybe there
> are other ways to obtain this information.
> As I said though, I'm not an expert in this area and am just curious to
> understand the problem.
> Cheers
> Ian
> -----Original Message-----
> From: David Woolley [mailto:forums@david-woolley.me.uk]
> Sent: 02 March 2013 09:47
> To: w3c-wai-ig@w3.org
> Subject: Re: Accessible PDF Repair
> Lars Ballieu Christensen wrote:
>> You may want to consider the automated PDF conversion features of
>> RoboBraille. You can use the RoboBraille service to convert all types
>> of pdf files into more accessible formats, including tagged pdf.
> Although there are heuristics that will often successfully detect
> re-flowable text, and there are even reasonable heuristics for working
> out word spaces in micro-spaced documents that didn't use the PDF
> support for micro-spacing (most Windows-generated PDF contains no spaces
> and outputs printable characters without associating them into words and
> with a move between each character), I don't believe the state of AI is
> currently up to a level where it could properly tag a final form
> document, unless it had a machine readable definition of the style sheet
> and the document was properly authored to that style sheet.
> Note I don't mean a CSS style sheet; I mean a style sheet of the kind that
> would be given to a human author.  Although the SS in CSS comes from that
> concept, the way it is often used is not like the way that one would be
> used for a human author.
> Even with a style sheet, one would not be able to distinguish between
> the standard renderings of citation and emphasis, in Western languages,
> so one would have to tag them presentationally, as italics.  To do
> otherwise would require language understanding that goes beyond current
> internet machine translation capabilities.
> I'd therefore take any claim to recover tagged PDF, from pure final form
> PDF, with a pinch of salt.  Basically, only humans can tag documents
> with any reasonable level of reliability, which makes it expensive, and
> is why documents that were not tagged properly when first written are
> unlikely to get properly tagged thereafter.
> Also, I haven't tried the tools, but if they work on PDFs marked as copy
> and paste disallowed, I would have concerns that they may violate the
> DMCA, and the equivalent UK, etc., copyright law provisions.
> Accessibility interfaces tend to get some dispensation from copy
> protection schemes on the understanding that they are only used to
> create transient versions for the end user, not to extract the text into
> a revisable form.
> --
> David Woolley
> Emails are not formal business letters, whatever businesses may want.
> RFC1855 says there should be an address here, but, in a world of spam,
> that is no longer good advice, as archive address hiding may not work.
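
As an illustration of the word-space heuristic David Woolley describes in
the quoted message (characters emitted one at a time, each with its own
move, and no space characters), here is a minimal sketch. It is not taken
from any real PDF tool. The Glyph record, the group_into_words name, and
the gap_ratio threshold are all assumptions made for the example, and it
only handles a single line of left-to-right text.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Glyph:
    """One individually positioned character (hypothetical record)."""
    char: str
    x: float      # left edge of the glyph, in page units
    width: float  # advance width of the glyph, in page units


def group_into_words(glyphs, gap_ratio=0.3):
    """Split one line of positioned glyphs into words.

    A new word starts whenever the horizontal gap between the right edge
    of the previous glyph and the left edge of the next one is larger
    than gap_ratio times the median glyph width. The 0.3 default is an
    assumed value, not a figure from any specification.
    """
    if not glyphs:
        return []
    ordered = sorted(glyphs, key=lambda g: g.x)
    threshold = gap_ratio * median(g.width for g in ordered)
    words = [ordered[0].char]
    prev = ordered[0]
    for g in ordered[1:]:
        gap = g.x - (prev.x + prev.width)
        if gap > threshold:
            words.append(g.char)   # gap is wide enough: start a new word
        else:
            words[-1] += g.char    # small gap: same word (kerning or micro-spacing)
        prev = g
    return words


# Four glyphs with no space character between them, only position moves.
line = [
    Glyph("t", 0.0, 5.0),
    Glyph("o", 5.2, 5.0),
    Glyph("b", 13.0, 5.0),
    Glyph("e", 18.1, 5.0),
]
print(group_into_words(line))
```

A heuristic like this recovers word boundaries only. It says nothing about
headings, lists, reading order, or whether italics mean citation or
emphasis, which is the part of tagging the quoted message argues still
needs a human.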
Received on Saturday, 2 March 2013 20:37:43 UTC
