
Re: Accessible PDF Repair

From: <accessys@smart.net>
Date: Sat, 2 Mar 2013 16:39:49 -0500 (EST)
To: David Woolley <forums@david-woolley.me.uk>
cc: w3c-wai-ig@w3.org
Message-ID: <Pine.LNX.4.60.1303021633130.6546@cygnus.smart.net>

As an interpreter for a language, I understand the problems with
interpreting or translating something from one format to another. And
if it is a scanned historic document it gets even harder, as the grammar or
spellings may be different.
  I believe that it could be automated to a point, but I cannot see
current technology being able to automate conversions if the work
was not originally written or marked up for that purpose. There would
always need to be a little jellyware (human brain) added to the mix to
make it successful.  Now, automation could certainly speed up the process
and maybe eliminate some of the repetitive actions, but it would still
need the person.

I wish I could believe the technology is ready, but it is still a work in
progress, so we have to preach to all that they have to write up
documents properly.  Sometimes one or two extra keystrokes by the author
could save days of work for the person trying to make it accessible.

It is especially useful if one can get the document in some computer
format that could be converted, but this is often not the case.

Bob


On Sat, 2 Mar 2013, David Woolley wrote:

> Date: Sat, 02 Mar 2013 20:50:18 +0000
> From: David Woolley <forums@david-woolley.me.uk>
> To: w3c-wai-ig@w3.org
> Subject: Re: Accessible PDF Repair
> Resent-Date: Sat, 02 Mar 2013 20:50:46 +0000
> Resent-From: w3c-wai-ig@w3.org
> 
> Ian Sharpe wrote:
>> I'm no expert in PDF accessibility, tagging etc. But having worked on
>> facial image recognition software over 15 years ago now and loosely followed
>> progress in this area, I am really surprised that current OCR technology
>> couldn't make at least a decent stab at automating the tagging process of
>> scanned documents.
>
> I'm not sure that we are really talking about scanned documents, although 
> there are scanned documents in PDF that don't have an OCR underlay, 
> especially when people are trying to avoid the cost of the Adobe tools.
>
> The problems I see are in recovering things like heading levels, block 
> quotes, correctly identifying list levels, etc.  A particular problem with 
> some documents will be that they have been composed presentationally, and the 
> styling may not be consistent enough to allow an automated tool to correctly 
> reverse engineer it without deep understanding of the content.
>
> Another risk area is false positives, for things like identifying page 
> headings.
>
> I used the cite/emphasis distinction as an example and I'm going by the 
> translation abilities of things like Babelfish and Google Translate to 
> indicate that tools don't have the semantic understanding to distinguish 
> between those. (In fact, my understanding is that Google Translate really has 
> no deep understanding and works on statistical patterns.)
>
> Even with things like reflowability, I am sure that automated tools will make 
> wrong decisions.  The extreme case would be detecting and avoiding reflowing 
> poetry if lines happened to be near full.
>
> Particularly difficult things would be magazine articles, where the tail of 
> the article is on a non-adjacent page.
>
>
>
>
> -- 
> David Woolley
> Emails are not formal business letters, whatever businesses may want.
> RFC1855 says there should be an address here, but, in a world of spam,
> that is no longer good advice, as archive address hiding may not work.
>
Received on Saturday, 2 March 2013 21:40:29 GMT
