Re: Accessible PDF Repair from David Woolley on 2013-03-02 (w3c-wai-ig@w3.org from January to March 2013)

From: David Woolley <forums@david-woolley.me.uk>
Date: Sat, 02 Mar 2013 20:50:18 +0000
To: w3c-wai-ig@w3.org
Message-ID: <5132660A.8020108@david-woolley.me.uk>

Ian Sharpe wrote:
> I'm no expert in PDF accessibility, tagging etc. But having worked on facial
> image recognition software over 15 years ago now and loosely followed
> progress in this area, I am really surprised that current OCR technology
> couldn't make at least a decent stab at automating the tagging process of
> scanned documents.

I'm not sure that we are really talking about scanned documents, 
although there are scanned documents in PDF that don't have an OCR 
underlay, especially when people are trying to avoid the cost of the 
Adobe tools.

The problems I see are in recovering things like heading levels, block 
quotes, correctly identifying list levels, etc.  A particular problem 
with some documents will be that they have been composed 
presentationally, and the styling may not be consistent enough to allow 
an automated tool to correctly reverse engineer it without deep 
understanding of the content.
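
As a rough illustration of the inconsistency problem, here is a minimal
sketch of the kind of font-size heuristic an automated tagger might use.
Everything in it is an assumption for illustration: the function name
guess_heading_levels, the (text, font_size) input shape, and the sample
data are hypothetical and not taken from any real tool.

```python
from collections import Counter


def guess_heading_levels(lines):
    """Guess a tag for each (text, font_size) pair from font size alone."""
    # Assume the most frequently used size is body text.
    body_size = Counter(size for _, size in lines).most_common(1)[0][0]
    # Treat every distinct larger size as its own heading level, largest first.
    heading_sizes = sorted(
        {size for _, size in lines if size > body_size}, reverse=True
    )
    level_of = {size: i + 1 for i, size in enumerate(heading_sizes)}
    return [
        (text, f"H{level_of[size]}" if size in level_of else "P")
        for text, size in lines
    ]


# Hypothetical document, styled by hand rather than with named styles.
sample = [
    ("Annual Report", 24.0),
    ("Introduction", 18.0),
    ("Body text ...", 11.0),
    ("More body text ...", 11.0),
    ("Methods", 17.5),  # same logical level as "Introduction"
    ("Further body text ...", 11.0),
    ("A pull quote set large for effect", 18.0),
]

result = guess_heading_levels(sample)
for text, tag in result:
    print(tag, text)
```

With this input, "Methods" comes out one level below "Introduction"
because its size differs by half a point, and the pull quote is tagged as
a heading.  Adding a tolerance would hide the first problem in this
sample, but picking the tolerance, and telling the pull quote from a real
heading, still needs some understanding of the content.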

Another risk area is false positives, for things like identifying page 
headings.

I used the cite/emphasis distinction as an example, and I'm going by the 
translation abilities of things like Babelfish and Google Translate to 
indicate that tools don't have the semantic understanding to distinguish 
between those.  (In fact, my understanding is that Google Translate 
really has no deep understanding and works on statistical patterns.)

Even with things like reflowability, I am sure that automated tools will 
make wrong decisions.  The extreme case would be recognising poetry, and 
not reflowing it, when its lines happen to be nearly full width.
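
To make the poetry case concrete, here is a minimal sketch of a
line-length reflow rule, assuming a tool that only sees extracted text
lines.  The function name reflow, the page_width and fill_ratio
parameters, and the sample verse are all hypothetical and chosen only
for illustration.

```python
def reflow(lines, page_width=60, fill_ratio=0.85):
    """Join each line to the next one when it fills most of the page width."""
    paragraphs, current = [], []
    for line in lines:
        text = line.strip()
        if not text:
            # A blank line always ends the current paragraph.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(text)
        if len(text) < page_width * fill_ratio:
            # A short line is assumed to be the end of a paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# Two lines of made-up verse that happen to be nearly full width.
poem = [
    "The long grey evening settles on the silent harbour wall,",
    "and every boat at anchor turns to face the fading light.",
]

merged = reflow(poem)
print(merged)
```

Because both lines are longer than 85% of the assumed page width, the
rule treats them as wrapped prose and merges them into one paragraph,
which destroys the line breaks that matter in verse.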

Particularly difficult things would be magazine articles, where the tail 
of the article is on a non-adjacent page.

-- 
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.

Received on Saturday, 2 March 2013 20:50:45 UTC