RE: Accessible PDF Repair

Apologies if I've misunderstood the use case here, but I was talking about
the use of OCR technology to produce an accessible version of any PDF
document. This could be in the form of a tagged PDF version of the same
document, or any other format for that matter.

My conjecture is that if software exists to recognise faces in complex
images of varying quality, it should surely be possible to determine the
structure of a document from its visual presentation: indentation, font
size, font weight, font style and so on.

After all, the visual presentation is designed to convey meaning to the
reader. I'm not saying there aren't situations where an automated approach
would get it wrong, although I'm struggling to think of an example myself.
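
To give a flavour of the kind of heuristic I have in mind, here is a rough
sketch in Python. It is purely illustrative: the Block record and its
metrics are made-up stand-ins for whatever a real OCR engine would actually
report, and the thresholds are guesses on my part.

from dataclasses import dataclass

@dataclass
class Block:
    text: str
    font_size: float  # point size reported by the OCR pass
    bold: bool        # whether the dominant font weight is bold
    indent: float     # left indent in points

def guess_tag(block: Block, body_size: float) -> str:
    """Map visual presentation to a structural tag, purely heuristically."""
    ratio = block.font_size / body_size
    if ratio >= 1.5:
        return "h1"
    if ratio >= 1.2 or (block.bold and ratio > 1.0):
        return "h2"
    if block.indent > 36:  # deeply indented text: perhaps a block quote
        return "blockquote"
    return "p"

blocks = [
    Block("Annual Report", 24.0, True, 0.0),
    Block("Introduction", 14.0, True, 0.0),
    Block("This year we...", 11.0, False, 0.0),
]
body_size = 11.0  # assume the most common font size is the body text size
for b in blocks:
    print(guess_tag(b, body_size), b.text)

A real tool would of course have to infer those thresholds from each
document rather than hard-code them, which is where inconsistent styling
would bite.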

I'm not exactly sure what you mean by a magazine article which flows onto a
non-adjacent page, though this is probably because I have never been able
to read them. Why would an article be continued on a non-adjacent page?

Whatever the case, I suspect that, for the most part, an automated approach
should be able to produce a reasonable representation of a scanned image,
which is surely better than not having access to the document at all. It
would also mean that thousands of documents could be processed very
quickly, saving time and cost during archiving, for example.

I'm certainly not disputing the complexity of this problem, nor the
possibility that no solution exists. I'm just genuinely surprised, given my
experience, that a reasonable solution doesn't exist, and I'm curious to
understand why.

Obviously, things like alternative text for images would still need to be
added manually for the time being, but I doubt this is a serious problem in
the majority of cases.

Cheers
Ian

-----Original Message-----
From: David Woolley [mailto:forums@david-woolley.me.uk] 
Sent: 02 March 2013 20:50
To: w3c-wai-ig@w3.org
Subject: Re: Accessible PDF Repair

Ian Sharpe wrote:
> I'm no expert in PDF accessibility, tagging etc. But having worked on 
> facial image recognition software over 15 years ago now and loosely 
> followed progress in this area, I am really surprised that current OCR 
> technology couldn't make at least a decent stab at automating the 
> tagging process of scanned documents.

I'm not sure that we are really talking about scanned documents, although
there are scanned documents in PDF that don't have an OCR underlay,
especially when people are trying to avoid the cost of the Adobe tools.

The problems I see are in recovering things like heading levels, block
quotes, correctly identifying list levels, etc.  A particular problem with
some documents will be that they have been composed presentationally, and
the styling may not be consistent enough to allow an automated tool to
correctly reverse engineer it without deep understanding of the content.

Another risk area is false positives, for things like identifying page
headings.

I used the cite/emphasis distinction as an example, and I'm going by the
translation abilities of things like Babelfish and Google Translate to
indicate that tools don't have the semantic understanding to distinguish
between those. (In fact, my understanding is that Google Translate has no
deep understanding at all and works on statistical patterns.)

Even with things like reflowability, I am sure that automated tools will
make wrong decisions.  The extreme case would be detecting and avoiding
reflowing poetry if lines happened to be near full.

Particularly difficult things would be magazine articles, where the tail of
the article is on a non-adjacent page.

-- 
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.
