Re: Accessible PDF Repair from Ramón Corominas on 2013-03-03 (w3c-wai-ig@w3.org from January to March 2013)

From: Ramón Corominas <listas@ramoncorominas.com>
Date: Sun, 03 Mar 2013 13:52:45 +0100
To: Ian Sharpe <isforums@manx.net>
CC: w3c-wai-ig@w3.org
Message-ID: <5133479D.8040106@ramoncorominas.com>

Hi, Ian and all.

Indeed, Adobe Acrobat Professional can perform OCR, and there is also an 
option for automated tagging, which I guess is based on the features you 
mentioned (font size, style and so on).

Unfortunately, OCR does not always work, depending on the type of 
document (for example, if there are watermarks, lines, tables, form 
controls...), and the automated tagging is far from perfect. Sometimes 
it creates headings that should not be headings, or marks real headings 
as normal paragraphs. Tables are sometimes properly tagged, but not 
always; links are not detected (and indeed cannot be detected if you 
only have the text but not the URL); and in my experience form controls 
have to be tagged manually.

Cheers,
Ramón.

Ian said:

> Apologies if I've misunderstood the use case here but I was talking about
> the use of OCR technology to produce an accessible version of any PDF
> document. This could be in the form of a tagged PDF version of the same
> document or any other format for that matter. 
> 
> My conjecture is that if software exists to recognise faces in complex
> images of varying degrees of quality, surely it is possible to determine the
> structure of a document based on visual presentation such as indentation,
> font size, font weight, font style etc. 
> 
> After all, the visual presentation is designed to convey meaning to the
> reader. I'm not saying there aren't situations when an automated approach
> would get it wrong, although I'm struggling to think of an example myself. 
> 
> I'm not exactly sure what you mean by a magazine article which flows onto a
> non-adjacent page, but this is probably because I have never been able
> to read them. Why would an article be continued on a non-adjacent page?
> 
> Whatever the case, I suspect that for the most part, an automated approach
> should be able to produce a reasonable representation of an image which
> surely is better than not having access to the document at all. It would
> also mean that thousands of documents could be processed very quickly saving
> time and cost during archiving for example.
> 
> I'm certainly not disputing the complexity of this problem, nor the fact
> that there may not be a solution. I'm just genuinely surprised given my
> experience that a reasonable solution doesn't exist and am curious to
> understand why.

Received on Sunday, 3 March 2013 12:53:13 UTC