- From: David Woolley <forums@david-woolley.me.uk>
- Date: Sat, 02 Mar 2013 20:50:18 +0000
- To: w3c-wai-ig@w3.org
Ian Sharpe wrote:
> I'm no expert in PDF accessibility, tagging etc. But having worked on facial
> image recognition software over 15 years ago now and loosely followed
> progress in this area, I am really surprised that current OCR technology
> couldn't make at least a decent stab at automating the tagging process of
> scanned documents.

I'm not sure that we are really talking about scanned documents, although there are scanned documents in PDF that don't have an OCR underlay, especially when people are trying to avoid the cost of the Adobe tools.

The problems I see are in recovering things like heading levels and block quotes, correctly identifying list levels, etc. A particular problem with some documents will be that they have been composed presentationally, and the styling may not be consistent enough to allow an automated tool to correctly reverse engineer it without deep understanding of the content. Another risk area is false positives, for things like identifying page headings.

I used the cite/emphasis distinction as an example, and I'm going by the translation abilities of things like Babelfish and Google Translate to indicate that tools don't have the semantic understanding to distinguish between those. (In fact, my understanding is that Google Translate really has no deep understanding and works on statistical patterns.)

Even with things like reflowability, I am sure that automated tools will make wrong decisions. The extreme case would be detecting poetry, and avoiding reflowing it, when its lines happen to be nearly full. Particularly difficult things would be magazine articles, where the tail of the article is on a non-adjacent page.

-- 
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.
Received on Saturday, 2 March 2013 20:50:45 UTC