- From: Ian Sharpe <isforums@manx.net>
- Date: Sat, 2 Mar 2013 23:11:39 -0000
- To: "'David Woolley'" <forums@david-woolley.me.uk>, <w3c-wai-ig@w3.org>
Apologies if I've misunderstood the use case here, but I was talking about the use of OCR technology to produce an accessible version of any PDF document. This could be in the form of a tagged PDF version of the same document, or any other format for that matter. My conjecture is that if software exists to recognise faces in complex images of varying quality, surely it is possible to determine the structure of a document from its visual presentation: indentation, font size, font weight, font style and so on (a rough sketch of such a heuristic follows the quoted message below). After all, the visual presentation is designed to convey meaning to the reader.

I'm not saying there aren't situations where an automated approach would get it wrong, although I'm struggling to think of an example myself. I'm not exactly sure what you mean by a magazine article which flows onto a non-adjacent page, though this is probably because I have never been able to read them. Why would an article be continued on a non-adjacent page?

Whatever the case, I suspect that for the most part an automated approach should be able to produce a reasonable representation of a scanned document, which is surely better than having no access to the document at all. It would also mean that thousands of documents could be processed very quickly, saving time and cost during archiving, for example.

I'm certainly not disputing the complexity of this problem, nor the possibility that no solution exists. I'm just genuinely surprised, given my experience, that a reasonable solution doesn't exist, and I am curious to understand why.

Cheers
Ian

Obviously things like alternative text for images would still need to be added manually for the time being, but I doubt this is a serious problem in the majority of cases.

-----Original Message-----
From: David Woolley [mailto:forums@david-woolley.me.uk]
Sent: 02 March 2013 20:50
To: w3c-wai-ig@w3.org
Subject: Re: Accessible PDF Repair

Ian Sharpe wrote:
> I'm no expert in PDF accessibility, tagging etc. But having worked on
> facial image recognition software over 15 years ago now and loosely
> followed progress in this area, I am really surprised that current OCR
> technology couldn't make at least a decent stab at automating the
> tagging process of scanned documents.

I'm not sure that we are really talking about scanned documents, although there are scanned documents in PDF that don't have an OCR underlay, especially when people are trying to avoid the cost of the Adobe tools.

The problems I see are in recovering things like heading levels, block quotes, and list levels. A particular problem with some documents will be that they have been composed presentationally, and the styling may not be consistent enough to allow an automated tool to reverse engineer it correctly without a deep understanding of the content. Another risk area is false positives, for things like identifying page headings.

I used the cite/emphasis distinction as an example, and I'm going by the translation abilities of things like Babelfish and Google Translate to indicate that tools don't have the semantic understanding to distinguish between those. (In fact, my understanding is that Google Translate has no deep understanding at all and works on statistical patterns.)

Even with things like reflowability, I am sure that automated tools will make wrong decisions. The extreme case would be detecting and avoiding reflowing poetry when lines happened to be near full. Particularly difficult would be magazine articles where the tail of the article is on a non-adjacent page.
--
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.
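[To make the heading-level heuristic discussed above concrete, here is a minimal sketch in Python. It assumes an OCR layer has already produced text runs paired with their font sizes; the runs list and tag_headings function are hypothetical illustrations, not any existing tool's API. It also shows the failure mode David describes: inconsistent styling produces spurious heading levels.]

# Minimal sketch: infer heading levels from font size alone.
# `runs` is a hypothetical list of (text, font_size) pairs that an
# OCR layer might produce; nothing here is a real library API.

def tag_headings(runs, body_size):
    # Distinct sizes larger than body text, biggest first: the
    # largest size becomes h1, the next largest h2, and so on.
    heading_sizes = sorted({size for _, size in runs if size > body_size},
                           reverse=True)
    tagged = []
    for text, size in runs:
        if size > body_size:
            level = heading_sizes.index(size) + 1
            tagged.append(("h%d" % level, text))
        else:
            tagged.append(("p", text))
    return tagged

runs = [("Title", 24), ("Introduction", 14), ("Body text.", 10),
        ("Next section", 15), ("More body text.", 10)]
print(tag_headings(runs, body_size=10))
# [('h1', 'Title'), ('h3', 'Introduction'), ('p', 'Body text.'),
#  ('h2', 'Next section'), ('p', 'More body text.')]

[Note the output: an author who drifted between 14pt and 15pt section headings gets three heading levels instead of two, which is the "composed presentationally, styling not consistent enough" problem in miniature.]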
Received on Saturday, 2 March 2013 23:12:16 UTC