Re: PDF accessibility and complex script languages.

From: Duff Johnson <duff@duff-johnson.com>
Date: Tue, 5 Jan 2016 16:34:36 -0500
Message-Id: <9CBE4DA1-CDB1-46EA-BB9A-D7777D90EC06@duff-johnson.com>
To: "w3c-wai-ig@w3.org" <w3c-wai-ig@w3.org>

> On Jan 5, 2016, at 10:22, Andrew Kirkpatrick <akirkpat@adobe.com> wrote:
> 
> If actual text can’t be used then I’d love to know where one should correct OCR’d text that was incorrectly identified by Adobe Acrobat during the OCR process but is not flagged as an OCR suspect/error.  Where might a user fix the OCR text?  Is there some contents key in the tag editor where this can be corrected?
>  
> Assuming that the OCR’d text has been embedded in the file as invisible text, then that text is what should be corrected.  Trying to override it using anything else is incorrect; it shouldn’t be done at a tagging level.
> AWK

Some OCR applications can output a mix of text and bitmaps where bitmaps are used to represent OCR suspects and other objects the OCR believes (incorrectly) aren’t textual.

ActualText has utility in this use-case, but (precisely as Andrew says) not with respect to the *recognized* (even if mangled) text, which should preferably be corrected instead.

Rather, ActualText may be useful because it’s how one may provide the “actual text” that happens to be represented as a bitmap due to the limitations of the OCR process. That is, it applies to text that fails effective recognition entirely, leaving no useful editable text to correct.

You can use Acrobat DC (or other similar editor) to add ActualText to structure elements. Use a <Span> structure element to enclose the bitmap object, then apply the ActualText property to that structure element.
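
For anyone curious what that looks like inside the file, here is a minimal Python sketch that prints the two pieces of PDF syntax involved: a marked-content sequence wrapping the bitmap in the page content stream, and a Span structure element whose ActualText entry carries the replacement text. The function names, the object references ("12 0 R", "5 0 R") and the image name "Im1" are hypothetical placeholders; a real file would also need the parent tree, the graphics-state operators that position the image, and so on. It only builds strings and does not edit a PDF.

```python
def pdf_text_string(text: str) -> str:
    """Encode text as a PDF hex string: UTF-16BE preceded by the FE FF byte-order mark."""
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


def span_struct_elem(actual_text: str, mcid: int, parent_ref: str, page_ref: str) -> str:
    """Return a Span structure element dictionary carrying ActualText.

    parent_ref and page_ref are indirect references such as "12 0 R";
    the values used in the example below are made up for illustration.
    """
    return (
        "<< /Type /StructElem /S /Span"
        f" /P {parent_ref} /Pg {page_ref} /K {mcid}"
        f" /ActualText {pdf_text_string(actual_text)} >>"
    )


def marked_image(mcid: int, image_name: str) -> str:
    """Return content-stream operators that wrap an image XObject in marked content.

    Simplified: the positioning operators around the Do operator are omitted.
    """
    return f"/Span << /MCID {mcid} >> BDC\n  /{image_name} Do\nEMC"


if __name__ == "__main__":
    # Page content: the unrecognized bitmap, tagged with marked-content ID 0.
    print(marked_image(0, "Im1"))
    # Structure tree: the Span that supplies the text the bitmap depicts.
    print(span_struct_elem("Hi", 0, "12 0 R", "5 0 R"))
```

The MCID is what ties the structure element to the bitmap, so the same number has to appear in both places.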

Duff.
Received on Tuesday, 5 January 2016 21:35:08 UTC
