Re: PDF accessibility and complex script languages.

Hi Andrew,

Thanks for your response.

On 5 January 2016 at 12:41, Andrew Kirkpatrick <akirkpat@adobe.com> wrote:

> Andrew,
> I asked a couple of our experts on PDF and fonts (Leonard and Matthew) and
> have the following info that I hope will help:
>

Thanks, greatly appreciated.

It clarifies a few things ... for complex script languages, if there is no
ActualText then the PDF is unlikely to be accessible.


>
> It is true that the instructions in the PDF content stream are not
> (necessarily) directly connected to Unicode code points.  In most cases, if
> the PDF processor is attempting to extract Unicode values from text, then
> the values in the content stream would be mapped to Unicode as per clause
> 9.10 in ISO 32000-1:2008.  The methods described there - including using
> standard encodings, ToUnicode tables (if present), cmap resources for
> embedded fonts and ActualText entries – apply to any/all font types that
> can be present in a PDF (including both CFF and TTF-based OpenType fonts).
>   As noted, details are in the standard.
>

I will look at that section in more detail.
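
In the meantime, a quick way to see which of those clause 9.10 mapping mechanisms a given file actually provides is to dump the font resources. A rough sketch, assuming pikepdf (the file name is made up, and it assumes every page has a /Resources dictionary):

    import pikepdf

    # List which mapping resources each font declares (Encoding, ToUnicode).
    pdf = pikepdf.open("burmese-sample.pdf")   # hypothetical test file
    for i, page in enumerate(pdf.pages, start=1):
        fonts = page.Resources.get("/Font", pikepdf.Dictionary())
        for name, font in fonts.items():
            print(f"page {i} {name}: Subtype={font.get('/Subtype')}",
                  f"Encoding={font.get('/Encoding')}",
                  f"ToUnicode={'yes' if '/ToUnicode' in font else 'no'}")

That at least shows whether a given file carries any ToUnicode data before we start testing individual viewers.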


>
> The ToUnicode map is available to all fonts.  Not only that, it allows
> mapping not just from a glyph index to a Unicode character, but to any
> number of characters.  As a simple example, if you take an fi ligature
> character, that would be a single glyph, but would map to two Unicode code
> points.  Basically, there’s no limitation in PDF, as long as you embed the
> correct information, you can map any set of glyphs deterministically to
> Unicode.
>
>
The one-glyph-to-multiple-characters mapping works well with ligatures and
variation selectors, and seems to be one of the primary uses of that feature.
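
Those one-to-many entries are visible if you dump a font's ToUnicode CMap and look at the beginbfchar/beginbfrange sections. A minimal sketch, again assuming pikepdf (the file name is made up, and it assumes the first page declares /Font resources):

    import pikepdf

    pdf = pikepdf.open("sample.pdf")   # hypothetical file
    for name, font in pdf.pages[0].Resources.Font.items():
        if "/ToUnicode" in font:
            print(name)
            # look for beginbfchar entries such as <01C4> <00660069>,
            # i.e. one glyph code mapping to U+0066 U+0069 ("fi")
            print(font.ToUnicode.read_bytes().decode("ascii", errors="replace"))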


> When the visual representation of the text does not match the semantic
> representation (be it via substitutions, reorderings, etc.) then the use of
> ActualText is the proper mechanism for addressing that situation (as
> described in clause 14.9.4 of ISO 32000-1:2008).   ActualText is NEVER
> EVER EVER used to represent the results of OCR – that would be a
> violation of the standard.
>
>
The cmap entries in a font will depend on the OT features being used;
variation selectors and ligatures (rlig and liga, possibly clig) would have
entries.

With complex scripts, though, it is standard for many glyphs in a font not to
be assigned Unicode values, and they should not have them. Reordering of
glyphs within clusters is also problematic.
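
How many glyphs fall outside the cmap is easy to see by inspecting a font directly; a small sketch with fontTools (the font file name is just an example of a complex script font):

    from fontTools.ttLib import TTFont

    font = TTFont("Padauk-Regular.ttf")    # hypothetical Burmese font file
    cmap = font["cmap"].getBestCmap()      # Unicode code point -> glyph name
    unmapped = set(font.getGlyphOrder()) - set(cmap.values())
    print(f"{len(unmapped)} of {len(font.getGlyphOrder())} glyphs "
          "have no Unicode code point of their own")

Those shaping-only glyphs are exactly the ones with nothing sensible to put in a ToUnicode table on their own.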

Are there any tools that allow you to export a ToUnicode mapping from the
PDF file, edit it manually, and then reintegrate it into the PDF file? I
suspect that manual massaging of ToUnicode will be necessary.
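
In case it helps to be concrete, this is the sort of round trip I have in mind, sketched with pikepdf (the file names and the /F1 font key are made up):

    import pikepdf

    pdf = pikepdf.open("report.pdf")                  # hypothetical input
    font = pdf.pages[0].Resources.Font["/F1"]         # "/F1" is a made-up key

    # 1) export the ToUnicode CMap for manual editing
    with open("F1-tounicode.cmap", "wb") as out:
        out.write(font.ToUnicode.read_bytes())

    # ... hand-edit the beginbfchar/beginbfrange entries ...

    # 2) reintegrate the edited CMap and save a new file
    with open("F1-tounicode.cmap", "rb") as src:
        font.ToUnicode = pdf.make_stream(src.read())
    pdf.save("report-edited.pdf")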



>
> As noted above, the standard is clear about how to do text extraction
> including the use of ActualText.   Any software that does not respect the
> presence of ActualText would be doing so in violation of the standard.  So
> AFAIK, all major PDF viewers – such as Adobe Acrobat/Reader, FoxIt (and the
> Chrome variant thereof) and Apple Preview all recognize and process
> ActualText.  I believe that the Google indexer also respects it.  Beyond
> that – you’d need to consult with the vendor/code to determine if it complies
> with the standard.
>
>
Thank you, I will test the complex script test files that we have with
ActualText.

What tools exist for adding or editing ActualText? For instance, if I were
working on an 80-page Sgaw Karen document, where ActualText would need to
be added to every tag, what tools exist to handle the editing? The tools I
have tried so far are cumbersome and extremely time-consuming. Have any of
the developers/vendors created an optimised workflow for working with
ActualText?

I can use Acrobat Pro to add ActualText, but its workflow would be painfully
slow if I were adding ActualText to hundreds or thousands of tags.
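
What I am really after is something scriptable. A rough sketch of the direction I mean, again with pikepdf (the file names, the one-line-per-tag text file and the tag selection are all made up, and it assumes the PDF is tagged so /StructTreeRoot exists):

    import pikepdf

    pdf = pikepdf.open("sgaw-karen.pdf")
    # one corrected Unicode string per tag, in reading order (hypothetical file)
    lines = iter(open("corrected-text.txt", encoding="utf-8").read().splitlines())

    def walk(elem):
        # structure elements are dictionaries; /K may be a child element,
        # an array of children, or a marked-content ID (an integer)
        if not isinstance(elem, pikepdf.Dictionary):
            return
        if str(elem.get("/S")) in ("/P", "/H1"):
            elem.ActualText = pikepdf.String(next(lines))
        kids = elem.get("/K")
        if isinstance(kids, pikepdf.Array):
            for kid in kids:
                walk(kid)
        elif kids is not None:
            walk(kids)

    walk(pdf.Root.StructTreeRoot)
    pdf.save("sgaw-karen-actualtext.pdf")

Getting the corrected strings into the right order per tag is obviously the hard part; my point is only that the write-back side should not have to be a dialog-per-tag workflow.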

Just ran a short test on a simple Burmese document: one heading and two
paragraphs of text. I added ActualText entries to the text for the H1 and
each P container.

I then (using Acrobat Pro DC 2015):
1) exported the document to MS Word format
2) exported the document to a text file
3) copied and pasted text from the PDF

The results were amusing to say the least.

1) The Word document (and, I assume, any formatted export) appears to use the
visible text in the PDF, so the exported content is the same as if there were
no ActualText layer.

2) The text export was initially even worse ... I just got a document full of
dots. I dug deeper and noticed an encoding option set to "Use mapping table
default". I changed that to UTF-8 and exported again ... this time I got a
working text file which contained the ActualText content.

3) The copy-and-paste results had me confused at first. When I copied and
pasted I got correct Unicode content (I assume from the ActualText), but
instead of a single occurrence I got multiple copies of the ActualText. I
think I understand why that is happening, but I am not exactly sure. I would
need to test it further.

I will run tests on other tools as well and see what results I get.
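
For the non-Acrobat tools I will probably script the comparison, along these lines (pdfminer.six and PyMuPDF are just two candidates; the file name is made up):

    from pdfminer.high_level import extract_text  # pdfminer.six
    import fitz                                   # PyMuPDF, as in the earlier sketch

    path = "burmese-sample.pdf"                   # hypothetical test file

    print("--- pdfminer.six ---")
    print(extract_text(path))

    print("--- PyMuPDF ---")
    with fitz.open(path) as doc:
        print("".join(page.get_text("text") for page in doc))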

Andrew

Received on Wednesday, 6 January 2016 13:48:37 UTC