Re: PDF accessibility and complex script languages. from Andrew Cunningham on 2016-01-07 (w3c-wai-ig@w3.org from January to March 2016)

From: Andrew Cunningham <andj.cunningham@gmail.com>
Date: Thu, 7 Jan 2016 12:57:07 +1100
To: Duff Johnson <duff@duff-johnson.com>
Cc: "w3c-wai-ig@w3.org" <w3c-wai-ig@w3.org>
Message-ID: <CAOUP6Kmb-bmT7BDxY+UMj0wbFhbqQ+QQ+5MxNGrLsOwb_=JShw@mail.gmail.com>

Hi Duff,

The scenario I was discussing was not the use of ActualText for images, but
its use for text in those languages and writing scripts that are poorly
supported by the PDF character model.

It is important to realise that PDF uses a glyph-based model more akin to
pseudo-Unicode font solutions than to Unicode font solutions. OpenType
features that do not modify cmap entries, and reordered glyph sequences, are
particularly problematic.
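
To illustrate the reordering problem, here is a minimal Python sketch. It
assumes a very rough stand-in for a shaping engine (the hypothetical
visual_order function below) that handles only a consonant followed by one
pre-base sign; real shapers do far more. It shows why text recovered from
glyph order can differ from the Unicode logical order.

```python
# Signs that are encoded after the consonant but drawn before it.
# U+1031 MYANMAR VOWEL SIGN E, U+103C MYANMAR CONSONANT SIGN MEDIAL RA
PREBASE = {"\u1031", "\u103C"}


def visual_order(syllable: str) -> str:
    """Hypothetical, simplified reordering: move pre-base signs in front
    of the base consonant, as a shaper would when placing glyphs."""
    base, rest = syllable[0], syllable[1:]
    pre = [c for c in rest if c in PREBASE]
    post = [c for c in rest if c not in PREBASE]
    return "".join(pre) + base + "".join(post)


def codepoints(text: str) -> str:
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


logical = "\u1000\u1031"          # KA + VOWEL SIGN E, as stored in Unicode
glyph_order = visual_order(logical)

print(codepoints(logical))        # U+1000 U+1031
print(codepoints(glyph_order))    # U+1031 U+1000
```

A consumer that maps each glyph back to a code point in drawing order would
recover the second sequence, which is not the text that was typed.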

When I get back to the office I will create sample PDF files with Burmese
syllables in Unicode, using a selection of OpenType fonts that use the
mymr and mym2 OpenType script codes.

So far my tests have been with mymr style fonts, but I will also test the
newer mym2 fonts from Microsoft and Google. mymr is problematic, but uses
a legacy approach, relying on the rlig, clig, and liga features.

mym2 will be interesting to test. mym2 is the way Myanmar fonts should be
developed and implemented in OpenType, while mymr fonts were hacks working
within the restrictions of the DFLT script in rendering engines.

I suspect that PDF files will have greater problems with mym2-based fonts,
but I need to test it.

But to recap: my concerns relate to the accessibility of text in PDFs
written in languages that use complex scripts. ActualText seems to be the
only way to get meaningful Unicode into the PDF. But if, as Duff indicates,
the actual use of ActualText is at the discretion of implementers, then I
think we have an accessibility issue that PDF/UA inadequately addresses.
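
For reference, a minimal Python sketch of what attaching ActualText to a run
of glyphs can look like in a content stream. The helper names and the glyph
IDs in the Tj operand are hypothetical; the sketch assumes the ActualText
value is written as a hex string in UTF-16BE with a leading FEFF byte order
mark, inside a Span marked-content sequence.

```python
def actual_text_hex(text: str) -> str:
    """Encode text as a PDF hex string: UTF-16BE preceded by the FEFF BOM."""
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


def span_with_actual_text(text: str, glyph_ops: str) -> str:
    """Wrap content-stream operators in a Span that carries ActualText."""
    return f"/Span <</ActualText {actual_text_hex(text)}>> BDC\n{glyph_ops}\nEMC"


# "<00210005> Tj" is a made-up glyph-ID string, for illustration only.
print(span_with_actual_text("\u1000\u1031", "<00210005> Tj"))
```

Whether a consumer reads the ActualText value or the glyphs inside the Span
is exactly the implementation choice in question here.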

The reality, I suspect, is that any PDF in certain languages, as things
currently stand, cannot be guaranteed to be accessible even if all other
WCAG requirements are met, since the most fundamental issue, the text
itself, is in question.

Andrew

On Thursday, 7 January 2016, Duff Johnson <duff@duff-johnson.com> wrote:
> Hi Andrew,
>
>> The results were
>>
>> 2) exporting as text file - generated text file used the visible text in
>> PDF, it did not use the contents of the ActualText tags
>>
>> 3) cutting and pasting - pasted text was based on the visible text in
>> PDF, it did not use the contents of the ActualText tags
>
> Do consider that these are very distinct functions, and that consuming
> implementations are within their rights to ignore ActualText if it’s not
> appropriate to the user’s needs.
>
> For example, when exporting a document to HTML it may or may not be
> appropriate to replace images with ActualText. Maybe the images themselves
> should be exported… (I am leaving aside the question of how to represent
> ActualText in HTML… that’s for another day…)
>
> On the other hand, when a search-engine consumes PDF, ActualText should
> *always* be used, otherwise there’s nothing to index… :-)
>
> Duff.
>
>
>

Received on Thursday, 7 January 2016 01:57:37 UTC