PDF accessibility and complex script languages.

Hi all,

I have been working through http://www.w3.org/TR/WCAG20-TECHS/pdf.html and
other resources discussing PDF accessibility. I have also been looking
through the various ISO and Adobe documentation and testing PDF generation
in various ways.

My focus is on text content in languages the state and federal governments
in Australia are likely to publish information in.

I will start by making a statement, which I hope is wrong, but after
testing I fear may be correct. Feel free to correct me and point me in the
right direction.

As far as I can tell, the PDF format is fundamentally based on a glyph
encoding model rather than a character encoding model. There are mechanisms
within the PDF specs, i.e. /ToUnicode mappings, that allow SOME glyphs to be
mapped to Unicode code points. For simple (non-complex) scripts with
well-developed fonts, most, if not all, glyphs would be mapped within the
font's cmap table, allowing the PDF generator to create an appropriate
ToUnicode mapping.
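
As a first check, it is at least easy to see whether the fonts in a PDF carry
a /ToUnicode CMap at all. A rough sketch of the kind of test I have been
running (using the pikepdf library; the file name is just a placeholder):

    import pikepdf  # pip install pikepdf

    with pikepdf.open("burmese-report.pdf") as pdf:
        for obj in pdf.objects:
            # Report every font dictionary and whether it has a /ToUnicode CMap.
            if isinstance(obj, pikepdf.Dictionary) and str(obj.get("/Type", "")) == "/Font":
                print(str(obj.get("/BaseFont", "?")),
                      "subtype:", str(obj.get("/Subtype", "?")),
                      "has /ToUnicode:", "/ToUnicode" in obj)

A missing /ToUnicode entry is an immediate red flag, although as discussed
below, even a present one is often inadequate for complex scripts.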

Most of the documentation I can access on ToUnicode mappings in PDF files
relates primarily to CID-keyed fonts rather than OpenType fonts.

I assume that for OpenType fonts the font's cmap table is used as the basis
for generating the ToUnicode mapping. If so, that would explain the inability
of PDF generators to correctly map all glyphs to Unicode code points.

My understanding of the OpenType cmap table is that only some glyphs are
mapped to Unicode code points. Glyphs produced only by substitution (ligature,
conjunct and medial forms, for example) are not present in the cmap, i.e. a
few or many glyphs may not be resolvable to Unicode code points.
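
This is easy to confirm with a font inspection library. The sketch below
(using fontTools; the font file name is just a placeholder) counts how many
glyphs in a Myanmar font are not directly reachable from any code point via
the cmap:

    from fontTools.ttLib import TTFont  # pip install fonttools

    font = TTFont("NotoSansMyanmar-Regular.ttf")
    all_glyphs = set(font.getGlyphOrder())
    # Glyphs directly mapped from a Unicode code point in the best cmap subtable.
    cmap_glyphs = set(font.getBestCmap().values())

    unmapped = all_glyphs - cmap_glyphs
    print(len(unmapped), "of", len(all_glyphs), "glyphs have no direct cmap entry")

The unmapped glyphs tend to be exactly the shaped forms a complex script
needs, which is why a cmap-based ToUnicode mapping falls short.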

It is also my understanding that PDFs will always have problems with glyphs
generated through pre-base substitution features: even if the glyph can be
resolved to a specific code point, its position in the data will be incorrect.
For example, pre-base vowels like the dependent vowel E in Myanmar script will
be stored before the consonant in a PDF file rather than after the consonant
and any medial, as Unicode logical order requires.
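
To illustrate the simplest case, the pre-base vowel sign E (U+1031): a real
reordering step would also have to handle medials, stacked consonants, kinzi
and so on, so this tiny sketch only shows the principle:

    import re
    import unicodedata

    # Logical (Unicode) order: MA (U+1019) followed by VOWEL SIGN E (U+1031).
    logical = "\u1019\u1031"
    # Visual order, as typically extracted from a PDF: the pre-base E comes first.
    visual = "\u1031\u1019"

    print([unicodedata.name(c) for c in visual])

    # Naive fix-up: move a pre-base U+1031 back after the consonant that follows it.
    reordered = re.sub("\u1031([\u1000-\u1021])", "\\g<1>\u1031", visual)
    print(reordered == logical)  # True for this two-character example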

This leaves the possibility of ActualText. The most common use of
ActualText I have seen is the generation and embedding of a text layer into
a scanned PDF via OCR.

In theory, with access to the original document, it would be possible to add
ActualText to each tag, so that the PDF would contain both the glyph-based
text content and a Unicode ActualText field for each element.
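
It is at least straightforward to inspect what ActualText a tagged PDF already
carries. A sketch of a structure tree walk (again using pikepdf; the file name
is a placeholder):

    import pikepdf  # pip install pikepdf

    def report_actual_text(node, depth=0):
        # Recursively walk the structure tree, printing any /ActualText entries.
        if isinstance(node, pikepdf.Dictionary):
            if "/ActualText" in node:
                print("  " * depth + str(node.get("/S", "?")), "->", str(node["/ActualText"]))
            if "/K" in node:
                report_actual_text(node["/K"], depth + 1)
        elif isinstance(node, pikepdf.Array):
            for kid in node:
                report_actual_text(kid, depth)
        # Integer kids are marked-content IDs and carry no ActualText of their own.

    with pikepdf.open("tagged-translation.pdf") as pdf:
        root = pdf.Root.get("/StructTreeRoot")
        if root is None:
            print("No structure tree: this PDF is not tagged.")
        else:
            report_actual_text(root.get("/K", pikepdf.Array()))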

Assuming a PDF has both text and ActualText, which would be used by indexing,
searching and accessibility software? Are there any software tools that would
use ActualText in preference to the text in the PDF?

If not, it would seem that PDFs cannot be made accessible except in specific
languages with well-developed fonts.

To date, even when I use all the techniques in
http://www.w3.org/TR/WCAG20-TECHS/pdf.html, I cannot seem to create accessible
documents in some languages, such as Burmese. The closest I can get involves
avoiding font subsetting, embedding the complete font, and then parsing the
extracted text, doing some reordering, normalisation and conversion before
passing the modified text to an indexer or screen reader, etc. Even that
assumes an ideal case which would rarely hold true in practice.
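
For completeness, the post-processing I mean looks roughly like the sketch
below (using pdfminer.six for extraction; the file name is a placeholder and
the reordering rule covers only the single U+1031 case from the earlier
example):

    import re
    import unicodedata
    from pdfminer.high_level import extract_text  # pip install pdfminer.six

    def fix_extracted_burmese(text):
        # Normalise, then move a pre-base U+1031 back after its consonant.
        # A real pipeline needs full Myanmar reordering, not just this one rule.
        text = unicodedata.normalize("NFC", text)
        return re.sub("\u1031([\u1000-\u1021])", "\\g<1>\u1031", text)

    raw = extract_text("burmese-report.pdf")
    clean = fix_extracted_burmese(raw)
    # "clean" is what would be handed to an indexer or a screen reader.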

Are my assumptions valid? Have I overlooked something? Are there tools out
there that can give better results than I have been able to achieve with
what I have at hand?

The reason for these questions: I am currently reviewing some guidelines on
translated web content on state government sites. Currently most translations
are deployed as PDF files, but at the moment I can't see a way to ensure that
all PDF files are searchable, indexable and accessible.

I am leaning towards guidelines that require the primary format for translated
government information to be HTML, and that avoid PDFs except for printing.

Would this make sense? Or am I off track here?

Andrew

Andrew Cunningham
andj.cunningham@gmail.com
