- From: Andrew Cunningham <andj.cunningham@gmail.com>
- Date: Tue, 5 Jan 2016 04:50:15 +1100
- To: w3c-wai-ig@w3.org
- Message-ID: <CAOUP6K=sEBwjXKUqzfYkTBDVOiB0dvbXK3PrhgDjREKU0fowgQ@mail.gmail.com>
Hi all,

I have been working through http://www.w3.org/TR/WCAG20-TECHS/pdf.html and other resources discussing PDF accessibility. I have also been looking through the various ISO and Adobe documentation and testing PDF generation in various ways. My focus is on text content in the languages that state and federal governments in Australia are likely to publish information in.

I will start by making a statement, which I hope is wrong, but which after testing I fear may be correct. Feel free to correct me and point me in the right direction.

As far as I can tell, the PDF format is fundamentally based on a glyph encoding model rather than a character encoding model. There are mechanisms within the PDF specs, i.e. /ToUnicode mappings, to allow the mapping of SOME glyphs to Unicode code points. For simple (non-complex) scripts with well developed "simple" fonts, most, if not all, glyphs would be mapped within the cmap table in the font, allowing the PDF generator to create an appropriate ToUnicode mapping.

Most of the documentation I can access on the ToUnicode mappings in PDF files relates primarily to CID-keyed fonts, rather than OpenType fonts. I assume that for OpenType fonts the font's cmap table is used as the basis for generating the ToUnicode mapping. If so, that would explain the inability of PDF generators to correctly map glyphs to Unicode code points. My understanding of the OpenType cmap table is that only some glyphs would be mapped to Unicode code points. Certain glyphs would not be present in the OpenType cmap table.
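
To make the assumption above concrete, here is a minimal sketch in Python of what I think happens when a ToUnicode mapping is derived only by inverting a font's cmap. The glyph names and the tiny "font" are hypothetical toy data, not taken from any real font file or PDF library, and a real generator may well use additional information.

```python
# Toy illustration only: the glyph names and the tiny "font" below are
# hypothetical and do not come from any real font file.

def reverse_cmap(cmap):
    """Invert a cmap-style mapping (code point -> glyph name)."""
    glyph_to_codepoints = {}
    for codepoint, glyph in cmap.items():
        glyph_to_codepoints.setdefault(glyph, []).append(codepoint)
    return glyph_to_codepoints


def unmapped_glyphs(all_glyphs, cmap):
    """Return glyphs that the cmap alone cannot resolve to a code point."""
    reachable = set(cmap.values())
    return [g for g in all_glyphs if g not in reachable]


# Hypothetical cmap: Unicode code point -> glyph name.
toy_cmap = {
    0x1000: "ka",         # MYANMAR LETTER KA
    0x1031: "e_vowel",    # MYANMAR VOWEL SIGN E
    0x103C: "medial_ra",  # MYANMAR CONSONANT SIGN MEDIAL RA
}

# Hypothetical glyph inventory: it includes glyphs that are only produced
# by OpenType substitution and therefore have no cmap entry.
toy_glyphs = ["ka", "e_vowel", "medial_ra", "medial_ra.wide", "ka_ssa.lig"]

missing = unmapped_glyphs(toy_glyphs, toy_cmap)
print(missing)  # ['medial_ra.wide', 'ka_ssa.lig']
```

In this toy case the two substitution-only glyphs have no route back to a code point via the cmap, which is the situation I suspect PDF generators run into with complex-script fonts.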
In other words, a few or many glyphs may not be resolvable to Unicode code points.

It is also my understanding that PDFs will always have problems with glyphs generated through the pre-base substitution features, i.e. even if the glyph can resolve to a specific code point, its position in the data will be incorrect. For example, pre-consonant vowels like the dependent e vowel in Myanmar script will be ordered before the consonant in a PDF file rather than after the consonant and medial, etc.

This leaves the possibility of ActualText. The most common use of ActualText I have seen is the generation and embedding of a text layer into a scanned PDF via OCR. In theory, with access to the original document, it would be possible to add ActualText to each tag, so that the PDF would contain both the glyph-based text content and Unicode ActualText fields.

Assuming a PDF has both text and ActualText, which would be used by indexing, searching and accessibility software? Are there any software tools that would use ActualText in preference to the text in the PDF? If not, it would seem that PDFs cannot be accessible except for specific languages with well developed fonts.

To date, even if I use all the techniques in http://www.w3.org/TR/WCAG20-TECHS/pdf.html, I cannot seem to create accessible documents in some languages, such as Burmese. The closest I can get involves avoiding font subsetting, embedding the complete font, and parsing the text, doing some reordering, normalisation and conversion on the text before passing the modified text to an indexer or screen reader etc. This would be the ideal case, which would rarely hold true.

Are my assumptions valid? Have I overlooked something? Are there tools out there that can give better results than I have been able to achieve with what I have at hand?

The reason for these questions: I am currently reviewing some guidelines on translated web content on state government sites. Currently most translations are deployed as PDF files.
But at the moment I can't see a way to ensure all PDF files are searchable, indexable and accessible. I am leaning towards guidelines that require the primary format for translated government information to be HTML, and avoid PDFs except for printing. Would this make sense? Or am I off track here?

Andrew

Andrew Cunningham
andj.cunningham@gmail.com
Received on Monday, 4 January 2016 17:50:45 UTC