- From: Andrew Kirkpatrick <akirkpat@adobe.com>
- Date: Tue, 5 Jan 2016 01:41:56 +0000
- To: Andrew Cunningham <andj.cunningham@gmail.com>, "w3c-wai-ig@w3.org" <w3c-wai-ig@w3.org>
- Message-ID: <411D9AF9-742C-4E39-8519-61E134788497@adobe.com>
Andrew, I asked a couple of our experts on PDF and fonts (Leonard and Matthew) and have the following info that I hope will help:

It is true that the instructions in the PDF content stream are not (necessarily) directly connected to Unicode code points. In most cases, if the PDF processor is attempting to extract Unicode values from text, then the values in the content stream would be mapped to Unicode as per clause 9.10 in ISO 32000-1:2008. The methods described there (including standard encodings, ToUnicode tables if present, cmap resources for embedded fonts, and ActualText entries) apply to all font types that can be present in a PDF, including both CFF and TTF-based OpenType fonts. As noted, details are in the standard.

The ToUnicode map is available to all fonts. Moreover, it can map a glyph index not just to a single Unicode character but to any number of characters. As a simple example, an fi ligature is a single glyph but maps to two Unicode code points. Basically, there is no limitation in PDF: as long as you embed the correct information, you can map any set of glyphs deterministically to Unicode.

When the visual representation of the text does not match the semantic representation (be it via substitutions, reorderings, etc.), the use of ActualText is the proper mechanism for addressing that situation (as described in clause 14.9.4 of ISO 32000-1:2008). ActualText is NEVER EVER EVER used to represent the results of OCR; that would be a violation of the standard.

As noted above, the standard is clear about how to do text extraction, including the use of ActualText. Any software that ignores ActualText when it is present is in violation of the standard. AFAIK, the major PDF viewers, such as Adobe Acrobat/Reader, FoxIt (and the Chrome variant thereof) and Apple Preview, all recognize and process ActualText. I believe that the Google indexer also respects it.
Beyond that, you'd need to consult the vendor or the code to determine whether a given tool complies with the standard. AWK
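
For illustration only, here is a minimal Python sketch of the extraction logic described above, using the fi ligature example. It is not how any particular PDF processor is implemented. The table contents, the character codes, and the function name `extract_text` are all assumptions made up for this example; a real ToUnicode CMap is a stream inside the PDF, and ActualText is attached to a structure element or marked-content sequence rather than passed as an argument.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical stand-in for a font's ToUnicode table: each character code
# used in the content stream maps to a Unicode string of any length.
TO_UNICODE: Mapping[int, str] = {
    0x01: "o",
    0x02: "f",
    0x03: "fi",  # one ligature glyph, two Unicode code points
    0x04: "c",
    0x05: "e",
}


def extract_text(
    codes: Sequence[int],
    to_unicode: Mapping[int, str],
    actual_text: Optional[str] = None,
) -> str:
    """Return Unicode text for a run of character codes.

    Simplification: if ActualText is supplied for the run, it replaces
    the text that would otherwise be derived from the glyphs.
    """
    if actual_text is not None:
        return actual_text
    # Codes with no mapping become U+FFFD (an assumption of this sketch).
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


# Five codes on the page, six characters of extracted text.
word = extract_text([0x01, 0x02, 0x03, 0x04, 0x05], TO_UNICODE)
```

The point of the sketch is that the mapping is one-to-many by design, so a ligature costs nothing in extractability, and that ActualText, where present, takes precedence over whatever the glyph codes alone would yield.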
Received on Tuesday, 5 January 2016 01:42:27 UTC