- From: David Woolley <david@djwhome.demon.co.uk>
- Date: Tue, 20 Apr 2004 22:55:26 +0100 (BST)
- To: w3c-wai-ig@w3.org
> Now I'm at least lost - how can the Postscript print driver be the
> cause of a problem like this?

PDF is a combination of:

- PostScript, with all the procedural bits run and just the display primitives with all parameters resolved to absolute values;
- information on the page structure and hyperlinks; and
- if it is tagged, an overlay of information on the logical structure.

If PDF is generated from MSWord, as is quite common, you either print to a PostScript file on disk and run the Distiller program, which resolves the PostScript to the absolute display primitives of PDF, or you use the pdfwrite device driver. Certainly in the former case, and I believe in the latter case, the generic printer driver code in Windows microspaces the output: it emits alternating show-character and move operations, with nothing at all for spaces, rather than outputting a complete string with the spacing information supplied separately. The result is that the Adobe tools have already lost the information on word boundaries, etc., and Acrobat Reader has to be very clever to try to reconstruct it.

What I believe happens in the tagged PDF generation support for MS Office is that the document automation features of Word are used to read the original document, and Acrobat then tries to match this information up with the microspaced output that it is seeing.

> There is nothing that says a PDF document ("tagged", I believe they
> call it) can't use non-absolutely positioned letters and then

This predates tagged PDF by somewhat over half a decade, and maybe a whole decade; it may be as old as PDF. For a very long time the authoring guidelines for PDF have said that you should not suppress spaces and that you should try to keep words together, and PDF has had primitives for indicating microspacing in parallel with the main text. However, typical authoring tool routes don't respect this, and the sort of PDF I've seen generally violates all these rules.
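To make the reconstruction problem concrete, here is a minimal sketch of the kind of guesswork a reader has to do when it is handed individually placed glyphs with no space characters. The glyph tuples, the units and the `gap_ratio` threshold are all invented for illustration; this is not how Acrobat Reader actually does it, only one plausible heuristic.

```python
from typing import List, Tuple

# (x position, character, advance width) in hypothetical page units
Glyph = Tuple[float, str, float]

def rebuild_words(glyphs: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Guess word boundaries from individually positioned glyphs.

    A space is inferred when the gap between the end of one glyph and the
    start of the next exceeds gap_ratio times the previous glyph's width.
    The threshold is an arbitrary illustrative choice.
    """
    out = []
    prev_end = None
    prev_width = 0.0
    for x, ch, width in sorted(glyphs, key=lambda g: g[0]):
        if prev_end is not None and x - prev_end > gap_ratio * prev_width:
            out.append(" ")
        out.append(ch)
        prev_end = x + width
        prev_width = width
    return "".join(out)

# "to be" as a microspacing driver might emit it: four show operations
# separated by moves, and no glyph at all for the space.
microspaced = [(0.0, "t", 3.0), (3.1, "o", 5.0), (10.6, "b", 5.0), (15.7, "e", 4.0)]
print(rebuild_words(microspaced))
```

A well-behaved producer would not need any of this: it would emit the whole string, space included, and carry the fine positioning as numeric adjustments alongside it (which is what PDF's `TJ` operator allows), so the text could be read straight out of the stream.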
(When claiming to be better than PDF or Flash, SVG advocates tend to compare best practice in SVG with typical practice in the others!)

If you hand author PDF, or if you use a tool like groff, which makes minimal spacing adjustments when producing PostScript, you will get very compact PDF that allows relatively easy extraction of text. If you use the tools that are typically used, you will get bloated PDF which requires a lot of machine intelligence to reconstruct the source text.

Tagged PDF requires that PDF text primitives be used properly, and slightly augments them by, for example, allowing the expanded form of ligatures to be encoded. Mainly, however, it is a structural overlay which isolates things like page headers that aren't really part of the document, then allocates the remaining textual elements to equivalent HTML-like (very like) structural elements, often several graphic primitives to one element. This is done partly inline, where there is a close correlation, and partly as a parallel structural tree.

I don't know how well the Adobe tools tag Word documents, but I do know that they will only do it properly if people use styles, headings, etc. properly in Word; most people write presentational Word!

> translate that to absolutely positioned letters in Postscript prior to
> printing.

Especially if you are using a recent version of PostScript, Acrobat Reader actually does a very literal translation into PostScript, as most of PDF's primitives are simply PostScript primitives with only literal values allowed. For the more complex ones, it outputs PostScript procedures, so that the main body of the PostScript can be a fairly literal version of the PDF. It can't introduce microspacing itself, as that would violate the fundamental principle that PDF represents the document as it would have been printed; it's the ultimate pixel perfect format.
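As a rough illustration of "primitives with only literal values", the sketch below holds a hand-written page content stream as plain bytes, runs it through zlib (the Deflate format that PDF's FlateDecode filter stores) and then reverses the process. The font name, coordinates and string are made up for the example; they are not taken from any real file or tool output.

```python
import zlib

# Hand-written page content stream: every operand is a literal value.
# /F1, the coordinates and the text are illustrative assumptions.
content = (
    b"BT\n"                    # begin text object
    b"/F1 12 Tf\n"             # select font resource F1 at 12 units
    b"72 720 Td\n"             # move to an absolute position
    b"(Hello, world) Tj\n"     # show one complete string
    b"ET\n"                    # end text object
)

compressed = zlib.compress(content)     # what a Flate-compressed stream stores
restored = zlib.decompress(compressed)  # what a viewer works with after decoding

print(compressed[:8])                   # opaque bytes
print(restored.decode("ascii"))         # readable operators again
```

The compressed form is what you normally find when you open a PDF in a text editor, which is why the resemblance to PostScript is usually invisible.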
(Note that you cannot see the similarity to PostScript in normal PDF, because runs of graphics primitives are compressed using general purpose compression algorithms (e.g. LZ77, as used in PKZip). However, it is possible to generate valid PDF that is not compressed, and you will then see that it is actually a textual format, in the sense that SVG is, not a binary one.)

> This is what happens to HTML with a printer stylesheet, after all
> (given that the printer handles Postscript).

HTML is not a final form language. PDF and SVG are final form.

> PDF has flaws, but a properly "marked up" PDF file isn't all that
> much different in principle than a properly marked up HTML document.

Except that it is final form. Tagging allows it to be reflowed, of course. Personally, I think that tagged PDF is a better solution for commercial web pages because it starts from a strictly reproducible presentational form and then adds structuring. That seems to be the work flow used by commercial web designers, but it is not a good work flow for HTML!

The key point about being final form is that microspacing is not under the control of the viewer. (Because PDF is intended to be a universal format for accurate visual reproduction of documents, the document might be a scanned image of a non-machine processable document, in which case there is no text, although tools do exist to OCR the document and add a text underlay.)

Used correctly, SVG has separate microspacing data, and you can use a tspan for each line of the text so that paragraphs are extracted as a complete unit. However, this doesn't help if the tools don't do this, or if the designer pasted up the document in an order that was convenient for them but is not a logical reading order.
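A small sketch of the "used correctly" SVG case: one `text` element per paragraph, one `tspan` per line, with the per-glyph adjustments kept in the `dx` attribute instead of being baked into separately positioned characters. The helper names, coordinates and adjustment values are invented for the example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

def make_paragraph(lines, line_height=14):
    """Build one <text> element with a <tspan> per line.

    `lines` is a list of (line_text, dx_adjustments) pairs; the adjustments
    are hypothetical microspacing values kept apart from the characters.
    """
    text = ET.Element(f"{{{SVG_NS}}}text", {"x": "10", "y": "20"})
    for i, (words, dx) in enumerate(lines):
        attrs = {"x": "10", "dy": str(line_height if i else 0)}
        if dx:
            attrs["dx"] = " ".join(str(v) for v in dx)
        span = ET.SubElement(text, f"{{{SVG_NS}}}tspan", attrs)
        span.text = words
    return text

def extract_text(text_element):
    """Recover the paragraph as one unit, ignoring all positioning."""
    return " ".join(t.strip() for t in text_element.itertext() if t.strip())

paragraph = make_paragraph([
    ("Spacing lives in", [0, 0.5, 0.5]),
    ("attributes, not text.", []),
])
print(ET.tostring(paragraph, encoding="unicode"))
print(extract_text(paragraph))
```

Extraction here needs no heuristics because the words and spaces are still in the character data. None of that survives if the producing tool emits one positioned element per glyph, or emits the lines in paste-up order.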
Received on Tuesday, 20 April 2004 18:04:00 UTC