Re: Screen readers - usage stats? from David Woolley on 2004-04-20 (w3c-wai-ig@w3.org from April to June 2004)

From: David Woolley <david@djwhome.demon.co.uk>
Date: Tue, 20 Apr 2004 22:55:26 +0100 (BST)
To: w3c-wai-ig@w3.org
Message-Id: <200404202155.i3KLtRm02104@djwhome.demon.co.uk>
>   Now I'm atleast lost - how can the Postscript print driver be the
>   cause of a problem like this ?

PDF is a combination of:
-  PostScript, with all the procedural bits run and just the display
   primitives with all parameters resolved to absolute values;
-  information on the page structure and hyperlinks; and
-  if it is tagged, an overlay of information on the logical structure.

If PDF is generated from MSWord, as is quite common, you print to a
PostScript file on disk and run the Distiller program to resolve
the PostScript to the absolute display primitives of the PDF, or
you use Ghostscript's pdfwrite device driver.  Certainly in the former
case, and I believe in the latter case, the generic printer driver code
in Windows microspaces the output and outputs alternating show character
and move operations, with nothing at all for spaces, rather than
outputting a complete string and the spacing information separately.

The result is that the Adobe tools have already lost the information on
word boundaries, etc., and Acrobat Reader has to be very clever to try
and reconstruct it.
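
As a rough illustration of the guesswork involved, here is a minimal
sketch of rebuilding words from individually positioned glyphs.  It is
not what Acrobat Reader actually does; the glyph tuples, the function
name and the gap_ratio threshold are all assumptions made up for the
example.

```python
from typing import List, Tuple

# Each glyph is (x_position, character, advance_width), all in the same units.
Glyph = Tuple[float, str, float]


def rebuild_line(glyphs: List[Glyph], gap_ratio: float = 0.25) -> str:
    """Guess word boundaries from the horizontal gaps between glyphs.

    gap_ratio is a hypothetical tuning knob: a gap wider than
    gap_ratio * (width of the previous glyph) is treated as a space.
    """
    out = []
    prev_end = None
    prev_width = 0.0
    for x, ch, width in sorted(glyphs):  # left-to-right order
        if prev_end is not None and x - prev_end > gap_ratio * prev_width:
            out.append(" ")
        out.append(ch)
        prev_end = x + width
        prev_width = width
    return "".join(out)


# "to be" drawn as four separately positioned glyphs, with no space glyph.
glyphs = [(0.0, "t", 3.0), (3.1, "o", 5.0), (11.0, "b", 5.0), (16.05, "e", 5.0)]
print(rebuild_line(glyphs))
```

A fixed threshold like this breaks down with justified text or tight
kerning, which is why real extraction needs to be a good deal cleverer.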

What I believe happens in the tagged PDF generation support for MS
Office is that the document automation features of Word are used to
read the original document, and Acrobat then tries to match this 
information up with the microspaced output that it is seeing.

> 
>   There is nothing that says a PDF document ("tagged", I believe they
>   call it) can't use non-absolutely positioned letters and then

This predates tagged PDF by somewhat over half a decade, and maybe a whole
decade; it may be as old as PDF.  For a very long time the authoring
guidelines for PDF have said that you should not suppress spaces and
you should try to keep words together, and PDF has had primitives for
indicating microspacing in parallel with the main text.  However, typical
authoring tool routes don't respect this, and the sort of PDF I've seen
generally violates all these rules.  (When claiming to be better than
PDF or Flash, SVG advocates tend to compare best practice in SVG with
typical practice in others!)

If you hand author PDF, or if you use a tool like groff, that does
minimal spacing adjustments in producing PostScript, you will get
very compact PDF that allows relatively easy extraction of text.
If you use the tools that are typically used, you will get bloated
PDF which requires a lot of machine intelligence to reconstruct
the source text.

Tagged PDF requires that PDF text primitives be used properly, and
slightly augments them by, for example, allowing the expanded form
of ligatures to be encoded.  However, it is mainly a structural
overlay which isolates things like page headers that aren't really
part of the document, then allocates the remaining textual elements
to equivalent HTML-like (very like) structural elements - often several
graphic primitives to one element.  This is done partly inline,
where there is a close correlation, and partly as a parallel structural
tree.
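
The overlay idea can be sketched with a toy model.  This is not the
actual PDF object layout: the Chunk and StructElem classes and the
logical_text function are hypothetical, and only the notions of
marked-content ids (MCID), artifacts and roles such as H1 and P are
borrowed from tagged PDF.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Chunk:
    """One run of text as drawn on the page."""
    text: str
    mcid: Optional[int] = None  # None marks an artifact, e.g. a page header


@dataclass
class StructElem:
    """One node of the parallel structure tree."""
    role: str          # HTML-like role such as "H1" or "P"
    mcids: List[int]   # the drawn chunks that make up this element


def logical_text(chunks: List[Chunk],
                 tree: List[StructElem]) -> List[Tuple[str, str]]:
    """Return (role, text) pairs in logical order, ignoring artifacts."""
    by_id = {c.mcid: c.text for c in chunks if c.mcid is not None}
    return [(e.role, " ".join(by_id[m] for m in e.mcids)) for e in tree]


# Drawing order on the page need not match reading order.
page = [
    Chunk("Page 3"),                    # running header: an artifact
    Chunk("spread over two lines.", 1),
    Chunk("A paragraph", 0),
    Chunk("Heading", 2),
]
tree = [StructElem("H1", [2]), StructElem("P", [0, 1])]
print(logical_text(page, tree))
```

Note how two drawn chunks are gathered into the single P element, and
how the page header never reaches the extracted text.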

I don't know how well the Adobe tools tag Word documents, but I do
know that they will only do it properly if people use styles, headings,
etc. properly in Word; most people write presentational Word!

>   translate that to absolutely positioned letters in Postscript prior to
>   printing.

Especially if you are using a recent version of PostScript, Acrobat
Reader actually does a very literal translation into PostScript, as
most PDF primitives are simply PostScript primitives with only
literal values allowed.  For the more complex ones, it outputs PostScript
procedures, so that the main body of the PostScript can be a fairly 
literal version of the PDF.  It can't introduce microspacing itself,
as that would violate the fundamental principle that PDF represents the
document as it would have been printed; it's the ultimate pixel
perfect format.

(Note you cannot see the similarity to PostScript in normal PDF,
because runs of graphics primitives are compressed using general purpose
compression algorithms (e.g. LZ77 as used in PKZip).  However, it is
possible to generate valid PDF that is not compressed and you will see
that it is actually a textual format, in the sense that SVG is, not a
binary one.)
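
A short sketch of that point, using Python's zlib module (the same
deflate scheme that PDF's FlateDecode filter uses).  The content stream
here is an invented example, and the /F1 font name is an assumption.

```python
import zlib

# A repetitive, human-readable run of text operators, as found on a page.
content = b"BT /F1 12 Tf 72 720 Td (Hello world) Tj ET\n" * 20

# What a typical producer stores: the same bytes, deflate-compressed.
compressed = zlib.compress(content)

# What a viewer (or a curious reader) gets back after decompression.
restored = zlib.decompress(compressed)

print(len(content), "bytes of text,", len(compressed), "bytes compressed")
print(restored.decode("ascii").splitlines()[0])
```

The compressed bytes look like binary noise in an editor, but the
restored stream is the plain operator text again.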
> 
>   This is what happens to HTML with a printer stylesheet, after all
>   (given that the printer handles Postscript).

HTML is not a final form language.  PDF and SVG are final form.

>   PDF have flaws, but a properly "marked up" PDF file isn't all that
>   much different in principle than a properly marked up HTML document.

Except that it is final form.  Tagging allows it to be reflowed, of course.
Personally, I think that tagged PDF is a better solution for commercial
web pages because it starts from a strictly reproducible presentational
form and then adds structuring.  That seems to be the work flow used by
commercial web designers, but it is not a good work flow for HTML!

The key point about being final form is that microspacing is not under the
control of the viewer.

(Because PDF is intended to be a universal format for accurate visual
reproduction of documents, the document might be a scanned image of a
non-machine processable document, in which case there is no text, although
tools do exist to OCR the document and add a text underlay.)

Used correctly, SVG has separate microspacing data and you can use tspan
for each line of the text so that paragraphs are extracted as a complete
unit.  However this doesn't help if the tools don't do this, or if
the designer pasted up the document in an order that was convenient
for them, but not a logical reading order.
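
A minimal sketch of the well-behaved case: one tspan per line, with the
microspacing kept apart in dx attributes, so that the paragraph comes
back out as a unit.  The sample text, coordinates and dx values are
invented, and the extract_paragraphs helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from typing import List

SVG_NS = "http://www.w3.org/2000/svg"

svg = f"""<svg xmlns="{SVG_NS}" width="300" height="60">
  <text x="10" y="20">
    <tspan x="10" dy="0" dx="0 0.4 0.2">The words stay whole,</tspan>
    <tspan x="10" dy="14" dx="0 0.1 0.3">one tspan per line.</tspan>
  </text>
</svg>"""


def extract_paragraphs(svg_source: str) -> List[str]:
    """Join the tspans of each text element into one paragraph string."""
    root = ET.fromstring(svg_source)
    paragraphs = []
    for text_el in root.iter(f"{{{SVG_NS}}}text"):
        lines = [(t.text or "").strip() for t in text_el.iter(f"{{{SVG_NS}}}tspan")]
        paragraphs.append(" ".join(line for line in lines if line))
    return paragraphs


print(extract_paragraphs(svg))
```

The dx lists nudge individual glyphs for display without breaking up
the character data, so extraction never has to guess at word boundaries.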
Received on Tuesday, 20 April 2004 18:04:00 UTC