Re: Acrobat PDF & Accessibility from David Woolley on 2001-12-21 (w3c-wai-ig@w3.org from October to December 2001)

From: David Woolley <david@djwhome.demon.co.uk>
Date: Fri, 21 Dec 2001 20:21:04 +0000 (GMT)
To: w3c-wai-ig@w3.org
Message-Id: <200112212021.fBLKL4O23876@djwhome.demon.co.uk>

> Can someone explain to me what do you mean by *accessible* PDF?

What is currently being talked about as accessible PDF is probably
a combination of Microsoft Active Accessibility (MSAA) support in Acrobat
Reader and the addition of an alternative document tree that maps the
document into a logical structure.  The nearest equivalent to the latter
in earlier PDF was probably the thread and bead mechanism, which allowed
one to reassemble the sort of magazine article that is spread over odd
columns on several pages.

Having that tree is a bit like being able to lay out your HTML as a
table, without regard to linearisation issues, while still being able
to linearise it, because it carries instructions as to the logical
structure.
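
As a rough illustration of the idea, here is a minimal Python sketch.
The Node class, the fragment numbering and the linearise function are
all hypothetical names of mine, not the PDF data model; the only point
is that the page content stays in paint order while a separate tree
supplies the reading order.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One element of a hypothetical logical structure tree."""
    role: str                                         # e.g. "Article", "P"
    content_ids: list = field(default_factory=list)   # page fragments owned by this node
    children: list = field(default_factory=list)


def linearise(node, fragments):
    """Depth-first walk of the tree, returning text in logical order."""
    out = [fragments[i] for i in node.content_ids]
    for child in node.children:
        out.extend(linearise(child, fragments))
    return out


# Fragments keyed by the order in which they are painted on the page:
# a two-column layout drawn row by row, so the columns are interleaved.
fragments = {
    0: "Left column, line 1",
    1: "Right column, line 1",
    2: "Left column, line 2",
    3: "Right column, line 2",
}

# The tree says which fragments belong together and in what order.
tree = Node("Article", children=[
    Node("P", content_ids=[0, 2]),   # the left column is read first
    Node("P", content_ids=[1, 3]),   # then the right column
])

reading_order = linearise(tree, fragments)
```

Reading the fragments in paint order would interleave the two columns;
walking the tree gives each column in turn.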

> Is it PDF without "microspacing" and "words being broken up"?

My impression is that PDF has always been intended to avoid this, but
a lot of PDF is simply created from PostScript.  Acrobat actually makes
quite a good job of inferring word boundaries.  (Someone wanting to get
pixel-perfect magazine pages in SVG was recently advised, on the SVG list,
to place every character individually!)
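
To show what inferring word boundaries involves when every character
has been placed individually, here is a minimal Python sketch.  It is
not Acrobat's algorithm, which I have not seen; the infer_words name,
the (char, x, width) tuples and the min_gap threshold are assumptions
made up for the example.

```python
def infer_words(glyphs, min_gap=1.5):
    """Group individually placed glyphs on one baseline into words.

    glyphs is a list of (char, x, width) tuples, sorted left to right.
    A new word starts when the horizontal gap between the end of one
    glyph and the start of the next exceeds min_gap (an assumed value).
    """
    if not glyphs:
        return []
    words = []
    current = glyphs[0][0]
    prev_end = glyphs[0][1] + glyphs[0][2]
    for char, x, width in glyphs[1:]:
        if x - prev_end > min_gap:
            words.append(current)   # gap is wide enough to be a space
            current = char
        else:
            current += char         # small gap: kerning or microspacing
        prev_end = x + width
    words.append(current)
    return words


# "to be", with slight microspacing inside each word.
glyphs = [("t", 0.0, 3.0), ("o", 3.2, 5.0), ("b", 11.0, 5.0), ("e", 16.1, 5.0)]
words = infer_words(glyphs)
```

A fixed threshold is the crudest possible choice; a real reader would
presumably scale it by font size and cope with glyphs that arrive out
of order.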

> If you send me off-list some small (<30K) .doc or .html or simple RTF file, 
> and explain *how* should good PDF file produced from that doc look like, I 
> will do the testing and post results here.

It's probably easier to send the PostScript recovered from the PDF,
to avoid issues with compression.  I'd suggest an example might be
comparing the PostScript created by IE, from the HTML specification, with
the PostScript recovered from the PDF version.  The latter was created
with the freeware html2ps tools (post-processed with Acrobat Distiller,
although Ghostscript might have worked).  I believe that html2ps has a
very simplistic (more simplistic than PDF can handle well) text layout
algorithm, so it should generate long runs of characters, and yes, its
output is clean.
This is a small fragment (the uncompressed PDF would look more or
less the same):

      (THIS DOCUMENT IS PROVIDED "AS IS," AND COPYRIGHT HOLDERS MAKE) Tj
      -11 -13.2 Td
      (NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED,) Tj
      0 -13.2 Td
      (INCLUDING, BUT NOT LIMITED TO, WARRANTIES OF MERCHANTABILITY,) Tj
      0 -13.2 Td
      (FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT, OR TITLE;) Tj
      0 -13.2 Td
      (THAT THE CONTENTS OF THE DOCUMENT ARE SUITABLE FOR ANY) Tj

I'll need to switch to Windows to get the equivalent.
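
For anyone who wants to compare fragments like the one above
mechanically, here is a minimal Python sketch that pulls the strings
out of such a stream.  It assumes one operator per line and handles
only the two operators shown, Tj (show a string) and Td (move to the
start of the next line, offset from the start of the current one).  It
does not handle escaped parentheses, TJ arrays or font changes, and
extract_lines is a name of my own.

```python
import re

# "(some text) Tj" : show a literal string
TJ = re.compile(r"^\((.*)\)\s+Tj$")
# "tx ty Td" : move to the next line, offset from the current line start
TD = re.compile(r"^(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s+Td$")


def extract_lines(stream):
    """Return (x, y, text) tuples, with x and y relative to the first line."""
    x = y = 0.0
    lines = []
    for raw in stream.splitlines():
        op = raw.strip()
        move = TD.match(op)
        if move:
            x += float(move.group(1))
            y += float(move.group(2))
            continue
        show = TJ.match(op)
        if show:
            lines.append((x, y, show.group(1)))
    return lines


sample = """
      (THIS DOCUMENT IS PROVIDED "AS IS," AND COPYRIGHT HOLDERS MAKE) Tj
      -11 -13.2 Td
      (NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED,) Tj
"""

text_lines = extract_lines(sample)
```

The fewer and longer the strings that come out, the less a reader has
to guess about where the words are.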

> How many people understand PostScript and PDF? Not too many, IMO.
> I was studying PostScript around 7-8 years ago, but since that time PLRM 
> version 3.0 was published, it's 950 pages, and I just don't have time to come 
> through...

In the early days, people hand-coded them.  In the early days of HTML,
people hand-coded it too.

> Most people use auto-generated PostScript (Windows or MacOS "PS driver", some 
> publishing software, Adobe tools after all) 

The same is becoming true of HTML, and is almost certain to become true of
SVG.

However, in this context, they were making the same mistake as was made 
earlier in this thread, of assuming that PDF was just an image format.
In fact, they saved each page of the brochure to a separate file, and
the pages were simply bitmaps.  This isn't a case of understanding how
to hand-code it.  It's a case of PostScript being a buzzword associated
with printers, and nothing more.

Many people commissioning HTML have no concept of what it really is.

The reason the PostScript specification is so long is that it is
relatively complete.  As SVG is more powerful, if its specification is
not even longer, it is unlikely to be complete enough to allow consistent
implementations,
especially if there is no single reference version.

Received on Friday, 21 December 2001 15:23:23 UTC