Re: google and pdf from David Woolley on 2001-02-10 (w3c-wai-ig@w3.org from January to March 2001)

From: David Woolley <david@djwhome.demon.co.uk>
Date: Sat, 10 Feb 2001 14:07:09 +0000 (GMT)
To: w3c-wai-ig@w3.org
Message-Id: <200102101407.f1AE79o20047@djwhome.demon.co.uk>

> Google provides a plain text version of the document.

That this should be possible has always been an aim of PDF,
although text extraction had not succeeded for the one document
for which I found this feature.

Extraction depends on the document containing real text (not a scan
with no backing text), being composed in a sensible reading order, and
the text extractor
being able to cope with the excesses of micro-spacing in the authoring
tool.  (Word/Windows tends to place each character separately, so the
extractor has to guess the word boundaries from the spacing, whereas
it is trivial to extract text from PDF written to the PDF authoring
guidelines.)
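
As a rough illustration of the guessing involved, here is a minimal
sketch in Python. It assumes the extractor has already recovered each
character of a line with its horizontal position and advance width.
The Glyph tuple, the join_glyphs name and the gap_ratio threshold are
all made up for this example and are not taken from any real PDF
tool; a real extractor also has to deal with kerning, justification
and font changes.

```python
from typing import List, Tuple

# Hypothetical representation: (character, x position, advance width)
Glyph = Tuple[str, float, float]


def join_glyphs(glyphs: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Rebuild one line of text from individually placed glyphs.

    A space is inserted wherever the horizontal gap between two
    neighbouring glyphs exceeds gap_ratio times the mean glyph width.
    The threshold is an assumption and would need tuning per document.
    """
    if not glyphs:
        return ""
    # Sort left to right; drawing order is not guaranteed to be reading order.
    ordered = sorted(glyphs, key=lambda g: g[1])
    mean_width = sum(width for _, _, width in ordered) / len(ordered)
    out = [ordered[0][0]]
    for (_, prev_x, prev_w), (ch, x, _) in zip(ordered, ordered[1:]):
        gap = x - (prev_x + prev_w)
        if gap > gap_ratio * mean_width:
            out.append(" ")  # gap is wide enough to be a word boundary
        out.append(ch)
    return "".join(out)


# Four separately placed characters, with one wide gap after the "o".
line = [("t", 0.0, 3.0), ("o", 3.2, 5.0), ("b", 11.0, 5.0), ("e", 16.1, 5.0)]
print(join_glyphs(line))  # prints "to be"
```

If the authoring tool had written the words as whole strings with
explicit space characters, none of this guessing would be needed.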

(To the extent that SVG is created with similar tools to those used to
create PDF, text extraction will be similarly easy or difficult.)

Received on Saturday, 10 February 2001 09:07:16 UTC