Re: PDF - Text Extraction File from David Woolley on 2025-10-18 (w3c-wai-ig@w3.org from October to December 2025)

From: David Woolley <forums@david-woolley.me.uk>
Date: Sun, 19 Oct 2025 00:12:24 +0100
To: w3c-wai-ig@w3.org
Message-ID: <422abe87-721a-4dfb-87cc-e84ab75da3aa@david-woolley.me.uk>

On 17/10/2025 23:24, Karen Lewellen wrote:
> I resonate with this interpretation.
> There is a tool in Linux called pdftotext.  When I run the file through 
> that tool not only does it get syntax errors related to  things like a 
> dictionary, the resulting .txt file is actually blank.

pdftotext is the tool I incorrectly referred to as pdf2text. For me, it
runs with no errors and returns a significant amount of text.  As I've
already pointed out, the text is really of no use to someone who wants
to know their way around, so the whole document is unusable to a
non-visual user, but it does pass a test of whether there is text in it,
if that is what you want to use as your test of accessibility.
Personally, I don't think you can automate a pass/fail test on this; you
have to read the text.
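
For what it's worth, a bare "is there any text" check is easy to script.
A minimal sketch, assuming pdftotext has already written the .txt file;
has_text is a hypothetical helper name, and passing it says nothing
about whether the text is usable:

```shell
# has_text FILE: succeed if FILE contains at least one non-whitespace
# character. (Hypothetical helper; not part of poppler-utils.)
has_text() {
    # Form feeds, which pdftotext emits between pages, count as whitespace.
    grep -q '[^[:space:]]' "$1"
}

# Example use, assuming ivany-map.txt already exists:
#   has_text ivany-map.txt && echo "some text was extracted"
```

A blank result like the one Karen describes would fail this check, but a
pass only shows that characters came out, not that they make sense in
reading order.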

I'm wondering if the web server is serving it wrongly, possibly with the
wrong Content-Encoding, and some browsers are fixing that up.  I
downloaded it with Firefox ESR on Debian 12 and ran the Debian 12
pdftotext against it.

$ pdftotext ivany-map.pdf
$ wc ivany-map.txt
  222  302 1639 ivany-map.txt
$ head -10 ivany-map.txt
Welcome to Ivany Campus
Secondary
entrance

1
Employee
parking

Electric
vehicle
$ shasum ivany-map.pdf
f8d7568fd31159cb7922ffae30b2f9f3a44b1fb8  ivany-map.pdf
$
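
One way to test the content-encoding theory is to look at the first
bytes of the downloaded file. A rough sketch, assuming a well-formed PDF
starts with the "%PDF-" signature and that a response body left
gzip-compressed would start with the gzip magic bytes 1f 8b; the helper
names are hypothetical:

```shell
# first_hex FILE N: print the first N bytes of FILE as lowercase hex.
first_hex() {
    head -c "$2" "$1" | od -An -tx1 | tr -d ' \n'
}

# Succeed if FILE starts with the ASCII signature "%PDF-" (25 50 44 46 2d).
looks_like_pdf() {
    [ "$(first_hex "$1" 5)" = "255044462d" ]
}

# Succeed if FILE starts with the gzip magic bytes 1f 8b.
looks_like_gzip() {
    [ "$(first_hex "$1" 2)" = "1f8b" ]
}

# Example use on a downloaded copy:
#   looks_like_pdf ivany-map.pdf  || echo "does not start with %PDF-"
#   looks_like_gzip ivany-map.pdf && echo "looks gzip-compressed"
```

If looks_like_gzip succeeds on a file saved as ivany-map.pdf, the
download was probably never decompressed. Comparing the shasum output
above between machines would also show whether we are looking at the
same bytes.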

Received on Saturday, 18 October 2025 23:12:42 UTC