RE: PDF - Text Extraction File

Morning!

If the PDF could be shared, a lot of speculation could end.

If it is a campus map, it is most likely a scanned graphic that is untagged.

Without being able to look at the document, we can't determine the accessibility barrier.

Several of us who have worked in the field of PDF creation and remediation have said that this sounds like a scanned graphic of a map that is not tagged, and therefore is not accessible.

Performing OCR (optical character recognition) on a map can lead to two unsatisfactory outcomes:

1. The map will be tagged as a graphic requiring Alt Text. Alt Text would be difficult to provide because it would require a lot of unstructured text that could crash the screen reader buffer.
2. Any text on the map could be recognized and converted to text, which would also be an accessibility barrier because someone would hear a list of names, buildings, streets or other bits of text with no context.

We need to be able to open the PDF in a PDF Editor in order to examine what is going on and provide possible solutions. If it is a campus map, please send the link to download it.

Cheers, Karen

-----Original Message-----
From: David Woolley <forums@david-woolley.me.uk> 
Sent: Saturday, October 18, 2025 8:30 PM
To: w3c-wai-ig@w3.org
Subject: Re: PDF - Text Extraction File

On 19/10/2025 00:12, David Woolley wrote:
> I'm wondering if the web server is serving it wrongly, possibly wrong 
> content-encoding, and some browsers are fixing that up.  I downloaded 
> it with Firefox ESR, from Debian 12, and ran the Debian 12 pdftotext, 
> against it.
> 
> 
There are quite a lot of response headers that are new to me, but the ones that actually describe the document are very straightforward, and say it is PDF, with nothing special done to it.
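
As a rough illustration of that check, here is a minimal Python sketch that pulls out only the headers describing the document itself. RAW_RESPONSE is an abbreviated copy of the headers quoted at the end of this message, and the function name describe_entity is just a hypothetical helper for this example, not part of any tool mentioned in the thread.

```python
from email.parser import Parser

# Abbreviated copy of the response headers quoted at the end of this message.
RAW_RESPONSE = """\
HTTP/1.1 200 OK
Content-Type: application/pdf
Accept-Ranges: bytes
Content-Length: 3216271
"""


def describe_entity(raw_response: str) -> dict:
    """Return the headers that describe the body (hypothetical helper)."""
    # The status line is not a header, so split it off before parsing.
    status_line, _, header_block = raw_response.partition("\n")
    headers = Parser().parsestr(header_block, headersonly=True)
    length = headers.get("Content-Length")
    return {
        "status": status_line.strip(),
        "content_type": headers.get("Content-Type"),
        # None means the server did not declare any compression.
        "content_encoding": headers.get("Content-Encoding"),
        "content_length": int(length) if length is not None else None,
    }


info = describe_entity(RAW_RESPONSE)
print(info)
```

If content_encoding comes back as None and content_type is application/pdf, the server is claiming to send a plain, uncompressed PDF.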

However, I wonder if you are using something other than a mainstream browser.  It is possible that you are being served a substitute document, intended for suspected crawlers, to stop, for example, AI being trained on the data, or, in the past, people building databases that could be used to undermine the site owner's business model, or simply to delete the adverts that pay for the site.

Could you check the size of the file, which should be 3216271 bytes, and, if you are using Linux or FreeBSD, run the "file" utility against it?  You should get:

$ file ivany-map.pdf
ivany-map.pdf: PDF document, version 1.5, 1 pages
$

(I'm wondering if you have been served a compressed version and it hasn't been uncompressed by the download tool.  Maybe there is a Content-Encoding: gzip header in the response that is being ignored, although note that this wasn't present in the responses to either Firefox or wget.)
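
For anyone without the "file" utility, the same two checks (size, and whether the file is really a PDF or is still gzip-compressed) can be sketched in Python. This is only a sketch under assumptions: inspect_download is a hypothetical helper name, the expected size is the figure given above, and the demonstration uses small stand-in files rather than the real map.

```python
import gzip
import os
import tempfile

EXPECTED_SIZE = 3216271  # size in bytes quoted for the PDF in this thread


def inspect_download(path: str, expected_size: int = EXPECTED_SIZE) -> dict:
    """Report the size and apparent type of a downloaded file."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(b"%PDF-"):
        kind = "pdf"
    elif head.startswith(b"\x1f\x8b"):
        # gzip magic number: the body was saved without being decompressed.
        kind = "gzip"
    else:
        kind = "unknown"
    return {"size": size, "size_matches": size == expected_size, "kind": kind}


# Demonstration with stand-in files (assumption: not the real campus map).
with tempfile.TemporaryDirectory() as tmp:
    pdf_path = os.path.join(tmp, "sample.pdf")
    gz_path = os.path.join(tmp, "sample.pdf.gz")
    payload = b"%PDF-1.5\n% stand-in content, not a real PDF\n"
    with open(pdf_path, "wb") as fh:
        fh.write(payload)
    with open(gz_path, "wb") as fh:
        fh.write(gzip.compress(payload))
    pdf_report = inspect_download(pdf_path)
    gz_report = inspect_download(gz_path)

print(pdf_report["kind"], gz_report["kind"])
```

To check a real download, call inspect_download with the path to the saved file; a result of "gzip" or a size mismatch would suggest the file on disk is not what the server described.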

For reference, these are the HTTP headers I got (there is no more input from me after this):

HTTP/1.1 200 OK
Cache-Control: max-age=2592000
Content-Type: application/pdf
Last-Modified: Wed, 15 Oct 2025 00:48:42 GMT
Accept-Ranges: bytes
ETag: "19e2f4736d3ddc1:0"
Server: Microsoft-IIS/10.0
X-UA-Compatible: IE=edge
Permissions-Policy: camera=(), fullscreen=(self), geolocation=(*), microphone=()
Referrer-Policy: no-referrer-when-downgrade
X-Content-Type-Options: nosniff
X-Xss-Protection: 1; mode=block
Content-Security-Policy-Report-Only: default-src 'self' *.nscc.ca; img-src 'self' *.nscc.ca *.gstatic.com *.fontawesome.com *.google.ca
  --- much more of the same ---
  ancestors 'self' *.nscc.ca:*;
Content-Security-Policy: frame-ancestors 'self' *.nscc.ca:*;
Date: Sat, 18 Oct 2025 23:24:06 GMT
Content-Length: 3216271
Strict-Transport-Security: max-age=157680000

Received on Monday, 20 October 2025 12:15:17 UTC