Re: PDF - Text Extraction File

On 19/10/2025 00:12, David Woolley wrote:
> I'm wondering if the web server is serving it wrongly, possibly wrong 
> content-encoding, and some browsers are fixing that up.  I downloaded it 
> with Firefox ESR, from Debian 12, and ran the Debian 12 pdftotext, 
> against it.
> 
> 
There are quite a lot of response headers that are new to me, but the 
ones that actually describe the document are very straightforward, and 
say it is PDF, with nothing special done to it.

However, I wonder if you are using something other than a mainstream 
browser.  It is possible that you are being served a substitute 
document, intended for suspected crawlers.  Sites do that to stop, for 
example, AI being trained on the data, or, in the past, to stop people 
building databases that could be used to undermine the site owner's 
business model, or simply stripping out the adverts that pay for the site.

Could you check the size of the file, which should be 3216271 bytes, and, 
if using Linux or FreeBSD, run the "file" utility against it; you should get:

$ file ivany-map.pdf
ivany-map.pdf: PDF document, version 1.5, 1 pages
$

(I'm wondering if you have been served a compressed version and it 
hasn't been uncompressed by the download tool.  Maybe there is a 
Content-Encoding: gzip header in the response that is being ignored, 
although note that this wasn't present in the responses to either 
Firefox or wget.)
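
If the "file" utility isn't available, roughly the same checks can be 
sketched in Python.  This is only a minimal sketch: the function name 
classify_download is made up for illustration, the expected size is the 
one quoted above, and it only looks at the first few bytes of the file 
(a PDF normally starts with "%PDF-", a gzip stream with the bytes 1f 8b). 
Anything else, such as an HTML substitute page, is reported as "unknown".

```python
import os

# Size quoted earlier in this message; treat it as an assumption for any other file.
EXPECTED_SIZE = 3216271


def classify_download(path, expected_size=EXPECTED_SIZE):
    """Return (size_ok, kind) for a downloaded file.

    kind is "pdf", "gzip" or "unknown", judged only from the leading bytes.
    """
    size_ok = os.path.getsize(path) == expected_size
    with open(path, "rb") as f:
        head = f.read(5)
    if head.startswith(b"%PDF-"):
        kind = "pdf"
    elif head.startswith(b"\x1f\x8b"):
        # gzip magic number: the body was probably left compressed
        kind = "gzip"
    else:
        kind = "unknown"
    return size_ok, kind
```

Calling classify_download("ivany-map.pdf") should return (True, "pdf") 
for a good copy.  A result of "gzip" would point at a compressed body 
that was never decoded, and "unknown" with a wrong size would suggest a 
substitute document.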

For reference, these are the HTTP headers I got (there is no more input 
from me after this):

HTTP/1.1 200 OK
Cache-Control: max-age=2592000
Content-Type: application/pdf
Last-Modified: Wed, 15 Oct 2025 00:48:42 GMT
Accept-Ranges: bytes
ETag: "19e2f4736d3ddc1:0"
Server: Microsoft-IIS/10.0
X-UA-Compatible: IE=edge
Permissions-Policy: camera=(), fullscreen=(self), geolocation=(*), microphone=()
Referrer-Policy: no-referrer-when-downgrade
X-Content-Type-Options: nosniff
X-Xss-Protection: 1; mode=block
Content-Security-Policy-Report-Only: default-src 'self' *.nscc.ca; 
img-src 'self' *.nscc.ca *.gstatic.com *.fontawesome.com *.google.ca
  --- much more of the same ---
  ancestors 'self' *.nscc.ca:*;
Content-Security-Policy: frame-ancestors 'self' *.nscc.ca:*;
Date: Sat, 18 Oct 2025 23:24:06 GMT
Content-Length: 3216271
Strict-Transport-Security: max-age=157680000

Received on Sunday, 19 October 2025 00:29:38 UTC