some recent links from Larry Masinter on 2016-08-27 (public-pdf-open-data@w3.org from August 2016)

From: Larry Masinter <masinter@adobe.com>
Date: Sat, 27 Aug 2016 06:27:54 +0000
To: "public-pdf-open-data@w3.org" <public-pdf-open-data@w3.org>
Message-ID: <D5ED9E3C-B40D-4B57-9498-9093603B26EA@adobe.com>

Greetings – I’m assuming there are some folks who have subscribed without joining the group. Welcome; introduce yourselves.  If you’re just browsing the archives, consider joining. 

There seem to be a number of projects focused on scraping data out of PDF files, but the process is necessarily heuristic and incomplete.  The relationship to this group is that we’re looking for a way of putting the results of this kind of analysis back into the PDF file in a way that IS machine readable.


For example, https://docparser.com/blog/getting-started-docparser/


I’ve gotten mixed opinions about whether a new PDF profile (akin to PDF/X, PDF/UA, PDF/E) is the right approach; call it PDF/D, “PDF with data”.
A file could be both PDF/D and PDF/UA (and even all three with PDF/A-3).

So you could use DocParser (or some other process) and generate PDF/D versions.

PDF/D would have several optional features: “Text Available” (Yes / No), “Tables” (with named units/interpretations), and possibly images.
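
As a rough illustration of how the optional features listed above might be represented as data, here is a minimal Python sketch. Nothing like this is specified anywhere: the class names, field names, and JSON layout are all hypothetical, and how such a record would actually be carried inside a PDF file is left open.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Column:
    """One table column with a named unit and interpretation (hypothetical)."""
    name: str
    unit: str            # e.g. "USD" or "kg"
    interpretation: str  # free-text description of what the values mean


@dataclass
class Table:
    """A table recovered from the document, e.g. by a scraping tool."""
    title: str
    columns: List[Column]
    rows: List[List[str]]


@dataclass
class PdfDManifest:
    """Hypothetical record of which optional PDF/D features a file carries."""
    text_available: bool
    tables: List[Table] = field(default_factory=list)
    images: List[str] = field(default_factory=list)  # image identifiers

    def to_json(self) -> str:
        # asdict() converts nested dataclasses into plain dicts and lists.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Example: a file with extractable text and one table, no images.
manifest = PdfDManifest(
    text_available=True,
    tables=[
        Table(
            title="Invoice lines",
            columns=[
                Column("item", "", "product description"),
                Column("amount", "USD", "line total before tax"),
            ],
            rows=[["Widget", "12.50"], ["Gadget", "7.25"]],
        )
    ],
)

serialized = manifest.to_json()
```

A scraping tool could emit a record along these lines next to its output, so that a later reader does not have to repeat the heuristic analysis.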

Received on Saturday, 27 August 2016 06:28:28 UTC