- From: Larry Masinter <masinter@adobe.com>
- Date: Sat, 27 Aug 2016 06:27:54 +0000
- To: "public-pdf-open-data@w3.org" <public-pdf-open-data@w3.org>
Greetings. I’m assuming there are some folks who have subscribed without joining the group. Welcome; please introduce yourselves. If you’re just browsing the archives, consider joining.

There seem to be a number of projects focused on scraping data out of PDF files (for example, https://docparser.com/blog/getting-started-docparser/), but the process is necessarily heuristic and incomplete. The relationship to this group is that we’re looking for a way of stuffing the results of this kind of analysis back into the PDF file in a way that IS machine readable.

I’ve gotten mixed opinions about whether a new PDF profile (akin to PDF/X, PDF/UA, PDF/E) is the right approach. Call it PDF/D, “PDF with data”. A file could be both PDF/D and PDF/UA (or even all three, adding PDF/A-3). So you could use DocParser (or some other process) to generate PDF/D versions.

PDF/D would have several optional features: “Text Available” (Yes / No), “Tables” (with named units/interpretations), and possibly images.
Received on Saturday, 27 August 2016 06:28:28 UTC