getting more data into PDF

On April 11, 2015 5:30:53 PM EDT, Larry Masinter <masinter@adobe.com> wrote:


>>I’d like to discuss some ideas around how to get more data, linked
>>data, into PDF (and perhaps video and images), and real publication
>>workflows, so that file format isn’t a barrier to open data. 
>>
>>I hope to have a demo or at least some examples to show.


On 4/12/15, 6:39 AM, "Sandro Hawke" <sandro@w3.org> wrote:

>
>It took me a minute to understand you meant you want to discuss this at the open meeting, not that you want to discuss this on this mailing list, ... and I'm the one who phrased the text saying to send presentation proposals here.  Heh.
>
>It's hard for me to come up with a workflow for this, so I'll be interested in what you have to say. ...


How to put Data into PDF

There are lots of places to put data or links to it, in a PDF.  I’m exploring multiple options which could be combined:


* XMP: the document metadata (triples whose subject is “this document”). XMP provides a standard place for title, author, and other Dublin Core properties.
There’s support for XMP in many tools, although we should qualify which ones are good for “open data”.

* annotations: PDF supports “annotations” where you can annotate text with markup, but there are many kinds of annotation, and we could define one for “data” or “link to data” where the body of the annotation is some compact representation. The workflow is like that of RDFa, isn’t it?

* form-data: PDF supports forms and form-fields which can be filled with data. This could translate to a set of assertions based on form-data.

* role maps: PDF at the lowest level has ‘role maps’ where elements, named by strings, can be given ‘role’ labels. It’s likely that this level of annotation needs to be captured at the time of initial document generation.

* document structure: PDF supports document structure (basically section bookmarks).

* file attachment: PDF/A-3 allows arbitrary file attachments, so the associated data (in a compact form) can be attached to the document. The resulting document/data package can be published, signed, authenticated, and backed up as a package. Re-injecting the data into a document could be done by a small open source utility.

* Manifest in XMP: Perhaps we could generalize the notion of data-bearing-presentation-form to video and images, leveraging the fact there are common metadata standards (including XMP). The metadata could contain a “data manifest”, metadata of the form “This document describes results based on data <url-of-data> interpreted as <description of data format>”. This might then form a unifying API for data extraction for many kinds of media.


* In XMP blob: not so pretty, but you could add the data in the XMP itself; there are some prior XMP applications which did this.
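
To make the “Manifest in XMP” option a little more concrete, here is a minimal sketch of what a data manifest might look like when serialized as an XMP packet. The `dm:` namespace and its `dataSource` and `dataFormat` properties are hypothetical names invented for illustration; nothing like them is defined by XMP today. Only the packet wrapper, `x:xmpmeta`, and the RDF elements follow existing XMP conventions.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
X_NS = "adobe:ns:meta/"
# Hypothetical namespace for the manifest properties (an assumption, not a standard).
DM_NS = "http://example.org/ns/data-manifest#"


def build_manifest_packet(data_url, data_format):
    """Return an XMP packet (as a string) saying: this document is based on
    the data at data_url, interpreted as data_format."""
    ET.register_namespace("x", X_NS)
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dm", DM_NS)

    xmpmeta = ET.Element(f"{{{X_NS}}}xmpmeta")
    rdf = ET.SubElement(xmpmeta, f"{{{RDF_NS}}}RDF")
    # rdf:about="" is the XMP convention for "this document" as the subject.
    desc = ET.SubElement(rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": ""})
    # Link to the data as an RDF resource, and describe its format as a literal.
    ET.SubElement(desc, f"{{{DM_NS}}}dataSource", {f"{{{RDF_NS}}}resource": data_url})
    fmt = ET.SubElement(desc, f"{{{DM_NS}}}dataFormat")
    fmt.text = data_format

    body = ET.tostring(xmpmeta, encoding="unicode")
    # Standard XMP packet wrapper around the serialized metadata.
    return (
        f'<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
        f"{body}\n"
        f'<?xpacket end="w"?>'
    )
```

Calling `build_manifest_packet("http://example.org/results.csv", "text/csv")` yields a packet that any XMP-aware tool could carry along, whether the host file is a PDF, an image, or a video. Writing it into a particular file format is left out here, since that depends on the library used.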



Current attempts to deal with PDF have all focused on ‘scraping’ data out of a representation where data retention hasn’t been a requirement. Just as with RDFa and HTML, the publisher has to do some more work to retain and/or reinject data in the publication process to actually meet the requirements.

Getting data out of PDF:

Now that all the major browsers have their own PDF readers, updating the viewing experience for annotations and attachments requires broader attention, but that seems like one of the consequences of open standards.

But it seems like there are sufficient libraries for XMP and PDF that the open source tools for RDFa might be enhanced to deal with PDFs with data, too. Which tools, and which functions, would be an interesting discussion.
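
As a rough illustration of how little is needed on the extraction side, the sketch below scans the raw bytes of a file for an XMP packet and lists the properties asserted about “this document”. It assumes the metadata stream is stored uncompressed and unencrypted, which is common for XMP but not guaranteed, and it only handles simple literal and `rdf:resource` values (structured values such as language alternatives come back empty). The function name is made up for this example.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Matches the content between the XMP packet header and trailer.
PACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)


def extract_xmp_properties(file_bytes):
    """Return (property, value) pairs from the first XMP packet found in
    file_bytes, or an empty list if no packet is present."""
    match = PACKET_RE.search(file_bytes)
    if match is None:
        return []
    root = ET.fromstring(match.group(1).strip())
    props = []
    # Each rdf:Description groups properties about one subject
    # (normally the document itself).
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        for child in desc:
            # Prefer a linked resource; otherwise fall back to literal text.
            value = child.get(f"{{{RDF_NS}}}resource") or (child.text or "").strip()
            props.append((child.tag, value))
    return props
```

Because the scan does not parse the PDF structure at all, the same function could be pointed at a JPEG or another container that embeds an XMP packet. A real tool would more likely use a proper PDF or XMP library.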

Received on Tuesday, 14 April 2015 00:02:40 UTC