RE: linked open data and PDF from Larry Masinter on 2015-01-21 (public-lod@w3.org from January 2015)

From: Larry Masinter <masinter@adobe.com>
Date: Wed, 21 Jan 2015 16:01:15 +0000
To: Paul Houle <ontology2@gmail.com>, Martynas Jusevičius <martynas@graphity.org>
CC: "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <DM2PR0201MB0960B1E00FF19E9CCC2F1787C3480@DM2PR0201MB0960.namprd02.prod.outlook.>

I want to back up a bit from the question people normally ask:

“I have a PDF. How do I get the data out?”
to an earlier stage:
“I have a PDF I got somehow, and I also have some data I think the PDF shows. What can I do to make sure the data is easy to get out of (a new version of) the PDF?”

Here “easy to get out” means “there are widely available, multi-platform, open source tools for pulling triples out”.

This is an easier problem, in that I’m allowed to modify the PDF. It’s also a harder problem, because I don’t want to presume anything about it – it could just be a series of photographs of handwritten receipts (like I submit with expense reports).

I think a fallback that lets you inject machine-friendly data (back) into a PDF (if it’s not already there) would be most helpful. There are many places the data could already be. Document metadata (title, author, publication date – triples where the subject is the document itself) could be in the PDF’s XMP already.

Part of why I like this is that it also works for other formats, such as images and video. (More on video later, but http://www.adobe.com/content/dam/Adobe/en/devnet/xmp/pdfs/DynamicMediaXMPPartnerGuide.pdf talks about metadata for video.)

There’s no point in duplicating document metadata already in XMP, because to get triples you just need an XMP-to-triple converter, and it’s likely that current RDF/XML tools can be adapted to that task.
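
As a rough illustration of what such an XMP-to-triple converter might look like, here is a minimal Python sketch using only the standard library. It assumes the XMP packet is stored unencrypted and uncompressed, so it can be found by scanning the raw bytes, and it handles only simple properties and rdf:li array items, not the full RDF/XML grammar. The function names and the doc_uri parameter are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def find_xmp_packet(data: bytes):
    """Return the first <x:xmpmeta>...</x:xmpmeta> span in raw bytes, or None.

    Assumption: the packet is not compressed or encrypted inside the file.
    """
    match = re.search(rb"<x:xmpmeta\b.*?</x:xmpmeta>", data, re.DOTALL)
    return match.group(0) if match else None


def _uri(clark_name: str) -> str:
    """Turn ElementTree's '{namespace}local' form into 'namespacelocal'."""
    if clark_name.startswith("{"):
        return clark_name[1:].replace("}", "", 1)
    return clark_name


def xmp_to_triples(packet: bytes, doc_uri: str):
    """Flatten simple XMP properties into (subject, predicate, object) tuples.

    Simplifications: rdf:Alt/Seq/Bag items become one triple per rdf:li,
    language tags are dropped, and nested structures are not modelled.
    """
    root = ET.fromstring(packet)
    triples = []
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        # An empty rdf:about means "this document".
        subject = desc.get(f"{{{RDF_NS}}}about") or doc_uri
        # Properties written as XML attributes on rdf:Description.
        for name, value in desc.attrib.items():
            if not name.startswith(f"{{{RDF_NS}}}"):
                triples.append((subject, _uri(name), value))
        # Properties written as child elements.
        for prop in desc:
            items = list(prop.iter(f"{{{RDF_NS}}}li"))
            if items:
                for li in items:
                    if li.text and li.text.strip():
                        triples.append((subject, _uri(prop.tag), li.text.strip()))
            elif prop.text and prop.text.strip():
                triples.append((subject, _uri(prop.tag), prop.text.strip()))
    return triples
```

Typical use would be to read the file as bytes, pass them to find_xmp_packet, and hand the result to xmp_to_triples along with whatever URI identifies the document. A real converter would more likely feed the rdf:RDF element to an existing RDF/XML parser rather than walk the tree by hand.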

Other data needs to be somewhere else. There are a lot of ways you might go about it: file attachments in PDF/A-3 in whatever format, a URL of some other resource, or (if the data is small), another XMP attribute.

If there are options, the option chosen still has to be identified. I think that could be accomplished by adding a document metadata property (“hasData”, say), something like: R hasData D in format F means you apply the triple extractor for F to the resource D to get (some of) the triples for R.
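
To make the “R hasData D in format F” rule concrete, here is a small Python sketch of the dispatch step, standard library only. Everything in it other than the hasData idea is an assumption for illustration: the registry, the two toy extractors (CSV rows of predicate,object and a flat JSON object), the format labels, and the load callback that turns a reference D into bytes.

```python
import csv
import io
import json

# Hypothetical registry: format identifier F -> triple extractor for F.
EXTRACTORS = {}


def extractor(fmt):
    """Decorator that registers a function as the extractor for one format."""
    def register(fn):
        EXTRACTORS[fmt] = fn
        return fn
    return register


@extractor("text/csv")
def csv_triples(subject, payload: bytes):
    # Toy convention (an assumption): each row is "predicate,object".
    rows = csv.reader(io.StringIO(payload.decode("utf-8")))
    return [(subject, row[0], row[1]) for row in rows if len(row) == 2]


@extractor("application/json")
def json_triples(subject, payload: bytes):
    # Toy convention (an assumption): a flat object of predicate -> value.
    return [(subject, key, value) for key, value in json.loads(payload).items()]


def triples_for(resource, has_data, load):
    """Apply "R hasData D in format F" for every (D, F) pair declared on R.

    resource: identifier of R
    has_data: iterable of (D, F) pairs read from R's metadata
    load:     callable mapping D to bytes (attachment, URL fetch, inline value)
    """
    triples = []
    for data_ref, fmt in has_data:
        extract = EXTRACTORS.get(fmt)
        if extract is None:
            continue  # no extractor known for this format; skip it
        triples.extend(extract(resource, load(data_ref)))
    return triples
```

The point of the registry is that supporting a new data-bearing format only means registering one more extractor; the code that reads the hasData statements does not change.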

That would allow anyone with a novel data-bearing file format to provide LOD for that format. Might even get you linked open data for Word files.

Larry
--
http://larry.masinter.net

Received on Wednesday, 21 January 2015 16:01:49 UTC