RE: linked open data and PDF from Larry Masinter on 2015-01-21 (public-lod@w3.org from January 2015)

From: Larry Masinter <masinter@adobe.com>
Date: Wed, 21 Jan 2015 18:06:25 +0000
To: Paul Houle <ontology2@gmail.com>, Martynas Jusevičius <martynas@graphity.org>
CC: "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <DM2PR0201MB096002EC17C4BDE086999BE9C3480@DM2PR0201MB0960.namprd02.prod.outlook.>
Paul Houle <ontology2@gmail.com>:
> You should be able to pipe the InputStream that comes out of a PDF file w/ PDFBox into Jena or some other RDF toolset.

Hopefully. If not, you might be better off using one of the XMP toolkits; I'm not sure.
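
For illustration only, here is a minimal Python sketch of the "get the XMP out, then hand it to an RDF toolset" step that does not depend on PDFBox or an XMP toolkit. It assumes the metadata stream is stored unfiltered and unencrypted, so the XMP packet wrapper (the `<?xpacket begin=...?>` and `<?xpacket end=...?>` processing instructions) is visible in the raw file bytes. The function name `extract_xmp_packet` is made up for this example; a real PDF library is the safer route when streams are compressed.

```python
import re
from typing import Optional

# XMP packets are wrapped in <?xpacket begin=...?> ... <?xpacket end=...?>
# processing instructions; this pattern captures the body of the first one.
_XPACKET = re.compile(
    rb"<\?xpacket\s+begin=.*?\?>(.*?)<\?xpacket\s+end=.*?\?>",
    re.DOTALL,
)


def extract_xmp_packet(pdf_bytes: bytes) -> Optional[bytes]:
    """Return the XML inside the first XMP packet, or None if none is found.

    Assumption: the metadata stream is not compressed or encrypted, so the
    packet can be located by scanning the raw bytes.
    """
    match = _XPACKET.search(pdf_bytes)
    return match.group(1).strip() if match else None
```

The returned bytes are RDF/XML (usually inside an `x:xmpmeta` wrapper element) and could then be passed to whatever RDF parser you prefer.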

> A much more challenging issue is developing a serializer which will take some triples from Jena (or other toolset) and make sure 100% that it will validate as XMP.

I think the simplest serializer might be "base 64 encode gzip of" or "escape quotes in"; that is, don't try too hard to fit non-metadata triples into XMP (see below).
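
A rough Python sketch of the "base 64 encode gzip of" idea, assuming the triples have already been serialized to text (N-Triples here, but any syntax would do). The function names `pack_triples` and `unpack_triples` are invented for the example, and which XMP property would carry the resulting string is left open.

```python
import base64
import gzip


def pack_triples(serialized: str) -> str:
    """Gzip the serialized triples, then base64-encode the result.

    The output uses only the base64 alphabet, so it contains no quotes or
    angle brackets and can sit in an XML text value without escaping.
    """
    compressed = gzip.compress(serialized.encode("utf-8"))
    return base64.b64encode(compressed).decode("ascii")


def unpack_triples(packed: str) -> str:
    """Reverse pack_triples: base64-decode, then gunzip."""
    return gzip.decompress(base64.b64decode(packed)).decode("utf-8")
```

The XMP side then only ever sees one opaque text value, and the reader unpacks it before handing it to an RDF parser.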

> I'm thinking of a user story that I have a large number of PDF documents on my personal computer and several mobile devices.  I read standards documents the way I used to read science fiction as a kid.  I read great classics of science fiction in PDF,  get bank statements in PDF,  send marketing material to customers as PDF,  I want to be on top of my PDF.
> I think the same is true for many foaf:Agents.

Even those who are not such true believers may benefit if you substitute "non-HTML" for "PDF". Maybe you have a few JPEG photos, also.

> One issue I see is what it is all rdf:About,  or what "" means.  In the context of a single document this is not a problem but a simple "linking" scenario would be to scan a collection of PDF documents and get all of the RDF into a triple store.

One way to think about rdf:About: "" is that XMP normally only holds triples where the subject is the document itself: Metadata but not Data. Perhaps, when converting XMP to triples, the value of the xmpMM:InstanceID attribute could be used as the Subject. (Perhaps add a triple that the resource identified by the InstanceID GUID is a version of the resource identified by the DocumentID GUID.)
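
To make that concrete, here is a small Python sketch using only the standard library. It reads the simple (literal-valued) properties from the top-level rdf:Description elements, uses the xmpMM:InstanceID value as the subject of every triple, and adds one extra "is a version of" triple pointing at the DocumentID. Assumptions: the choice of dcterms:isVersionOf for that extra triple is mine, not something XMP defines; the function name `xmp_to_triples` is made up; and arrays, structures, and language alternatives are simply skipped.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XMPMM = "http://ns.adobe.com/xap/1.0/mm/"
# Assumption: one plausible predicate for "is a version of".
IS_VERSION_OF = "http://purl.org/dc/terms/isVersionOf"


def _uri(clark_name: str) -> str:
    """Turn ElementTree's '{namespace}local' form into a plain URI."""
    namespace, local = clark_name[1:].split("}", 1)
    return namespace + local


def xmp_to_triples(xmp_xml: str):
    """Return (subject, predicate, object) tuples keyed on the InstanceID."""
    root = ET.fromstring(xmp_xml)
    rdf_tag = "{" + RDF + "}RDF"
    rdf_root = root if root.tag == rdf_tag else root.find(".//" + rdf_tag)
    if rdf_root is None:
        raise ValueError("no rdf:RDF element found")

    props = []
    for desc in rdf_root.findall("{" + RDF + "}Description"):
        # Properties written as attributes (skip rdf:about and friends).
        for name, value in desc.attrib.items():
            if name.startswith("{") and not name.startswith("{" + RDF):
                props.append((_uri(name), value))
        # Properties written as child elements with plain text content.
        for child in desc:
            if len(child) == 0 and (child.text or "").strip():
                props.append((_uri(child.tag), child.text.strip()))

    values = dict(props)
    subject = values.get(XMPMM + "InstanceID")
    if subject is None:
        raise ValueError("no xmpMM:InstanceID found")

    triples = [(subject, predicate, obj) for predicate, obj in props]
    document_id = values.get(XMPMM + "DocumentID")
    if document_id:
        triples.append((subject, IS_VERSION_OF, document_id))
    return triples
```

With something like this, scanning a whole collection of PDFs into one triple store no longer produces a pile of statements that all claim to be about "".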

You might find in
http://www.adobe.com/content/dam/Adobe/en/devnet/xmp/pdfs/DynamicMediaXMPPartnerGuide.pdf

a method for embedding metadata about components and sources, using Ingredients and Pantry, linked by InstanceID of the sources.

> If I look at something that's a bad smell to me it is the use of a non-standard date...

I think you're talking about XMP here. I might speculate that some of the design decisions about dates and times might have been influenced by the metadata needs for images, and by the date-time capabilities of EXIF and IPTC.
But regardless of how it got there and whether it is non-standard or different-standard:
Yes, an XMP interpreter or serializer dealing with dates may need to do some conversion, although I think many of the XMP libraries do translate date-times to internal system dates. 
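
As a hedged illustration of that kind of conversion, the Python sketch below turns an XMP-style date string into a system `datetime`. It assumes the values follow the usual ISO 8601 profile with varying precision (year only, year-month, full date, or date plus time with an optional zone designator); the function name `parse_xmp_date` is invented here, and real XMP libraries may behave differently at the edges.

```python
import re
from datetime import datetime, timedelta, timezone

# Year is required; each further level of precision is optional.
_XMP_DATE = re.compile(
    r"^(\d{4})(?:-(\d{2})(?:-(\d{2})"
    r"(?:T(\d{2}):(\d{2})(?::(\d{2})(?:\.(\d+))?)?"
    r"(Z|[+-]\d{2}:\d{2})?)?)?)?$"
)


def parse_xmp_date(value: str) -> datetime:
    """Parse an XMP-style date, filling missing fields with defaults.

    Missing month/day default to 1 and missing time fields to 0, so the
    original precision is not preserved in the result.
    """
    match = _XMP_DATE.match(value.strip())
    if not match:
        raise ValueError(f"not an XMP-style date: {value!r}")
    year, month, day, hour, minute, second, frac, tz = match.groups()

    tzinfo = None
    if tz == "Z":
        tzinfo = timezone.utc
    elif tz:
        sign = 1 if tz[0] == "+" else -1
        offset = timedelta(hours=int(tz[1:3]), minutes=int(tz[4:6]))
        tzinfo = timezone(sign * offset)

    # Keep at most microsecond precision from the fractional seconds.
    micro = int((frac or "0")[:6].ljust(6, "0"))
    return datetime(
        int(year), int(month or 1), int(day or 1),
        int(hour or 0), int(minute or 0), int(second or 0),
        micro, tzinfo,
    )
```

Calling `.isoformat()` on the result gives a string most RDF toolsets will accept as an xsd:dateTime lexical form, with the caveat that a value without a zone designator stays zone-less.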

> Another thing that bugs me are text lists in the specification for values that are enumerated in the text files.... It would be nice to see some way to keep this up to date.

I'm not sure which part of XMP this complains about, but the way to keep an ISO standard up to date is through the ISO standards process. 

http://en.wikipedia.org/wiki/Portable_Document_Format#ISO_TC_171_SC_2_WG_8 for PDF.
ISO 16684 is managed by ISO/TC 130 WG 2 TF 4. 

> And about "ISO Standard",  Adobe's messaging ought to be very clear that you can download these standards for free because that's not usually true about "ISO Standards".

I'm not "Adobe messaging", but I usually just point to
http://en.wikipedia.org/wiki/Extensible_Metadata_Platform

http://en.wikipedia.org/wiki/Portable_Document_Format



On Mon, Jan 19, 2015 at 11:31 PM, Martynas Jusevičius <martynas@graphity.org> wrote:
> I think APIs for common languages like Java and C# to extract XMP RDF
> from PDF Files/Streams would be much more helpful ...
> I've looked at Adobe PDF Library SDK...

http://en.wikipedia.org/wiki/Extensible_Metadata_Platform#Free_software_and_open-source_tools_.28read.2Fwrite_support.29

might be a better place to look.

> On Mon, Jan 19, 2015 at 11:24 PM, Paul Houle <ontology2@gmail.com> wrote:
>> I just used Acrobat Pro to look at the XMP metadata for a standards document
>> (extra credit if you know which one) and saw something like this
>>
>> https://raw.githubusercontent.com/paulhoule/images/master/MetadataSample.PNG

>> in this particular case this is fine RDF,  just very little of it because
>> nobody made an effort to fill it in.  The lack of a title is particularly
>> annoying when I am reading this document at the gym because it gets lost in
>> a maze of twisty filenames that all look the same,

Your example has pdf:Producer: Microsoft(R) Word 2010
I’m pretty sure Word 2013 does a better job. I tried adding a title
and some other properties to http://5stardata.info/gtd-2.xls

and then converted to PDF using Excel 2013's built-in SaveAs to PDF.
There is an Author (Michael Hausenblas), a Created date (11/12/2010), and a
Company (DERI). The Author and Company were retained; when I added a
title, keywords, and category, it saved those too, in the XMP of the PDF
produced.

Maybe an Excel plug-in could do more with the data itself (for workflows
that use Excel to produce the PDFs they publish)?

>> I looked at some financial statements and found that some were very well
>> annotated and some not at all. 

Correlated to pdf:Producer?

>> Acrobat Pro has a tool that outputs the data
>> in RDF/XML;  I can't imagine it is hard to get this data out with third
>> party tools in most cases.

At this point, Acrobat Pro should be thought of as a "third party tool"
as far as ISO-32000 goes.

Please do note that ISO-32000-2 under development should increase
the ability to use fragment identifiers of URLs to point into PDF, and
you might want to review that work for LOD-friendliness.


Larry
--
http://larry.masinter.net
Received on Wednesday, 21 January 2015 18:06:54 UTC