- From: Paul Houle <ontology2@gmail.com>
- Date: Wed, 21 Jan 2015 10:04:29 -0500
- To: Martynas Jusevičius <martynas@graphity.org>
- Cc: Larry Masinter <masinter@adobe.com>, "public-lod@w3.org" <public-lod@w3.org>
- Message-ID: <CAE__kdS02x81+yikUMoMFh6pC6Qdd2DC_vhvk4w4pY-V9=FXAw@mail.gmail.com>
You should be able to pipe the InputStream that comes out of a PDF file with PDFBox into Jena or some other RDF toolset. A much more challenging issue is developing a serializer which will take some triples from Jena (or another toolset) and make 100% sure that the output will validate as XMP.

I'm thinking of a user story: I have a large number of PDF documents on my personal computer and several mobile devices. I read standards documents the way I used to read science fiction as a kid. I read great classics of science fiction in PDF, get bank statements in PDF, send marketing material to customers as PDF. I want to be on top of my PDFs. I think the same is true for many foaf:Agents.

One issue I see is what it is all rdf:about, or what "" means. In the context of a single document this is not a problem, but a simple "linking" scenario would be to scan a collection of PDF documents and get all of the RDF into a triple store.

If I look for something that's a bad smell to me, it is the use of a non-standard date, in particular something derived from http://www.w3.org/TR/NOTE-datetime which, like xsd:dateTime, is a derivative of the ISO 8601 standard which (like IEEE 754) has never been read by anybody. Unlike xsd:dateTime this allows dropping components off the right-hand side, but at least it doesn't allow the fifth digit in the year that xsd:dateTime does.

XML Schema does have proper types for "dates" and "times" and "years" and similar entities, but the SPARQL standard does not implement an algebra that handles non-exact datetimes properly. (It should, but there are lots of issues, such as that this is an algebra of intervals, so there is not always a total ordering.)

Another thing that bugs me is the text lists in the specification for values that are enumerated in the text files.
For instance, looking at the descriptions of channel formats for sound, I see there is an "Other" in the recent specification but no way to specify any of the three competing formats that support a height channel, or that surround sound is likely to migrate to being object based rather than channel based. (Never mind the 4.0 quad mixes from the 1970s.) It would be nice to see some way to keep this up to date.

And about "ISO Standard": Adobe's messaging ought to be very clear that you can download these standards for free, because that's not usually true of "ISO Standards". We all need to get revenue, but when official standards are not freely available to end users they don't end up getting used properly.

On Mon, Jan 19, 2015 at 5:35 PM, Martynas Jusevičius <martynas@graphity.org> wrote:

> PDFBox includes metadata API, but does not mention RDF:
> https://pdfbox.apache.org/1.8/cookbook/workingwithmetadata.html
>
> On Mon, Jan 19, 2015 at 11:31 PM, Martynas Jusevičius
> <martynas@graphity.org> wrote:
> > Hey all,
> >
> > I think APIs for common languages like Java and C# to extract XMP RDF
> > from PDF Files/Streams would be much more helpful than standalone
> > tools such as Paul mentions.
> >
> > I've looked at Adobe PDF Library SDK but none of the features mention
> > metadata:
> > http://www.adobe.com/devnet/pdf/library.html
> >
> > Martynas
> > graphityhq.com
> >
> > On Mon, Jan 19, 2015 at 11:24 PM, Paul Houle <ontology2@gmail.com> wrote:
> >> I just used Acrobat Pro to look at the XMP metadata for a standards
> >> document (extra credit if you know which one) and saw something like this
> >>
> >> https://raw.githubusercontent.com/paulhoule/images/master/MetadataSample.PNG
> >>
> >> in this particular case this is fine RDF, just very little of it because
> >> nobody made an effort to fill it in. The lack of a title is particularly
> >> annoying when I am reading this document at the gym because it gets lost in
> >> a maze of twisty filenames that all look the same,
> >>
> >> I looked at some financial statements and found that some were very well
> >> annotated and some not at all. Acrobat Pro has a tool that outputs the data
> >> in RDF/XML; I can't imagine it is hard to get this data out with third
> >> party tools in most cases.
> >>
> >> On Mon, Jan 19, 2015 at 2:36 PM, Larry Masinter <masinter@adobe.com> wrote:
> >>>
> >>> I just joined this list. I’m looking to help improve the story for Linked
> >>> Open Data in PDF, to lift PDF (and other formats) from one-star to five,
> >>> perhaps using XMP. I’ve found a few hints in the mailing list archive here.
> >>> http://lists.w3.org/Archives/Public/public-lod/2014Oct/0169.html
> >>> but I’m still looking. Any clues, problem statements, sample sites?
> >>>
> >>> Larry
> >>> --
> >>> http://larry.masinter.net
> >>>
> >>
> >> --
> >> Paul Houle
> >> Expert on Freebase, DBpedia, Hadoop and RDF
> >> (607) 539 6254 paul.houle on Skype ontology2@gmail.com
> >> http://legalentityidentifier.info/lei/lookup

--
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254 paul.houle on Skype ontology2@gmail.com
http://legalentityidentifier.info/lei/lookup
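
On the extraction question discussed in this thread, here is a minimal Python sketch. It is not the PDFBox-plus-Jena pipe suggested at the top of this message, and it is not a validating XMP reader. It assumes the PDF stores its XMP metadata stream unfiltered and unencrypted, so the packet can be found by scanning the raw bytes for the xpacket markers; compressed or encrypted files would need a real PDF library. It also assumes the caller supplies a document URI (the hypothetical doc_uri parameter) to stand in for rdf:about="", which is one possible way to keep subjects distinct when many PDFs are loaded into one triple store. The function names are illustrative, and only simple literal properties are handled.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# An XMP packet is wrapped in <?xpacket begin=...?> ... <?xpacket end=...?>.
# Assumption: the metadata stream is stored as plain bytes in the file.
PACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)


def extract_xmp_packets(data):
    """Return the XML body of every XMP packet found in raw file bytes."""
    return [m.group(1).strip() for m in PACKET_RE.finditer(data)]


def _uri(tag):
    """Split ElementTree's "{namespace}local" form into (namespace, full URI)."""
    if tag.startswith("{"):
        ns, _, local = tag[1:].partition("}")
        return ns, ns + local
    return "", tag


def simple_properties(packet, doc_uri):
    """Yield (subject, predicate, literal) for simple text properties only.

    An empty rdf:about is replaced by doc_uri (a hypothetical, caller-chosen
    identifier). Structured values (rdf:Seq, rdf:Alt, nested resources) are
    skipped in this sketch.
    """
    root = ET.fromstring(packet)
    for desc in root.iter("{%s}Description" % RDF_NS):
        subject = desc.get("{%s}about" % RDF_NS, "") or doc_uri
        # Properties written in attribute shorthand on rdf:Description.
        for name, value in desc.attrib.items():
            ns, uri = _uri(name)
            if ns != RDF_NS:
                yield (subject, uri, value)
        # Properties written as child elements with plain text content.
        for child in desc:
            if len(child) == 0 and child.text and child.text.strip():
                yield (subject, _uri(child.tag)[1], child.text.strip())
```

Reading a file's bytes, calling extract_xmp_packets on them, and feeding each packet to simple_properties with a per-file URI gives tuples that could be loaded into a store. Choosing the file's own location as doc_uri is only one option; a document identifier from the metadata itself may be a better subject when copies of the same PDF live on several devices.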
Received on Wednesday, 21 January 2015 15:04:57 UTC