- From: Alfredo Serafini <seralf@gmail.com>
- Date: Tue, 20 Jan 2015 00:58:36 +0100
- To: Martynas Jusevičius <martynas@graphity.org>
- Cc: Paul Houle <ontology2@gmail.com>, Larry Masinter <masinter@adobe.com>, "public-lod@w3.org" <public-lod@w3.org>
- Message-ID: <CADawF4M+MAKJOxAtUYnODK3s30_Nmn9Fz-=+tewxFXVqE_TYsw@mail.gmail.com>
Hi,

PDFBox can indeed extract metadata (if I'm not wrong, it is the same API used inside Tika [1], which is widely adopted for indexing out of the box, for example in the Solr context). Metadata are simple name/value pairs in this context, but it is simple to map them directly to an explicit RDF representation.

If I'm not wrong, there was also the Aperture project [2], which does something similar for several formats, but I don't know if it is still actively maintained...

[1] http://tika.apache.org/
[2] https://damnhandy.com/2009/08/09/adobe-xmp-packet-extraction-for-the-aperture-framework/

2015-01-19 23:35 GMT+01:00 Martynas Jusevičius <martynas@graphity.org>:

> PDFBox includes metadata API, but does not mention RDF:
> https://pdfbox.apache.org/1.8/cookbook/workingwithmetadata.html
>
> On Mon, Jan 19, 2015 at 11:31 PM, Martynas Jusevičius
> <martynas@graphity.org> wrote:
> > Hey all,
> >
> > I think APIs for common languages like Java and C# to extract XMP RDF
> > from PDF Files/Streams would be much more helpful than standalone
> > tools such as Paul mentions.
> >
> > I've looked at Adobe PDF Library SDK but none of the features mention
> > metadata:
> > http://www.adobe.com/devnet/pdf/library.html
> >
> >
> > Martynas
> > graphityhq.com
> >
> > On Mon, Jan 19, 2015 at 11:24 PM, Paul Houle <ontology2@gmail.com> wrote:
> >> I just used Acrobat Pro to look at the XMP metadata for a standards document
> >> (extra credit if you know which one) and saw something like this
> >>
> >> https://raw.githubusercontent.com/paulhoule/images/master/MetadataSample.PNG
> >>
> >> in this particular case this is fine RDF, just very little of it because
> >> nobody made an effort to fill it in. The lack of a title is particularly
> >> annoying when I am reading this document at the gym because it gets lost in
> >> a maze of twisty filenames that all look the same,
> >>
> >> I looked at some financial statements and found that some were very well
> >> annotated and some not at all. Acrobat Pro has a tool that outputs the data
> >> in RDF/XML; I can't imagine it is hard to get this data out with third
> >> party tools in most cases.
> >>
> >>
> >> On Mon, Jan 19, 2015 at 2:36 PM, Larry Masinter <masinter@adobe.com> wrote:
> >>>
> >>> I just joined this list. I’m looking to help improve the story for Linked
> >>> Open Data in PDF, to lift PDF (and other formats) from one-star to five,
> >>> perhaps using XMP. I’ve found a few hints in the mailing list archive here.
> >>> http://lists.w3.org/Archives/Public/public-lod/2014Oct/0169.html
> >>> but I’m still looking. Any clues, problem statements, sample sites?
> >>>
> >>> Larry
> >>> --
> >>> http://larry.masinter.net
> >>>
> >>
> >>
> >>
> >> --
> >> Paul Houle
> >> Expert on Freebase, DBpedia, Hadoop and RDF
> >> (607) 539 6254 paul.houle on Skype ontology2@gmail.com
> >> http://legalentityidentifier.info/lei/lookup
> >
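
P.S. To make the name/value to RDF mapping mentioned at the top of this message a bit more concrete, here is a minimal sketch in plain Java. It does not call PDFBox or Tika at all: it assumes the metadata has already been extracted into a Map, and it simply writes N-Triples. The class name MetadataToRdf, the choice of Dublin Core properties, and the example document IRI are my own assumptions for illustration, not something taken from either library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataToRdf {

    // Assumed mapping from metadata field names to Dublin Core property IRIs.
    // The field names follow the usual PDF document information keys;
    // the choice of vocabulary is an illustration only.
    static final Map<String, String> PROPERTY_IRIS = new LinkedHashMap<>();
    static {
        PROPERTY_IRIS.put("Title", "http://purl.org/dc/elements/1.1/title");
        PROPERTY_IRIS.put("Author", "http://purl.org/dc/elements/1.1/creator");
        PROPERTY_IRIS.put("Subject", "http://purl.org/dc/elements/1.1/subject");
    }

    // Escape the characters that must not appear raw in an N-Triples literal.
    static String escapeLiteral(String value) {
        return value.replace("\\", "\\\\")
                    .replace("\"", "\\\"")
                    .replace("\n", "\\n")
                    .replace("\r", "\\r");
    }

    // Emit one triple per known metadata field; unknown fields and null
    // values are skipped rather than guessed at.
    static String toNTriples(String documentIri, Map<String, String> metadata) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String property = PROPERTY_IRIS.get(entry.getKey());
            if (property == null || entry.getValue() == null) {
                continue;
            }
            out.append('<').append(documentIri).append("> <")
               .append(property).append("> \"")
               .append(escapeLiteral(entry.getValue())).append("\" .\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical values standing in for whatever an extractor returns.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Title", "Example document");
        metadata.put("Author", "Jane Doe");
        System.out.print(toNTriples("http://example.org/doc.pdf", metadata));
    }
}
```

In a real setup the Map would be filled from whatever the extraction library returns, and a proper RDF library would probably be a better choice than string concatenation; the point here is only that the mapping step itself is small.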
Received on Monday, 19 January 2015 23:59:09 UTC