- From: Alfredo Serafini <seralf@gmail.com>
- Date: Tue, 20 Jan 2015 01:02:06 +0100
- To: Martynas Jusevičius <martynas@graphity.org>
- Cc: Paul Houle <ontology2@gmail.com>, Larry Masinter <masinter@adobe.com>, "public-lod@w3.org" <public-lod@w3.org>
- Message-ID: <CADawF4OWC-CFXbWj62kcSp=EzkicY+R_A3Y+Lm3Rsu25KAgbUg@mail.gmail.com>
Sorry, I used the wrong link for the Aperture framework in my previous email; this is the correct one:
http://aperture.sourceforge.net/
(a bad copy and paste from the Google search I ran to find the correct URL again :-))

I suppose the project is no longer maintained, and it probably used PDFBox as well...

2015-01-20 0:58 GMT+01:00 Alfredo Serafini <seralf@gmail.com>:

> Hi
>
> from PDFBox there are indeed chances to extract metadata (if I'm not wrong
> the API is the same one used inside Tika [1], which is largely adopted for
> indexing out of the box, for example in the Solr context).
> Metadata are simple name/value pairs in this context, but it's simple to
> map them directly to an explicit RDF representation.
>
> If I'm not wrong there was the project aperture [2] which does something
> similar adopting several formats, but I don't know if it's still actively
> maintained...
>
>
> [1] http://tika.apache.org/
> [2] https://damnhandy.com/2009/08/09/adobe-xmp-packet-extraction-for-the-aperture-framework/
>
>
> 2015-01-19 23:35 GMT+01:00 Martynas Jusevičius <martynas@graphity.org>:
>
>> PDFBox includes metadata API, but does not mention RDF:
>> https://pdfbox.apache.org/1.8/cookbook/workingwithmetadata.html
>>
>> On Mon, Jan 19, 2015 at 11:31 PM, Martynas Jusevičius
>> <martynas@graphity.org> wrote:
>> > Hey all,
>> >
>> > I think APIs for common languages like Java and C# to extract XMP RDF
>> > from PDF Files/Streams would be much more helpful than standalone
>> > tools such as Paul mentions.
>> >
>> > I've looked at Adobe PDF Library SDK but none of the features mention metadata:
>> > http://www.adobe.com/devnet/pdf/library.html
>> >
>> >
>> > Martynas
>> > graphityhq.com
>> >
>> > On Mon, Jan 19, 2015 at 11:24 PM, Paul Houle <ontology2@gmail.com> wrote:
>> >> I just used Acrobat Pro to look at the XMP metadata for a standards document
>> >> (extra credit if you know which one) and saw something like this
>> >>
>> >> https://raw.githubusercontent.com/paulhoule/images/master/MetadataSample.PNG
>> >>
>> >> in this particular case this is fine RDF, just very little of it because
>> >> nobody made an effort to fill it in. The lack of a title is particularly
>> >> annoying when I am reading this document at the gym because it gets lost in
>> >> a maze of twisty filenames that all look the same.
>> >>
>> >> I looked at some financial statements and found that some were very well
>> >> annotated and some not at all. Acrobat Pro has a tool that outputs the data
>> >> in RDF/XML; I can't imagine it is hard to get this data out with third
>> >> party tools in most cases.
>> >>
>> >>
>> >> On Mon, Jan 19, 2015 at 2:36 PM, Larry Masinter <masinter@adobe.com> wrote:
>> >>>
>> >>> I just joined this list. I’m looking to help improve the story for Linked
>> >>> Open Data in PDF, to lift PDF (and other formats) from one-star to five,
>> >>> perhaps using XMP. I’ve found a few hints in the mailing list archive here.
>> >>> http://lists.w3.org/Archives/Public/public-lod/2014Oct/0169.html
>> >>> but I’m still looking. Any clues, problem statements, sample sites?
>> >>>
>> >>> Larry
>> >>> --
>> >>> http://larry.masinter.net
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Paul Houle
>> >> Expert on Freebase, DBpedia, Hadoop and RDF
>> >> (607) 539 6254 paul.houle on Skype ontology2@gmail.com
>> >> http://legalentityidentifier.info/lei/lookup
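To illustrate the point in the thread above that metadata name/value pairs can be mapped directly to an explicit RDF representation, here is a minimal sketch in plain Python. It does not call PDFBox, Tika, or Aperture; it assumes the pairs have already been extracted by some tool. The `PROPERTY_MAP` table, the function names, the sample metadata, and the `http://example.org/doc.pdf` URI are all hypothetical and chosen only for illustration. The Dublin Core Terms properties are one possible target vocabulary, not something the thread prescribes.

```python
# Assumed target vocabulary: Dublin Core Terms.
DCTERMS = "http://purl.org/dc/terms/"

# Hypothetical mapping from lower-cased metadata names to RDF properties.
# The keys are illustrative; they are not taken from any library's key list.
PROPERTY_MAP = {
    "title": DCTERMS + "title",
    "author": DCTERMS + "creator",
    "subject": DCTERMS + "subject",
    "created": DCTERMS + "created",
}


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes, and line breaks for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def metadata_to_ntriples(doc_uri: str, metadata: dict) -> list:
    """Turn name/value pairs into N-Triples lines describing doc_uri.

    Names with no entry in PROPERTY_MAP, and empty values, are skipped
    rather than guessed at.
    """
    triples = []
    for name, value in metadata.items():
        prop = PROPERTY_MAP.get(name.lower())
        if prop is None or not value:
            continue
        literal = escape_literal(str(value))
        triples.append(f'<{doc_uri}> <{prop}> "{literal}" .')
    return triples


if __name__ == "__main__":
    # Made-up sample pairs standing in for whatever an extractor returns.
    sample = {"Title": 'A "sample" document', "Author": "Jane Doe", "Producer": "x"}
    for line in metadata_to_ntriples("http://example.org/doc.pdf", sample):
        print(line)
```

Unmapped names such as `Producer` are dropped here on purpose; a real mapping would need a decision on whether to keep them under a custom namespace.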
Received on Tuesday, 20 January 2015 00:02:34 UTC