Re: linked open data and PDF from Alfredo Serafini on 2015-01-19 (public-lod@w3.org from January 2015)

From: Alfredo Serafini <seralf@gmail.com>
Date: Tue, 20 Jan 2015 00:58:36 +0100
To: Martynas Jusevičius <martynas@graphity.org>
Cc: Paul Houle <ontology2@gmail.com>, Larry Masinter <masinter@adobe.com>, "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <CADawF4M+MAKJOxAtUYnODK3s30_Nmn9Fz-=+tewxFXVqE_TYsw@mail.gmail.com>

Hi

PDFBox does indeed make it possible to extract metadata (if I'm not wrong,
the same API is used inside Tika [1], which is widely adopted for
indexing out of the box, for example with Solr).
In this context the metadata are simple name/value pairs, but it's easy to
map them directly to an explicit RDF representation.
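
As a rough illustration of that mapping, here is a minimal sketch in plain
Python (no PDF library involved). It assumes the name/value pairs have already
been extracted, for example by PDFBox or Tika, into a dictionary. The key names
("Title", "Author", "Subject"), the choice of Dublin Core Terms predicates, and
the function names are my own assumptions for the example, not something taken
from either library.

```python
# Sketch: turn already-extracted PDF metadata (name/value pairs) into N-Triples.
# The key-to-predicate table below is a hypothetical mapping chosen for this
# example; a real mapping depends on which keys the extractor returns.
DCTERMS = "http://purl.org/dc/terms/"

KEY_TO_PREDICATE = {
    "Title": DCTERMS + "title",
    "Author": DCTERMS + "creator",
    "Subject": DCTERMS + "description",
}


def escape_literal(value):
    """Escape the characters that must be escaped inside an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def metadata_to_ntriples(doc_uri, metadata):
    """Return one N-Triples line per known, non-empty metadata entry."""
    lines = []
    for key, value in metadata.items():
        predicate = KEY_TO_PREDICATE.get(key)
        if predicate is None or not value:
            continue  # skip keys without a mapping and empty values
        lines.append(
            '<%s> <%s> "%s" .' % (doc_uri, predicate, escape_literal(value))
        )
    return "\n".join(lines)


if __name__ == "__main__":
    sample = {"Title": "Annual report", "Author": "Jane Doe", "Producer": "x"}
    print(metadata_to_ntriples("http://example.org/report.pdf", sample))
```

Keys with no entry in the table (such as "Producer" above) are silently
dropped; a fuller mapping would need to decide what to do with them.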

If I'm not wrong there was also the Aperture project [2], which does something
similar for several formats, but I don't know if it's still actively
maintained...


[1] http://tika.apache.org/
[2] https://damnhandy.com/2009/08/09/adobe-xmp-packet-extraction-for-the-aperture-framework/


2015-01-19 23:35 GMT+01:00 Martynas Jusevičius <martynas@graphity.org>:

> PDFBox includes metadata API, but does not mention RDF:
> https://pdfbox.apache.org/1.8/cookbook/workingwithmetadata.html
>
> On Mon, Jan 19, 2015 at 11:31 PM, Martynas Jusevičius
> <martynas@graphity.org> wrote:
> > Hey all,
> >
> > I think APIs for common languages like Java and C# to extract XMP RDF
> > from PDF Files/Streams would be much more helpful than standalone
> > tools such as Paul mentions.
> >
> > I've looked at Adobe PDF Library SDK but none of the features mention
> > metadata:
> > http://www.adobe.com/devnet/pdf/library.html
> >
> >
> > Martynas
> > graphityhq.com
> >
> > On Mon, Jan 19, 2015 at 11:24 PM, Paul Houle <ontology2@gmail.com> wrote:
> >> I just used Acrobat Pro to look at the XMP metadata for a standards
> >> document (extra credit if you know which one) and saw something like this
> >>
> >>
> >> https://raw.githubusercontent.com/paulhoule/images/master/MetadataSample.PNG
> >>
> >> in this particular case this is fine RDF,  just very little of it because
> >> nobody made an effort to fill it in.  The lack of a title is particularly
> >> annoying when I am reading this document at the gym because it gets lost in
> >> a maze of twisty filenames that all look the same,
> >>
> >> I looked at some financial statements and found that some were very well
> >> annotated and some not at all.  Acrobat Pro has a tool that outputs the data
> >> in RDF/XML;  I can't imagine it is hard to get this data out with third
> >> party tools in most cases.
> >>
> >>
> >> On Mon, Jan 19, 2015 at 2:36 PM, Larry Masinter <masinter@adobe.com> wrote:
> >>>
> >>> I just joined this list. I’m looking to help improve the story for Linked
> >>> Open Data in PDF, to lift PDF (and other formats) from one-star to five,
> >>> perhaps using XMP. I’ve found a few hints in the mailing list archive here.
> >>> http://lists.w3.org/Archives/Public/public-lod/2014Oct/0169.html
> >>> but I’m still looking. Any clues, problem statements, sample sites?
> >>>
> >>> Larry
> >>> --
> >>> http://larry.masinter.net
> >>>
> >>
> >>
> >>
> >> --
> >> Paul Houle
> >> Expert on Freebase, DBpedia, Hadoop and RDF
> >> (607) 539 6254    paul.houle on Skype   ontology2@gmail.com
> >> http://legalentityidentifier.info/lei/lookup
>
>

Received on Monday, 19 January 2015 23:59:09 UTC