Re: linked open data and PDF from Alfredo Serafini on 2015-01-20 (public-lod@w3.org from January 2015)

From: Alfredo Serafini <seralf@gmail.com>
Date: Tue, 20 Jan 2015 01:02:06 +0100
To: Martynas Jusevičius <martynas@graphity.org>
Cc: Paul Houle <ontology2@gmail.com>, Larry Masinter <masinter@adobe.com>, "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <CADawF4OWC-CFXbWj62kcSp=EzkicY+R_A3Y+Lm3Rsu25KAgbUg@mail.gmail.com>

Sorry, I used the wrong link for the Aperture framework in the previous email;
this is the correct one:
http://aperture.sourceforge.net/

(a bad copy+paste from the Google search I used to find the correct URL again :-))

I suppose the project is no longer maintained, and it probably used PDFBox
as well...
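
To illustrate the name/value-to-RDF mapping mentioned in the message quoted
below, here is a minimal Python sketch. It is only an assumption of how such a
mapping could look, not what Tika, PDFBox or Aperture actually do: the key
names ("title", "creator", "subject"), the choice of Dublin Core predicates and
the helper names are all hypothetical, and the output is plain N-Triples.

```python
# Hypothetical mapping from flat metadata keys to Dublin Core predicate IRIs.
# The key names are assumptions; real extractors use their own key names.
DC = "http://purl.org/dc/elements/1.1/"
KEY_TO_PREDICATE = {
    "title": DC + "title",
    "creator": DC + "creator",
    "subject": DC + "subject",
}


def escape_literal(value):
    """Escape a string so it can be written as an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def metadata_to_ntriples(doc_iri, metadata):
    """Turn name/value pairs into N-Triples lines.

    Keys without a known predicate, and empty values, are skipped.
    """
    lines = []
    for key, value in sorted(metadata.items()):
        predicate = KEY_TO_PREDICATE.get(key)
        if predicate is None or not value:
            continue
        lines.append('<%s> <%s> "%s" .'
                     % (doc_iri, predicate, escape_literal(value)))
    return lines


if __name__ == "__main__":
    # Example input: the document IRI and values are made up for illustration.
    sample = {"title": "Sample report", "creator": "Jane Doe", "pages": "12"}
    for line in metadata_to_ntriples("http://example.org/report.pdf", sample):
        print(line)
```

The pairs themselves would come from whatever extractor is in use; only the
mapping table needs to change to target a different vocabulary.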

2015-01-20 0:58 GMT+01:00 Alfredo Serafini <seralf@gmail.com>:

> Hi
>
> PDFBox can indeed extract metadata (if I'm not wrong, the same API is
> used inside Tika [1], which is widely adopted for out-of-the-box
> indexing, for example with Solr).
> Metadata are simple name/value pairs in this context, but it's easy to
> map them directly to an explicit RDF representation.
>
> If I'm not wrong there was the Aperture project [2], which does something
> similar for several formats, but I don't know if it's still actively
> maintained...
>
>
> [1] http://tika.apache.org/
> [2]
> https://damnhandy.com/2009/08/09/adobe-xmp-packet-extraction-for-the-aperture-framework/
>
>
> 2015-01-19 23:35 GMT+01:00 Martynas Jusevičius <martynas@graphity.org>:
>
>> PDFBox includes a metadata API, but does not mention RDF:
>> https://pdfbox.apache.org/1.8/cookbook/workingwithmetadata.html
>>
>> On Mon, Jan 19, 2015 at 11:31 PM, Martynas Jusevičius
>> <martynas@graphity.org> wrote:
>> > Hey all,
>> >
>> > I think APIs for common languages like Java and C# to extract XMP RDF
>> > from PDF Files/Streams would be much more helpful than standalone
>> > tools such as Paul mentions.
>> >
>> > I've looked at Adobe PDF Library SDK but none of the features mention
>> > metadata:
>> > http://www.adobe.com/devnet/pdf/library.html
>> >
>> >
>> > Martynas
>> > graphityhq.com
>> >
>> > On Mon, Jan 19, 2015 at 11:24 PM, Paul Houle <ontology2@gmail.com> wrote:
>> >> I just used Acrobat Pro to look at the XMP metadata for a standards document
>> >> (extra credit if you know which one) and saw something like this
>> >>
>> >>
>> >> https://raw.githubusercontent.com/paulhoule/images/master/MetadataSample.PNG
>> >>
>> >> in this particular case this is fine RDF, just very little of it because
>> >> nobody made an effort to fill it in. The lack of a title is particularly
>> >> annoying when I am reading this document at the gym because it gets lost in
>> >> a maze of twisty filenames that all look the same.
>> >>
>> >> I looked at some financial statements and found that some were very well
>> >> annotated and some not at all. Acrobat Pro has a tool that outputs the data
>> >> in RDF/XML; I can't imagine it is hard to get this data out with third
>> >> party tools in most cases.
>> >>
>> >>
>> >> On Mon, Jan 19, 2015 at 2:36 PM, Larry Masinter <masinter@adobe.com> wrote:
>> >>>
>> >>> I just joined this list. I’m looking to help improve the story for Linked
>> >>> Open Data in PDF, to lift PDF (and other formats) from one-star to five,
>> >>> perhaps using XMP. I’ve found a few hints in the mailing list archive here.
>> >>> http://lists.w3.org/Archives/Public/public-lod/2014Oct/0169.html
>> >>> but I’m still looking. Any clues, problem statements, sample sites?
>> >>>
>> >>> Larry
>> >>> --
>> >>> http://larry.masinter.net
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Paul Houle
>> >> Expert on Freebase, DBpedia, Hadoop and RDF
>> >> (607) 539 6254    paul.houle on Skype   ontology2@gmail.com
>> >> http://legalentityidentifier.info/lei/lookup
>>
>>
>

Received on Tuesday, 20 January 2015 00:02:34 UTC