Re: linked open data and PDF from Paul Houle on 2015-01-21 (public-lod@w3.org from January 2015)

From: Paul Houle <ontology2@gmail.com>
Date: Wed, 21 Jan 2015 10:04:29 -0500
To: Martynas Jusevičius <martynas@graphity.org>
Cc: Larry Masinter <masinter@adobe.com>, "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <CAE__kdS02x81+yikUMoMFh6pC6Qdd2DC_vhvk4w4pY-V9=FXAw@mail.gmail.com>
You should be able to pipe the InputStream that comes out of a PDF file w/
PDFBox into Jena or some other RDF toolset.  A much more challenging issue
is developing a serializer which will take some triples from Jena (or other
toolset) and make sure 100% that it will validate as XMP.
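
A rough sketch of the extraction half, in Python with only the standard library rather than PDFBox and Jena.  It assumes the XMP metadata stream is stored uncompressed in the file (common, but not guaranteed), so the packet can be found by scanning for its `<?xpacket ...?>` markers.  The function name `extract_xmp_rdf` is made up for illustration;  a real tool would go through a proper PDF library.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# An XMP packet sits between <?xpacket begin=...?> and <?xpacket end=...?>.
_PACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)


def extract_xmp_rdf(pdf_bytes):
    """Return the rdf:RDF element of the first XMP packet found, or None.

    Assumption: the metadata stream is not compressed or encrypted.
    """
    match = _PACKET.search(pdf_bytes)
    if match is None:
        return None
    # The packet body is XML, usually padded with trailing whitespace.
    root = ET.fromstring(match.group(1).strip())
    if root.tag == f"{{{RDF_NS}}}RDF":
        return root  # packet without an x:xmpmeta wrapper
    return root.find(f"{{{RDF_NS}}}RDF")
```

The returned element can be re-serialized with `ET.tostring` and handed to whatever RDF/XML parser you prefer.  Going the other way (triples in, guaranteed-valid XMP out) is the hard part and is not attempted here.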

I'm thinking of a user story that I have a large number of PDF documents on
my personal computer and several mobile devices.  I read standards
documents the way I used to read science fiction as a kid.  I read great
classics of science fiction in PDF,  get bank statements in PDF,  send
marketing material to customers as PDF,  I want to be on top of my PDF.

I think the same is true for many foaf:Agents.

One issue I see is what it is all rdf:about,  or what "" means.  In the
context of a single document this is not a problem but a simple "linking"
scenario would be to scan a collection of PDF documents and get all of the
RDF into a triple store.
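
To make the "" problem concrete:  an empty rdf:about is a relative reference,  so it resolves to whatever base URI the parser is given.  The sketch below (Python standard library;  the function name `subjects` and the file: URIs are hypothetical) resolves each rdf:about against a per-document URI,  which is one possible way to keep subjects from different PDFs apart when loading them into a single store.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_ABOUT = f"{{{RDF_NS}}}about"


def subjects(rdf_xml, document_uri):
    """Resolve every rdf:about in rdf_xml against document_uri."""
    root = ET.fromstring(rdf_xml)
    return [
        urljoin(document_uri, el.get(RDF_ABOUT))
        for el in root.iter()
        if RDF_ABOUT in el.attrib
    ]


xmp = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    '<rdf:Description rdf:about=""/>'
    "</rdf:RDF>"
)
# The same metadata yields a different subject for each document.
print(subjects(xmp, "file:///docs/a.pdf"))
print(subjects(xmp, "file:///docs/b.pdf"))
```

Whether a file path,  a content hash or something else makes the right base URI is exactly the open question;  the code only shows that the choice has to be made per document.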

If I look at something that's a bad smell to me it is the use of a
non-standard date,  in particular something derived from

http://www.w3.org/TR/NOTE-datetime

which,  like xsd:dateTime,  is a derivative of the ISO 8601 standard which
(like IEEE 754) has never been read by anybody. Unlike xsd:dateTime this
allows dropping components off the RHS,  but at least it doesn't allow the
fifth digit in the year that xsd:dateTime does.

XML schema does have proper types for "dates" and "times" and "years" and
similar entities but the SPARQL standard does not implement an algebra that
handles non-exact datetimes properly.  (It should,  but there are lots of
issues,  such as this is an algebra of intervals so there is not always a
total ordering.)
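
A small sketch of why the ordering is only partial.  It treats the date-only forms from NOTE-datetime (YYYY,  YYYY-MM,  YYYY-MM-DD) as intervals of days,  maps each to the XML Schema type it would correspond to,  and compares two values only when their intervals do not overlap.  Time zones and the time-of-day forms are ignored,  and the function names are made up for illustration.

```python
import re
from datetime import date, timedelta

_DATE_ONLY = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")


def to_interval(value):
    """Map YYYY, YYYY-MM or YYYY-MM-DD to (xsd type, first day, last day)."""
    m = _DATE_ONLY.match(value)
    if m is None:
        raise ValueError(f"not a date-only value: {value!r}")
    year, month, day = (int(g) if g else None for g in m.groups())
    if month is None:
        return "xsd:gYear", date(year, 1, 1), date(year, 12, 31)
    if day is None:
        first = date(year, month, 1)
        # First day of the following month, minus one day.
        following = date(year + month // 12, month % 12 + 1, 1)
        return "xsd:gYearMonth", first, following - timedelta(days=1)
    exact = date(year, month, day)
    return "xsd:date", exact, exact


def compare(a, b):
    """Return -1 or 1 for disjoint intervals, 0 for identical ones, else None."""
    _, a_lo, a_hi = to_interval(a)
    _, b_lo, b_hi = to_interval(b)
    if a_hi < b_lo:
        return -1
    if b_hi < a_lo:
        return 1
    if (a_lo, a_hi) == (b_lo, b_hi):
        return 0
    return None  # overlapping but different: no defined order


print(compare("2014", "2015-01-21"))  # 2014 ends before the day starts
print(compare("2015", "2015-01-21"))  # the day falls inside the year
```

The `None` case is the one a query engine would have to decide how to handle,  for instance when sorting or filtering on a mix of years and full dates.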

Another thing that bugs me is the text lists in the specification for values
that are enumerated in the text files.  For instance,  looking at the
descriptions of channel formats for sound I see there is an "Other" in the
recent specification but no way to specify any of the three competing
formats that support a height channel,  or that surround sound is likely to
migrate to being object based rather than channel based.  (Never mind the
4.0 quad mixes from the 1970s)  It would be nice to see some way to keep
this up to date.

And about "ISO Standard",  Adobe's messaging ought to be very clear that
you can download these standards for free because that's not usually true
about "ISO Standards".  We all need to get revenue,  but when official
standards are not freely available to end users they don't end up getting
used properly.



On Mon, Jan 19, 2015 at 5:35 PM, Martynas Jusevičius <martynas@graphity.org>
wrote:

> PDFBox includes metadata API, but does not mention RDF:
> https://pdfbox.apache.org/1.8/cookbook/workingwithmetadata.html
>
> On Mon, Jan 19, 2015 at 11:31 PM, Martynas Jusevičius
> <martynas@graphity.org> wrote:
> > Hey all,
> >
> > I think APIs for common languages like Java and C# to extract XMP RDF
> > from PDF Files/Streams would be much more helpful than standalone
> > tools such as Paul mentions.
> >
> > I've looked at Adobe PDF Library SDK but none of the features mention
> metadata:
> > http://www.adobe.com/devnet/pdf/library.html
> >
> >
> > Martynas
> > graphityhq.com
> >
> > On Mon, Jan 19, 2015 at 11:24 PM, Paul Houle <ontology2@gmail.com>
> wrote:
> >> I just used Acrobat Pro to look at the XMP metadata for a standards
> document
> >> (extra credit if you know which one) and saw something like this
> >>
> >>
> https://raw.githubusercontent.com/paulhoule/images/master/MetadataSample.PNG
> >>
> >> in this particular case this is fine RDF,  just very little of it
> because
> >> nobody made an effort to fill it in.  The lack of a title is
> particularly
> >> annoying when I am reading this document at the gym because it gets
> lost in
> >> a maze of twisty filenames that all look the same,
> >>
> >> I looked at some financial statements and found that some were very well
> >> annotated and some not at all.  Acrobat Pro has a tool that outputs the
> data
> >> in RDF/XML;  I can't imagine it is hard to get this data out with third
> >> party tools in most cases.
> >>
> >>
> >> On Mon, Jan 19, 2015 at 2:36 PM, Larry Masinter <masinter@adobe.com>
> wrote:
> >>>
> >>> I just joined this list. I’m looking to help improve the story for
> Linked
> >>> Open Data in PDF, to lift PDF (and other formats) from one-star to
> five,
> >>> perhaps using XMP. I’ve found a few hints in the mailing list archive
> here.
> >>> http://lists.w3.org/Archives/Public/public-lod/2014Oct/0169.html
> >>> but I’m still looking. Any clues, problem statements, sample sites?
> >>>
> >>> Larry
> >>> --
> >>> http://larry.masinter.net
> >>>
> >>
> >>
> >>
> >> --
> >> Paul Houle
> >> Expert on Freebase, DBpedia, Hadoop and RDF
> >> (607) 539 6254    paul.houle on Skype   ontology2@gmail.com
> >> http://legalentityidentifier.info/lei/lookup
>



-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254    paul.houle on Skype   ontology2@gmail.com
http://legalentityidentifier.info/lei/lookup
Received on Wednesday, 21 January 2015 15:04:57 UTC