Re: linked open data and PDF

There’s a survey by Ross Mounce here: http://rossmounce.co.uk/2013/01/06/pdf-metadata-using-exiftool/


Regards

Rod

---------------------------------------------------------
Roderic Page
Professor of Taxonomy
Institute of Biodiversity, Animal Health and Comparative Medicine
College of Medical, Veterinary and Life Sciences
Graham Kerr Building
University of Glasgow
Glasgow G12 8QQ, UK

Email:  Roderic.Page@glasgow.ac.uk
Tel:  +44 141 330 4778
Skype:  rdmpage
Facebook:  http://www.facebook.com/rdmpage
LinkedIn:  http://uk.linkedin.com/in/rdmpage
Twitter:  http://twitter.com/rdmpage
Blog:  http://iphylo.blogspot.com
ORCID:  http://orcid.org/0000-0002-7101-9767
Citations:  http://scholar.google.co.uk/citations?hl=en&user=4Z5WABAAAAAJ
ResearchGate:  https://www.researchgate.net/profile/Roderic_Page



On 21 Jan 2015, at 16:32, Paul Houle <ontology2@gmail.com> wrote:

      I think the world needs a survey of XMP metadata in the field.  Only by inspection of a large set of diverse files can we say how good or bad the situation actually is.

      There ought to be a tool that gives XMP-annotated documents a point score for metadata quality; you ought to get a lot of points for having the simple things that were missing in the document exported from Word, like the title, author, copyright, etc.

      Note that this is not just about PDF: many kinds of media files are tagged with XMP, so it really is about XMP, not just PDF.
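
The point-score idea above could be sketched roughly as follows. This is only an illustration: the field names and weights are hypothetical (the message asks only that the simple things such as title, author and copyright earn a lot of points), and the function assumes the metadata has already been extracted into a plain dictionary by some other tool.

```python
# Hypothetical weights: not from the thread, chosen only so that the
# "simple things" (title, author, copyright) dominate the score.
WEIGHTS = {
    "title": 30,
    "author": 30,
    "copyright": 20,
    "description": 10,
    "keywords": 10,
}


def metadata_score(fields):
    """Return a 0-100 score for a dict of already-extracted metadata fields."""
    score = 0
    for name, points in WEIGHTS.items():
        value = fields.get(name)
        if isinstance(value, str):
            value = value.strip()  # whitespace-only values count as missing
        if value:
            score += points
    return score


if __name__ == "__main__":
    # Title present, author empty, everything else missing.
    print(metadata_score({"title": "Untitled report", "author": ""}))  # prints 30
```

A real scorer would presumably also penalise placeholder values (a title of "Microsoft Word - Document1", say), which this sketch does not attempt.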

On Wed, Jan 21, 2015 at 11:01 AM, Norman Gray <norman@astro.gla.ac.uk> wrote:

Greetings.

> On 2015 Jan 20, at 14:42, Herbert Van de Sompel <hvdsomp@gmail.com> wrote:
>
>> Larry
>> How about HTTP Link headers (RFC 5988) to convey links and metadata expressed as links when serving PDFs? I can imagine an authoring tool embedding the info in XMP. But I have a harder time imagining a consumer application that would want to read the info via XMP.
>>
>> I don't: Bibliographic managers for PDFs could make use of XMP metadata. Imagine never typing another citation again!
>
> I thought bibliographic managers already did. If I remember correctly, some STM publishers did stuff metadata in XMP containers at one point. And I would have thought that bibliographic managers would then have used that. It would be telling if they didn't.

And similarly I can imagine a PDF viewer -- or MP3 player, or web browser processing images -- looking in an XMP packet to find title and licence information, or an MP3-managing application using metadata found there to cluster things based on publisher, country, licence again, and so on; or "find me all the content with a licence I can mash up...".  Metadata good!

There's a vicious circle, though.  Developers aren't aware of RDF in general and XMP in particular, so don't look for information there.  If they do, they don't know how to parse it (XMP is a profile of RDF/XML, so it's easy to read but a bit of a pain in the neck to write (though librdf's Raptor can do it <http://librdf.org/raptor/>)).  If they look for parsers, then they're likely to find one of the biiiig SemWeb frameworks rather than a little parsing library they can drop into their application.  So they give up, even if they get that far.  And because (approximately) no-one reads this stuff, no-one writes it, and because no-one writes it, no-one forces developers to push through the pain and read it.

What would break that deadlock would be (i) a killer tool depending on XMP, which makes its users nag content producers to include the stuff, or (ii) content producers routinely making the stuff available, in the hope that (i) turns up.  Hmm: I'm not holding my breath.
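
On the parsing point above: because XMP is a profile of RDF/XML, a few common properties can be read with nothing more than a standard library XML parser. The following is a minimal sketch, not a general RDF parser. It assumes the packet is stored uncompressed in the file, handles only the element form of dc:title, dc:creator and dc:rights, and ignores properties written as attributes on rdf:Description. The function name read_xmp and the sample bytes are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"

# An XMP packet sits between <?xpacket begin=...?> and <?xpacket end=...?>.
PACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)


def read_xmp(data):
    """Return {property: [values]} for a few Dublin Core properties, or {}."""
    match = PACKET_RE.search(data)
    if match is None:
        return {}
    try:
        root = ET.fromstring(match.group(1).strip())
    except ET.ParseError:
        return {}
    found = {}
    for prop in ("title", "creator", "rights"):
        values = []
        # Values live in rdf:li items inside an rdf:Alt, rdf:Seq or rdf:Bag.
        for element in root.iter(f"{{{DC}}}{prop}"):
            for item in element.iter(f"{{{RDF}}}li"):
                if item.text and item.text.strip():
                    values.append(item.text.strip())
        if values:
            found[prop] = values
    return found


# Made-up sample: a packet embedded in otherwise unrelated bytes.
SAMPLE = b"""%PDF-1.4 (other bytes)
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">A survey of XMP</rdf:li></rdf:Alt></dc:title>
   <dc:creator><rdf:Seq><rdf:li>A. Author</rdf:li></rdf:Seq></dc:creator>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
"""

if __name__ == "__main__":
    print(read_xmp(SAMPLE))
```

Scanning the raw bytes for the packet wrapper, rather than parsing the container format, is what lets the same function be pointed at a PDF, a JPEG or an MP3; the trade-off is that it misses packets a producer has chosen to compress.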

Your other proposal:

> But, anyhow, I admit my above sentence wasn't all that clear. I was really expressing my doubt that an "on web" Linked Data aggregating tool would start doing something special when encountering a PDF: pulling it across and then checking whether anything interesting was in an XMP container.  In my proposal below, an HTTP HEAD (needed anyhow to figure out whether a resource is a PDF) would suffice to obtain the links if they were provided in the HTTP Link header. Using IANA-registered relation types, those links would end up being a bit generic but they would be readily available. And rather easily transformable into RDF.

...is interesting, but I don't think you necessarily need to involve the web to make an interesting scenario.  It's an odd thing to enthuse about on a semweb list, but the nice thing about embedded XMP is ... that it's embedded, so it can't get lost, and no ConNeg agony is involved in its extraction!
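
For comparison, the "rather easily transformable into RDF" step of the Link-header proposal quoted above might look something like this. It is a sketch under several assumptions: the header value has already been obtained (for instance from a HEAD response), link targets are absolute URIs, the anchor parameter is ignored, and registered relation names are turned into URIs by prefixing http://www.iana.org/assignments/relation/, which is a common convention rather than anything specified in this thread. The function name and example URIs are hypothetical.

```python
import re

# Assumed convention for turning a registered relation name into a URI.
IANA_REL = "http://www.iana.org/assignments/relation/"

# One link-value: <target> followed by zero or more ;name=value parameters.
LINK_RE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[^;,=\s]+(?:\s*=\s*(?:"[^"]*"|[^;,\s]*))?)*)'
)
# One parameter; the value may be a quoted string or a bare token.
PARAM_RE = re.compile(r';\s*([^;,=\s]+)(?:\s*=\s*(?:"([^"]*)"|([^;,\s]*)))?')


def link_header_to_ntriples(subject, header):
    """Turn an HTTP Link header value into a list of N-Triples lines."""
    triples = []
    for target, params in LINK_RE.findall(header):
        for name, quoted, bare in PARAM_RE.findall(params):
            if name.lower() != "rel":
                continue
            # rel may carry several space-separated relation types.
            for rel in (quoted or bare).split():
                # Extension relation types are already URIs; registered
                # ones are short tokens that need the assumed prefix.
                predicate = rel if ":" in rel else IANA_REL + rel.lower()
                triples.append(f"<{subject}> <{predicate}> <{target}> .")
    return triples


if __name__ == "__main__":
    header = (
        '<http://example.org/paper>; rel="describedby"; '
        'type="application/rdf+xml", '
        '<http://creativecommons.org/licenses/by/4.0/>; rel="license"'
    )
    for line in link_header_to_ntriples("http://example.org/paper.pdf", header):
        print(line)
```

The two sketches are complementary rather than competing: this one needs only the response headers, while the embedded packet travels with the file once it has left the server.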

All the best,

Norman


--
Norman Gray  :  http://nxg.me.uk/
SUPA School of Physics and Astronomy, University of Glasgow, UK





--
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254    paul.houle on Skype   ontology2@gmail.com
http://legalentityidentifier.info/lei/lookup

Received on Friday, 23 January 2015 09:08:18 UTC