Re: linked open data and PDF from Norman Gray on 2015-01-21 (public-lod@w3.org from January 2015)

From: Norman Gray <norman@astro.gla.ac.uk>
Date: Wed, 21 Jan 2015 16:01:05 +0000
To: Herbert Van de Sompel <hvdsomp@gmail.com>
Cc: "jschneider@pobox.com" <jschneider@pobox.com>, "public-lod@w3.org" <public-lod@w3.org>
Message-Id: <C9648AAB-7AB1-41CC-AB66-00951A581ACF@astro.gla.ac.uk>

Greetings.

> On 2015 Jan 20, at 14:42, Herbert Van de Sompel <hvdsomp@gmail.com> wrote:
> 
>> Larry
>> How about HTTP Link headers (RFC 5988) to convey links and metadata expressed as links when serving PDFs? I can imagine an authoring tool embedding the info in XMP. But I have a harder time imagining a consumer application that would want to read the info via XMP.  
>> 
>> I don't: Bibliographic managers for PDFs could make use of XMP metadata. Imagine never typing another citation again!
> 
> I thought bibliographic managers already did. If I remember correctly, some STM publishers did stuff metadata in XMP containers at one point. And I would have thought that bibliographic managers would then have used that. It would be telling if they didn't.

And similarly I can imagine a PDF viewer -- or MP3 player, or web browser processing images -- looking in an XMP packet to find title and licence information, or an MP3-managing application using metadata found there to cluster things based on publisher, country, licence again, and so on; or "find me all the content with a licence I can mash up...".  Metadata good!

There's a vicious circle, though.  Developers aren't aware of RDF in general and XMP in particular, so don't look for information there.  If they do, they don't know how to parse it (XMP is a profile of RDF/XML, so it's easy to read but a bit of a pain in the neck to write (though librdf's Raptor can do it <http://librdf.org/raptor/>)).  If they look for parsers, then they're likely to find one of the biiiig SemWeb frameworks rather than a little parsing library they can drop into their application.  So they give up, even if they get that far.  And because (approximately) no-one reads this stuff, no-one writes it, and because no-one writes it, no-one forces developers to push through the pain and read it.
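To illustrate how little code the reading side can need, here is a minimal sketch in Python using only the standard library. It assumes the XMP packet sits uncompressed in the file's bytes (common, but not guaranteed) and that dc:title and dc:rights use the usual rdf:Alt/rdf:li element form. It is not a real RDF parser: RDF/XML allows other equivalent serialisations that this would miss. The function names and the SAMPLE bytes are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Assumption: the packet is stored uncompressed, so a byte scan can find it.
_PACKET = re.compile(rb"<x:xmpmeta\b.*?</x:xmpmeta>", re.DOTALL)


def find_xmp_packet(data):
    """Return the first <x:xmpmeta> element found in the bytes, or None."""
    m = _PACKET.search(data)
    return m.group(0) if m else None


def read_title_and_rights(packet):
    """Pull dc:title and dc:rights out of a packet, if present."""
    root = ET.fromstring(packet)
    found = {}
    for key in ("title", "rights"):
        # Both properties are normally language alternatives (rdf:Alt of rdf:li);
        # this takes the first alternative only.
        li = root.find(f".//dc:{key}/rdf:Alt/rdf:li", NS)
        if li is not None and li.text:
            found[key] = li.text.strip()
    return found


# Hypothetical stand-in for the bytes of a PDF with an embedded packet.
SAMPLE = b"""%PDF-1.4 (other objects) <?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Example paper</rdf:li></rdf:Alt></dc:title>
   <dc:rights><rdf:Alt><rdf:li xml:lang="x-default">CC BY 4.0</rdf:li></rdf:Alt></dc:rights>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?> (more objects)"""

packet = find_xmp_packet(SAMPLE)
if packet is not None:
    print(read_title_and_rights(packet))
```

A bibliographic manager or media player could do something like this on every file it opens, without pulling in a full SemWeb framework.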

What would break that deadlock would be (i) a killer tool depending on XMP, which makes its users nag content producers to include the stuff, or (ii) content producers routinely making the stuff available, in the hope that (i) turns up.  Hmm: I'm not holding my breath.

Your other proposal:

> But, anyhow, I admit my above sentence wasn't all that clear. I was really expressing my doubt that an "on web" Linked Data aggregating tool would start doing something special when encountering a PDF: pulling it across and then check whether anything interesting was in an XMP container.  In my below proposal, an HTTP HEAD (needed anyhow to figure out whether a resource is a PDF) would suffice to obtain the links if they were provided in the HTTP Link header. Using IANA-registered relation types, those links would end up being a bit generic but they would be readily available. And rather easily transformable into RDF.

...is interesting, but I don't think you necessarily need to involve the web to make an interesting scenario.  It's an odd thing to enthuse about on a semweb list, but the nice thing about embedded XMP is ... that it's embedded, so it can't get lost, and no ConNeg agony is involved in its extraction!
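For comparison, the Link-header route you describe is also only a few lines on the consumer side. The following is a rough sketch, not a complete RFC 5988 parser: it handles the common `<target>; rel="..."` form and turns each relation into a (subject, predicate, object) triple. Mapping a registered relation name to an IRI by prefixing it with http://www.iana.org/assignments/relation/ is an assumption made for the example, as are the function names and the example.org URLs.

```python
import re

# Assumption: registered relation names are turned into IRIs with this prefix.
IANA_REL = "http://www.iana.org/assignments/relation/"

# One link-value: <target> followed by zero or more ;name=value parameters.
_LINK = re.compile(r'<([^>]*)>((?:\s*;\s*[\w*-]+\s*=\s*(?:"[^"]*"|[^\s;,]+))*)')
_PARAM = re.compile(r';\s*([\w*-]+)\s*=\s*(?:"([^"]*)"|([^\s;,]+))')


def parse_link_header(value):
    """Split a Link header value into (target, params) pairs."""
    links = []
    for m in _LINK.finditer(value):
        params = {}
        for p in _PARAM.finditer(m.group(2)):
            val = p.group(2) if p.group(2) is not None else p.group(3)
            # Keep the first occurrence of a repeated parameter.
            params.setdefault(p.group(1).lower(), val)
        links.append((m.group(1), params))
    return links


def links_to_triples(subject, value):
    """Turn each relation type on each link into a simple triple."""
    triples = []
    for target, params in parse_link_header(value):
        # rel may hold several space-separated relation types.
        for rel in params.get("rel", "").split():
            # Extension relation types are already URIs; registered ones are bare names.
            predicate = rel if ":" in rel else IANA_REL + rel.lower()
            triples.append((subject, predicate, target))
    return triples


# Hypothetical header value, as might come back from a HEAD request on a PDF.
HEADER = ('<http://example.org/paper.rdf>; rel="describedby"; '
          'type="application/rdf+xml", '
          '<http://creativecommons.org/licenses/by/4.0/>; rel="license"')

for triple in links_to_triples("http://example.org/paper.pdf", HEADER):
    print(triple)
```

The difference from the embedded case is that this only works while the file is still being served with those headers; once the PDF is saved to disk, the links are gone and the XMP packet is all that is left.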

All the best,

Norman

-- 
Norman Gray  :  http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK

Received on Wednesday, 21 January 2015 16:01:34 UTC