- From: Norman Gray <norman@astro.gla.ac.uk>
- Date: Thu, 2 May 2013 22:34:29 +0100
- To: Alexander Garcia Castro <alexgarciac@gmail.com>
- Cc: Linking Open Data <public-lod@w3.org>
Alexander, hello.

On 2013 May 2, at 21:46, Alexander Garcia Castro <alexgarciac@gmail.com> wrote:

> I have a simple problem. how to extract meaningful information from
> the PDF? for instance, citation data from the PDF. I would be happy if
> I could extract citation data with a 70% accuracy. so far we have
> tried a lot of tools and got very poor results. I would also like to
> know how could I get the content of the PDF, jail break the PDF so
> that I can make effective use of the content.
>
> I dont have anything against the PDF, I would be happy just by having
> an open PDF, something that gets content free.

Don't get me wrong, PDFs are a hassle to use for this purpose. But that's not because PDFs are somehow orthogonal to the deep purpose of the web; no, it's because we currently still generate PDFs in pretty dumb ways.

But that's changing. And this is relevant to the public-lod list, because one of the things that I think is driving the various 'beyond PDF' efforts[1] is the pervasive idea that pervasive links are a Good Idea, and -- dammit -- it can't be _that_ hard to make your current project a lot easier, whether that's with XMP (annoyingly limited though that is), with other in-PDF annotations, or in the last resort with structured inline quasi-markup such as "doi:10.xxxx/xxxx".

A problem is that not many people are familiar with the internal PDF model (I'm only loosely familiar with it), and that the libraries for destructuring and rummaging around in PDFs are not very easy to use (no need for 'jailbreaking'). Also, not everyone knows the important difference between PDF and PDF/A, or can get their workflow to produce the latter. But support like xmpincl exists [2] (ie, only one person has to work out how to get XMP into a generated PDF), and I currently have a side-project working out how best to include ORCID information in BibTeX files.

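As a rough illustration of why inline quasi-markup such as "doi:10.xxxx/xxxx" is cheap to exploit, here is a minimal Python sketch that pulls DOI-like strings out of text that has already been extracted from a PDF. The pattern, the function name extract_dois and the sample identifiers are all assumptions made up for the example; the pattern is a loose heuristic, not a full statement of DOI syntax, and the PDF-to-text step is assumed to happen elsewhere.

```python
import re

# Assumed heuristic: "doi:" (any case, optional space) followed by
# "10.<registrant digits>/<suffix>", where the suffix runs to the next
# whitespace, quote, or angle bracket.
DOI_PATTERN = re.compile(r'\bdoi:\s*(10\.\d{4,9}/[^\s"<>]+)', re.IGNORECASE)


def extract_dois(text):
    """Return DOI-like strings found in text, in order, without duplicates."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Drop sentence punctuation that tends to cling to the end of a DOI.
        doi = match.group(1).rstrip(".,;:)]}'")
        if doi not in found:
            found.append(doi)
    return found


# Made-up sample text; neither identifier refers to a real publication.
sample = (
    "See Smith et al. (doi:10.1234/example.5678), and also "
    "DOI: 10.5555/another-one; both are made-up identifiers."
)
print(extract_dois(sample))
```

Stripping trailing punctuation will occasionally truncate a DOI that really does end in one of those characters, which seems an acceptable trade-off if the target is something like 70% accuracy.
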
Short version: it's only a couple of years since folk clearly realised the LD aspects where the general PDF workflow is weak. I predict that in five years or so, this won't be a problem any more. _This_ is an area where journals and conference editors could give more of a lead.

All the best,

Norman

[1] I put this in quotes because 'Beyond PDF' is the most prominent and excellently-named effort in a cluster of similar efforts.

[2] http://www.ctan.org/tex-archive/macros/latex/contrib/xmpincl/

--
Norman Gray : http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK

Received on Thursday, 2 May 2013 21:34:55 UTC