- From: Alexander Garcia Castro <alexgarciac@gmail.com>
- Date: Thu, 2 May 2013 23:49:31 +0200
- To: Norman Gray <norman@astro.gla.ac.uk>, Casey McLaughlin <casey.mclaughlin@cci.fsu.edu>
- Cc: Linking Open Data <public-lod@w3.org>
Hi Norman,

I have heard the same from Adobe people: it's not the PDF, it is YOU who are not wise enough to know how to generate a PDF. Unfortunately, I don't work with PDFs generated by me; I have to deal with those coming from publishers. Probably they should attend a training on generating PDFs.

It is great to hear that "libraries for destructuring and rummaging around in PDFs are not very easy to use (no need for 'jailbreaking')". Please point us to such libraries and tutorials for destructuring the PDF. So far, for practical purposes, the content is locked up and in deep need of jailbreaking so that it can be used effectively. But, as you pointed out, it may be just because we don't know how to generate PDFs.

BTW, I am cc'ing this to Casey; we work together and we are eager to hear about those libraries.

On Thu, May 2, 2013 at 11:34 PM, Norman Gray <norman@astro.gla.ac.uk> wrote:
>
> Alexander, hello.
>
> On 2013 May 2, at 21:46, Alexander Garcia Castro <alexgarciac@gmail.com> wrote:
>
>> I have a simple problem: how to extract meaningful information from
>> the PDF? For instance, citation data from the PDF. I would be happy if
>> I could extract citation data with 70% accuracy. So far we have
>> tried a lot of tools and got very poor results. I would also like to
>> know how I could get the content of the PDF, jailbreak the PDF so
>> that I can make effective use of the content.
>>
>> I don't have anything against the PDF; I would be happy just to have
>> an open PDF, something that sets the content free.
>
> Don't get me wrong, PDFs are a hassle to use for this purpose. But that's not because PDFs are somehow orthogonal to the deep purpose of the web; no, it's because we currently still generate PDFs in pretty dumb ways.
>
> But that's changing. And this is relevant to the public-lod list, because one of the things that I think is driving the various 'beyond PDF' efforts[1] is the pervasive idea that pervasive links are a Good Idea, and -- dammit -- it can't be _that_ hard to make your current project a lot easier, whether that's with XMP (annoyingly limited though that is), with other in-PDF annotations, or in the last resort with structured inline quasi-markup such as "doi:10.xxxx/xxxx".
>
> A problem is that not many people are familiar with the internal PDF model (I'm only loosely familiar with it), and that the libraries for destructuring and rummaging around in PDFs are not very easy to use (no need for 'jailbreaking'). Also, not everyone knows the important difference between PDF and PDF/A, or can get their workflow to produce the latter. But support like xmpincl exists [2] (i.e., only one person has to work out how to get XMP into a generated PDF), and I currently have a side-project working out how best to include ORCID information in BibTeX files.
>
> Short version: it's only a couple of years since folk clearly realised the LD aspects where the general PDF workflow is weak. I predict that in five years or so, this won't be a problem any more.
>
> _This_ is an area where journals and conference editors could give more of a lead.
>
> All the best,
>
> Norman
>
>
> [1] I put this in quotes because 'Beyond PDF' is the most prominent and excellently-named effort in a cluster of similar efforts.
> [2] http://www.ctan.org/tex-archive/macros/latex/contrib/xmpincl/
>
>
> --
> Norman Gray : http://nxg.me.uk
> SUPA School of Physics and Astronomy, University of Glasgow, UK
>

--
Alexander Garcia
http://www.alexandergarcia.name/
http://www.usefilm.com/photographer/75943.html
http://www.linkedin.com/in/alexgarciac
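
As an illustration of the "structured inline quasi-markup" fallback mentioned in the quoted message, here is a minimal Python sketch that scans plain text for markers of the form "doi:10.xxxx/xxxx". It assumes the text has already been extracted from the PDF by some other tool; the pattern is a loose heuristic rather than the full DOI syntax, and the names `DOI_PATTERN` and `find_dois` are hypothetical, chosen only for this example.

```python
import re

# Assumption: an inline marker looks like "doi:10.<registrant>/<suffix>".
# This is a loose heuristic, not a complete description of DOI syntax.
DOI_PATTERN = re.compile(r'\bdoi:\s*(10\.\d{4,9}/[^\s"<>]+)', re.IGNORECASE)


def find_dois(text):
    """Return DOIs found in already-extracted plain text, in order, without duplicates."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Trim punctuation that commonly trails a DOI in running prose.
        doi = match.group(1).rstrip('.,;)')
        if doi not in found:
            found.append(doi)
    return found


sample = "See Smith et al. (doi:10.1234/abcd.5678), and again doi:10.1234/abcd.5678."
print(find_dois(sample))
```

This only helps when the publisher's workflow actually leaves such markers in the text; it does nothing for PDFs whose citation data has to be inferred from layout.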
Received on Thursday, 2 May 2013 21:50:19 UTC