- From: Peter F. Patel-Schneider <pfpschneider@gmail.com>
- Date: Mon, 06 Oct 2014 16:38:03 -0700
- To: Norman Gray <norman@astro.gla.ac.uk>, Alexander Garcia Castro <alexgarciac@gmail.com>
- CC: Linking Open Data <public-lod@w3.org>, semantic-web@w3.org
Neat.

This could be extended to putting a full table of contents into the metadata, and in lots of other ways.

The other nice thing about it is that it would be possible to push the same data through a LaTeX to HTML toolchain for those who want HTML output.

peter

On 10/06/2014 03:18 PM, Norman Gray wrote:
>
> Greetings.
>
> On 2014 Oct 6, at 19:19, Alexander Garcia Castro <alexgarciac@gmail.com> wrote:
>
>> querying PDFs is NOT simple and requires a lot of work -and usually
>> produces lots of errors. just querying metadata is not enough. As I said
>> before, I understand the PDF as something that gives me a uniform layout.
>> that is ok and necessary, but not enough or sufficient within the context
>> of the web of data and scientific publications. I would like to have the
>> content readily available for mining purposes. if I pay for the publication
>> I should get access to the publication in every format it is available. the
>> content should be presented in a way so that it makes sense within the web
>> of data. if it is the full content of the paper represented in RDF or XML
>> fine. also, I would like to have well annotated content, this is simple and
>> something that could quite easily be part of existing publication
>> workflows. it may also be part of the guidelines for authors -for instance,
>> identify and annotate rhetorical structures.
>
>
> The following might add something to this conversation.
>
> It illustrates getting the metadata from a LaTeX file, putting it into an XMP packet in a PDF, and getting it out of the PDF as RDF. Pace Peter's mention of /Author, /Title, etc, this just focuses on the XMP packet.
>
> This has the document metadata, the abstract, and an illustrative bit of argumentation. Adding details about the document structure, and (RDF) pointers to any figures would be feasible, as would, I suspect, incorporating CSV files directly into the PDF. Incorporating \begin{tabular} tables would be rather tricky, but not impossible.
>
> I can't help feeling that the XHTML+RDFa equivalent would be longer and need more documentation to instruct the author where to put the RDFa magic.
>
> It's not very fancy, and still has rough edges, but it only took me 100 minutes, from a standing start.
>
> Generating and querying this PDF seems pretty simple to me.
>
> ----
> [...]
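
For readers who want to try the "getting it out of the PDF as RDF" step, here is a minimal sketch in Python. It is not the code from the quoted message (which is elided above); it assumes the XMP packet is stored uncompressed in the PDF, as is usual, so that a plain byte scan for the xpacket markers finds it. The function name extract_xmp_rdf is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# An XMP packet is wrapped in <?xpacket begin=...?> ... <?xpacket end=...?>.
# Assumption: the metadata stream is uncompressed, so the markers are
# visible in the raw bytes of the PDF file.
_XPACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def extract_xmp_rdf(pdf_bytes):
    """Return the rdf:RDF element of the first XMP packet, or None."""
    match = _XPACKET.search(pdf_bytes)
    if match is None:
        return None
    # The packet body is RDF/XML, usually inside an x:xmpmeta wrapper.
    root = ET.fromstring(match.group(1).strip())
    if root.tag == f"{{{RDF_NS}}}RDF":
        return root
    return root.find(f"{{{RDF_NS}}}RDF")
```

The returned element is ordinary RDF/XML, so it can be serialised with ET.tostring and handed to any RDF parser for querying.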
Received on Monday, 6 October 2014 23:38:34 UTC