Re: scientific publishing process (was Re: Cost and access)

From: Norman Gray <norman@astro.gla.ac.uk>
Date: Mon, 6 Oct 2014 23:18:13 +0100
Cc: Linking Open Data <public-lod@w3.org>, semantic-web@w3.org
Message-Id: <126C92B4-031C-4956-B6D4-53287AFA29F9@astro.gla.ac.uk>
To: Alexander Garcia Castro <alexgarciac@gmail.com>

Greetings.

On 2014 Oct 6, at 19:19, Alexander Garcia Castro <alexgarciac@gmail.com> wrote:

> querying PDFs is NOT simple and requires a lot of work -and usually
> produces lots of errors. just querying metadata is not enough. As I said
> before, I understand the PDF as something that gives me a uniform layout.
> that is ok and necessary, but not enough or sufficient within the context
> of the web of data and scientific publications. I would like to have the
> content readily available for mining purposes. if I pay for the publication
> I should get access to the publication in every format it is available. the
> content should be presented in a way so that it makes sense within the web
> of data.  if it is the full content of the paper represented in RDF or XML
> fine. also, I would like to have well annotated content, this is simple and
> something that could quite easily be part of existing publication
> workflows. it may also be part of the guidelines for authors -for instance,
> identify and annotate rhetorical structures.


The following might add something to this conversation.

It illustrates extracting the metadata from a LaTeX file, embedding it in an XMP packet in a PDF, and getting it back out of the PDF as RDF.  Pace Peter's mention of /Author, /Title, etc., it focuses only on the XMP packet.

The example carries the document metadata, the abstract, and an illustrative bit of argumentation.  Adding details about the document structure, and (RDF) pointers to any figures, would be feasible, as would, I suspect, incorporating CSV files directly into the PDF.  Incorporating \begin{tabular} tables would be rather tricky, but not impossible.  I can't help feeling that the XHTML+RDFa equivalent would be longer, and would need more documentation to tell the author where to put the RDFa magic.

It's not very fancy, and still has rough edges, but it only took me 100 minutes, from a standing start.
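
One rough edge can be seen in the transcript below: the `make info.xmp` step deletes every backslash with sed, so the `\par` in the abstract survives as the stray word `par` in the extracted RDF.  The following Python sketch reproduces that behaviour and contrasts it with a slightly gentler cleanup.  The function names and the particular rewrite rules in `detex` are assumptions made for illustration; they are not part of the workflow shown in this message.

```python
import re


def strip_backslashes(text: str) -> str:
    # Same effect as sed 's/\\//g': every backslash is deleted, so the
    # name of each control sequence is left behind as ordinary text.
    return text.replace("\\", "")


def detex(text: str) -> str:
    # Hypothetical alternative: handle a few known control sequences
    # before falling back to plain backslash removal.
    text = re.sub(r"\\par\b", " ", text)        # paragraph break -> space
    text = re.sub(r"\\LaTeX\b", "LaTeX", text)  # logo macro -> plain word
    text = text.replace("\\ ", " ")             # control space -> space
    text = text.replace("\\", "")               # anything else: drop the backslash
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace


# A fragment of the abstract as it appears in test-xmp.ttl.
raw = r"metadata in \LaTeX  \ files. \par That's because"
print(strip_backslashes(raw))
print(detex(raw))
```

The first line printed still contains `par`; the second does not.  A real fix would probably belong in the TeX macros rather than in post-processing.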

Generating and querying this PDF seems pretty simple to me.
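
The source of the `extract-xmp` program used in the transcript below is not included in this message.  As a rough idea of how little is needed, here is a Python sketch of a comparable extractor.  It assumes the XMP metadata stream is stored uncompressed in the PDF, so that the RDF can be found by scanning the raw bytes; the function name `extract_rdf` is made up for this example, and the actual C program may well work differently.

```python
from typing import Optional


def extract_rdf(pdf_bytes: bytes) -> Optional[str]:
    """Return the first <rdf:RDF>...</rdf:RDF> element found in the raw bytes.

    Assumption: the XMP packet is not compressed or encrypted, so its
    start and end tags appear literally in the file.
    """
    start = pdf_bytes.find(b"<rdf:RDF")
    if start == -1:
        return None  # no XMP packet visible in the raw bytes
    end_tag = b"</rdf:RDF>"
    end = pdf_bytes.find(end_tag, start)
    if end == -1:
        return None  # opening tag without a matching close
    # XMP embedded this way is normally UTF-8; replace anything undecodable.
    return pdf_bytes[start:end + len(end_tag)].decode("utf-8", errors="replace")
```

Reading `test-xmp.pdf` in binary mode and passing its contents to `extract_rdf` should give text of the same general shape as the RDF/XML at the end of the transcript, provided the assumption about an uncompressed stream holds.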

----

$ cat test-xmp.tex
\documentclass{article}

\usepackage{xmp-management}

\title{This is a test file}
\author{Norman Gray}
\date{2014 October 6}

\begin{document}

\maketitle

\abstract{It's easy to include metadata in \LaTeX\ files.

That's because there's plenty of metadata in there already.}

There is text and metatext within files.

\section{Further details}

In this section we could potentially discuss moving information
around.  I think we can assert that \claim{it is easy to move
  information around}, and, further, that \claim{making metadata
  readily available is a Good Thing}.  I hope that clears that up.
\end{document}
$ cat xmp-management.sty 
\ProvidesPackage{xmp-management}[2014/10/06]

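% Open \jobname.ttl on first use; \xmp@open then redefines itself to do nothing.
\newwrite\xmp@ttlfile
\def\xmp@open{\immediate\openout\xmp@ttlfile \jobname.ttl
  \let\xmp@open\relax}
% \xmp@stmt{predicate}{text}: write one Turtle statement about the document.
\long\def\xmp@stmt#1#2{%
  \xmp@open
  \write\xmp@ttlfile{<> #1 """#2""".}}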
\let\xmp@origtitle\title
\def\title#1{\xmp@stmt{dc:title}{#1}\xmp@origtitle{#1}}
\let\xmp@origauthor\author
\def\author#1{\xmp@stmt{dc:creator}{#1}\xmp@origauthor{#1}}
\let\xmp@origdate\date
\def\date#1{\xmp@stmt{dc:created}{#1}\xmp@origdate{#1}}

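% The trailing % signs stop line ends inside the macro bodies from
% becoming spurious spaces in the typeset text.
\long\def\abstract#1{%
  \xmp@stmt{dc:abstract}{#1}%
  \begin{quotation}\textbf{Abstract:} #1\end{quotation}}
% \claim records its argument as a claim and typesets it in italics.
\def\claim#1{%
  \xmp@stmt{xmpinfo:claim}{#1}%
  \emph{#1}}

% Record each section title, then hand over to the original \section.
\let\xmp@origsection\section
\def\section#1{\xmp@stmt{xmpinfo:has_section}{#1}%
  \xmp@origsection{#1}}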

\usepackage{xmpincl}
\AtBeginDocument{\includexmp{info}}
$ pdflatex test-xmp 
This is pdfTeX, Version 3.1415926-2.4-1.40.13 (TeX Live 2012)
 restricted \write18 enabled.
entering extended mode
(./test-xmp.tex
LaTeX2e <2011/06/27>
[...BLAH...]
Output written on test-xmp.pdf (1 page, 75667 bytes).
Transcript written on test-xmp.log.
$ cat test-xmp.ttl
<> dc:title """This is a test file""".
<> dc:creator """Norman Gray""".
<> dc:created """2014 October 6""".
<> dc:abstract """It's easy to include metadata in \LaTeX  \ files. \par That's because there's plenty of metadata in there already.""".
<> xmpinfo:has_section """Further details""".
<> xmpinfo:claim """it is easy to move information around""".
<> xmpinfo:claim """making metadata readily available is a Good Thing""".
$ make info.xmp
sed 's/\\//g' test-xmp.ttl | \
	  cat prefix.ttl - | \
	  rapper -iturtle -ordfxml-xmp -q - file:test-xmp.pdf | \
	  sed '/<\?xpacket/d' >info.xmp.tmp && mv info.xmp.tmp info.xmp
$ pdflatex test-xmp 
This is pdfTeX, Version 3.1415926-2.4-1.40.13 (TeX Live 2012)
 restricted \write18 enabled.
entering extended mode
(./test-xmp.tex
LaTeX2e <2011/06/27>
[...BLAH...]
Output written on test-xmp.pdf (1 page, 77069 bytes).
Transcript written on test-xmp.log.
$ make extract-xmp   
cc -Wall -o extract-xmp extract-xmp.c
$ ./extract-xmp test-xmp.pdf
<rdf:RDF xmlns:cc="http://creativecommons.org/ns#" 
xmlns:dc="http://purl.org/dc/elements/1.1/" 
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
xmlns:xapRights="http://ns.adobe.com/xap/1.0/rights/" 
xmlns:xmpinfo="http://example.org/xmpinfo" 
xml:base="file:test-xmp.pdf"> 
<rdf:Description rdf:about=""> 
<cc:license rdf:resource="http://creativecommons.org/licenses/by-nc-nd/4.0/"/> 
<xmpinfo:claim>it is easy to move information around</xmpinfo:claim> 
<xmpinfo:has_section>Further details</xmpinfo:has_section> 
<xapRights:Marked>True</xapRights:Marked> 
<xapRights:UsageTerms> 
<rdf:Alt> 
<rdf:li xml:lang="x-default">This work is licensed under a &lt;a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"&gt;Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License&lt;/a&gt;.</rdf:li> 
</rdf:Alt> 
</xapRights:UsageTerms> 
<dc:abstract>It's easy to include metadata in LaTeX files. par That's because there's plenty of metadata in there already.</dc:abstract> 
<dc:created>2014 October 6</dc:created> 
<dc:creator>Norman Gray</dc:creator> 
<dc:title>This is a test file</dc:title> 
</rdf:Description> 
</rdf:RDF>
$ 


----
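
Once the RDF/XML is out of the PDF, querying it is routine.  The sketch below uses only Python's standard `xml.etree.ElementTree` module to list the plain-literal properties of the document.  It is not a general RDF parser: it assumes the flat `rdf:Description` shape printed above, and the `SAMPLE` string is a trimmed copy of that output.  The helper name `literal_properties` is an assumption of this example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Trimmed copy of the RDF/XML printed by ./extract-xmp above.
SAMPLE = """<rdf:RDF xmlns:cc="http://creativecommons.org/ns#"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:xmpinfo="http://example.org/xmpinfo"
 xml:base="file:test-xmp.pdf">
<rdf:Description rdf:about="">
<cc:license rdf:resource="http://creativecommons.org/licenses/by-nc-nd/4.0/"/>
<xmpinfo:claim>it is easy to move information around</xmpinfo:claim>
<dc:creator>Norman Gray</dc:creator>
<dc:title>This is a test file</dc:title>
</rdf:Description>
</rdf:RDF>"""


def literal_properties(rdf_xml: str):
    """Yield (property URI, text) for each plain-literal child of rdf:Description."""
    root = ET.fromstring(rdf_xml)
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        for prop in desc:
            # ElementTree names look like "{namespace}local"; an RDF/XML
            # property URI is the namespace and local name concatenated.
            ns, _, local = prop.tag[1:].partition("}")
            # Skip resource-valued properties (no text) and nested structures.
            if len(prop) == 0 and prop.text and prop.text.strip():
                yield ns + local, prop.text.strip()


for uri, value in literal_properties(SAMPLE):
    print(uri, "->", value)
```

Because the `xmpinfo` namespace declared in the output has no trailing `#` or `/`, the claim property comes out as `http://example.org/xmpinfoclaim`.  For anything beyond a quick look, a proper RDF library would be the better tool.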

All the best,

Norman


-- 
Norman Gray  :  http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK
Received on Monday, 6 October 2014 22:18:31 UTC
