Re: scientific publishing process (was Re: Cost and access) from Norman Gray on 2014-10-04 (semantic-web@w3.org from October 2014)

From: Norman Gray <norman@astro.gla.ac.uk>
Date: Sat, 4 Oct 2014 13:47:02 +0100
To: Bernadette Hyland <bhyland@3roundstones.com>
Cc: Diogo FC Patrao <djogopatrao@gmail.com>, "Peter F. Patel-Schneider" <pfpschneider@gmail.com>, semantic-web@w3.org, Linked Data community <public-lod@w3.org>
Message-Id: <69E6C389-96C9-4D89-93CA-1E4CF830BB6F@astro.gla.ac.uk>
Bernadette, hello.

On 2014 Oct 4, at 00:36, Bernadette Hyland <bhyland@3roundstones.com> wrote:

... a really useful message which pulls several of these threads together.  The following is a rather fragmentary response.

As a reference point, I tend to think "publication" = "LaTeX -> PDF".  To pre-dispel a misconception, here, I'm not being a cheerleader for PDF below, but a fair fraction of the antagonism directed towards PDF in this thread is, I think, misplaced -- PDF is not the problem.

> We'd do ourselves a huge favor if we showed (STM) publishing executives why this Linked Data stuff matters anyway.

They know.  A surprisingly large fraction of the Article Processing Charge we pay to them goes on extracting, managing and sharing metadata.  That includes DOIs, Crossref feeds, ScienceDirect, and so on and so on, and so (it seems) on.  It also includes conversion to XML: if you submit a LaTeX file to a big publisher, the first thing they'll do is convert it to XML+MathML (using workflows based on, for example, LaTeXML or TeX4ht) and preserve that; several of them then re-generate LaTeX for final production.

To a large extent, I suspect publishers now regard metadata management as their Job -- in the sense of their contribution to the scholarly endeavour -- and they could do without the dead trees.  If you can offer them a way of making metadata _insertion_ easier, which is cost effective, can be scaled up, and which a _broad_ range of authors will accept (the hard bit), they'll rip your arm off.

> 1) PDF works well for (STM) publishers who require fixed page display;

Yes, and for authors.  Given an alternative between an HTML version of a paper and a PDF version, I will _always_ choose the PDF, because it's zero-hassle, more reliably faithful to the author's original, more readable, and I can read it in the bath.

> 2) PDF doesn't take advantage of the advances we've made in machine readability;

If by this you mean RDF, then yes, the naive ways of generating PDFs are not RDF-aware.  So we shouldn't be naive...

XMP is an ISO standard (as PDF is, and like it originating from Adobe) and is a type of RDF (well, an irritatingly 90% profile of RDF, but let that pass).  Though it's not trivial, it's not hard to generate an XMP packet and get it into a PDF, and once there, the metadata job is mostly done.
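
As a rough illustration of the "generate an XMP packet" half of that, here is a minimal sketch in Python.  It assumes only a title and a list of author names, expressed with the Dublin Core dc:title and dc:creator properties; the function name make_xmp_packet is made up for this example, and getting the packet into the PDF's metadata stream is a separate step that is not shown.

```python
from xml.sax.saxutils import escape

# Fixed identifier string used in the XMP packet header.
XPACKET_ID = "W5M0MpCehiHzreSzNTczkc9d"


def make_xmp_packet(title, authors):
    """Return a minimal XMP packet as a string (hypothetical helper)."""
    # dc:creator is an ordered list, so it is serialised as an rdf:Seq.
    creators = "".join(f"<rdf:li>{escape(a)}</rdf:li>" for a in authors)
    rdf = (
        '<x:xmpmeta xmlns:x="adobe:ns:meta/">\n'
        ' <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">\n'
        '  <rdf:Description rdf:about=""\n'
        '      xmlns:dc="http://purl.org/dc/elements/1.1/">\n'
        # dc:title is a language alternative, hence rdf:Alt and xml:lang.
        '   <dc:title><rdf:Alt>'
        f'<rdf:li xml:lang="x-default">{escape(title)}</rdf:li>'
        '</rdf:Alt></dc:title>\n'
        f'   <dc:creator><rdf:Seq>{creators}</rdf:Seq></dc:creator>\n'
        '  </rdf:Description>\n'
        ' </rdf:RDF>\n'
        '</x:xmpmeta>'
    )
    header = f'<?xpacket begin="\ufeff" id="{XPACKET_ID}"?>'
    return f'{header}\n{rdf}\n<?xpacket end="w"?>'


if __name__ == "__main__":
    print(make_xmp_packet("Lecture notes", ["Norman Gray"]))
```

The result is ordinary RDF/XML wrapped in two processing instructions, which is why a generic XML or RDF parser can read it back without knowing anything about PDF.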

> 3) In fact, PDFs suck on eBook readers which are all about flexible page layout; and

Sure, but they're not intended for e-book readers, so of course they're poor at that.

> 4) We already have the necessary Web Standards to address the problem, so no need to recreate the wheel.

If, again, you mean RDF, then I agree completely.

> --> Produce a Web-based tool that allows researchers to share their [privately | publicly ] funded knowledge and produces a variety of outputs: LaTeX, PDF and carries with it a machine readable representation.

Well, not web-based: I'd want something I can run on my own machine.

> Do people agree with the following SOLUTION approach?
> 
> The international standards to solve this exist. Standards from W3C and the International Digital Publishing Forum (IDPF).[2]  Use (X)HTML for generalized document creation/rendering. Use CSS for styling. Use MathML for formulas. Use JS for action. Use RDF to model the metadata within HTML.  

PDF and XMP are both ISO standards, too.  LaTeX isn't a Standard standard, but it's pretty damn stable.

MathML one would _not_ want to type.  The only ways of generating MathML that I'm even slightly familiar with start from TeX syntax.  There are presumably GUI-based ones, too *shudder*.
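
To make the typing burden concrete, here is a small sketch in Python that builds presentation MathML for the expression written in TeX as x^2 + 1.  The helper name and the choice of expression are just for illustration; it is not how any of the converters mentioned in this thread work.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
TEX = "x^2 + 1"  # the same expression in TeX syntax


def mathml_x_squared_plus_one():
    """Return presentation MathML for x^2 + 1 (hypothetical helper)."""
    # Serialise MathML as the default namespace, without a prefix.
    ET.register_namespace("", MATHML_NS)

    def q(tag):
        return f"{{{MATHML_NS}}}{tag}"

    math = ET.Element(q("math"))
    row = ET.SubElement(math, q("mrow"))
    sup = ET.SubElement(row, q("msup"))      # superscript: base, then exponent
    ET.SubElement(sup, q("mi")).text = "x"   # identifier
    ET.SubElement(sup, q("mn")).text = "2"   # number
    ET.SubElement(row, q("mo")).text = "+"   # operator
    ET.SubElement(row, q("mn")).text = "1"
    return ET.tostring(math, encoding="unicode")


if __name__ == "__main__":
    mathml = mathml_x_squared_plus_one()
    print(TEX)
    print(mathml)
    print(len(TEX), "characters of TeX versus", len(mathml), "of MathML")
```

Seven characters of TeX turn into well over a hundred of markup, and that is before fractions, integrals or alignment come into it.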

> I propose a 'walk before we run' approach but do better than basic metadata (i.e., title, author name, institution, abstract).  Link to other scholarly communities/projects such as Vivo.[3]  

I generate Atom feeds for my PDF lecture notes.  The feed content is extracted from the XMP and from the /Author, /Title, etc, metadata within the PDF.  That metadata gets there automatically from the \author{...}, \title{...} metadata which is necessarily within the LaTeX source.  The pipeline isn't production quality, but it's done.  That much isn't challenging.
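
A minimal sketch of that kind of pipeline, in Python, might look like the following.  It is not the pipeline described above: it takes a shortcut and reads \title{...} and \author{...} straight from the LaTeX source rather than from the XMP or the PDF's /Author and /Title entries, it does not handle nested braces, and the function names and the example URL are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def latex_field(source, name):
    """Return the argument of \\name{...}, or None if it is absent.

    Simplification: the argument must not contain nested braces.
    """
    match = re.search(r"\\" + re.escape(name) + r"\{([^{}]*)\}", source)
    return match.group(1).strip() if match else None


def atom_entry(source, url, updated):
    """Build one Atom <entry> for a PDF at `url` (hypothetical helper)."""
    ET.register_namespace("", ATOM_NS)

    def q(tag):
        return f"{{{ATOM_NS}}}{tag}"

    entry = ET.Element(q("entry"))
    ET.SubElement(entry, q("title")).text = latex_field(source, "title") or "Untitled"
    author = ET.SubElement(entry, q("author"))
    ET.SubElement(author, q("name")).text = latex_field(source, "author") or "Unknown"
    ET.SubElement(entry, q("id")).text = url            # the URL doubles as the entry id
    ET.SubElement(entry, q("updated")).text = updated   # RFC 3339 timestamp
    ET.SubElement(entry, q("link"), href=url, type="application/pdf")
    return ET.tostring(entry, encoding="unicode")


if __name__ == "__main__":
    tex = r"\documentclass{article}\title{Lecture 3}\author{N. Gray}"
    print(atom_entry(tex, "http://example.org/lecture3.pdf", "2014-10-04T12:47:02Z"))
```

The point is only that once the title and author exist as structured fields somewhere, turning them into a feed entry is a few lines of glue.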

> We've got to show the 1,200 lb gorillas (STM publishers) why they want to come over to our part of the forest ... it isn't enough to stay with PDF to facilitate typesetting in 2015!  The Web has moved on & so must the publishers.  

While we're up our tree arguing, that din you can hear in the next clearing is the publishers spending their APCs on large-scale metadata extraction, and tearing out their hair at authors' apparent inability to follow simple instructions on how to make that easier.

(And just by the way: yes, publishers are in it for the money, ... monopoly rents..., yadda yadda, ... but I've never actually _caught_ one eating babies).

> Anything we do must be better than LaTeX in terms of ease-of-use.

Really?  What, exactly?

Word (and analogues)?  Sure, you can get metadata from word-processor files, but it takes a lot of heuristic effort, and requires authors to be pretty disciplined about using styles.

GUI XML editors?  I was talking to someone a couple of weeks ago who'd just completed a whole PhD detailing exactly how rubbish XML editors are in practical usability terms.

nxml-mode in Emacs?  Probably the best option for writing pointy-brackets, but still a bit painful for authoring extensive text.  And you can't write MathML.

> Publishers will make more money because their customers which include researchers & universities, will be able to discover, access and re-use data liberated from the 20th Century PDF.

That's why the publishers currently care about metadata.

----

PDFs are surprisingly flexible and open containers for transporting around Stuff (I haven't tried it, but I have little doubt you could bundle HTML, CSS and all the RDF you wanted into a PDF, should you somehow manage to devise a use-case for that).  The hard-ish bit is using that metadata in a visibly useful way -- tools tend not to rely on it, because it tends not to be there; and it tends not to be there because users don't demand it; and users don't demand it because tools don't display it.  The seriously hard bit is getting the metadata from the authors (who, to a first approximation, _really_, *really* don't care) into the PDF.

All the best,

Norman


-- 
Norman Gray  :  http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK
Received on Saturday, 4 October 2014 12:47:25 UTC