
Re: scientific publishing process (was Re: Cost and access)

From: Peter F. Patel-Schneider <pfpschneider@gmail.com>
Date: Mon, 06 Oct 2014 18:58:28 -0700
Message-ID: <543348C4.4080804@gmail.com>
To: Kingsley Idehen <kidehen@openlinksw.com>, "semantic-web@w3.org" <semantic-web@w3.org>
On 10/06/2014 06:19 PM, Kingsley Idehen wrote:
> On 10/6/14 2:49 PM, Peter F. Patel-Schneider wrote:
>> On 10/06/2014 11:03 AM, Kingsley Idehen wrote:

[Metadata discussion removed to concentrate on the other part of the discussion.]

>>> 1. The extractors are platform specific -- AWWW is about platform agnosticism
>>> (I don't want to mandate an OS for experiencing the power of Linked Open Data
>>> transformers / rdfizers)
>> Well, the extractors would be specific to PDF, but that's hardly surprising,
>> I think.
>>> 2. It isn't solely about metadata  -- we also have raw data inside these
>>> documents confined to Tables, paragraphs of sentences
>> Well, sure, but is extracting information directly from the figures or
>> tables or text being considered here?  I sure would like this to be
>> possible.  How would it work in an HTML context?
> Each table is a Class.
> Each table record is an instance of the Class represented by the table.
> Each table field is a property of a Class represented by the table.
> Each table field value's data type can be used to discern the range of each
> Class property.
> Depending on what the sentences and paragraphs are about you can make an RDF
> statement per sentence.
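
For concreteness, the table mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `http://example.org/` namespace, the function name `table_to_triples`, the URI layout for rows and fields, and the choice to infer each range from the first row are all hypothetical, not part of any existing tool.

```python
# A minimal sketch of the table-to-RDF mapping; all names are illustrative.
BASE = "http://example.org/"  # hypothetical namespace (an assumption)
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
XSD = "http://www.w3.org/2001/XMLSchema#"

RDF_TYPE = RDF + "type"
RDF_PROPERTY = RDF + "Property"
RDFS_CLASS = RDFS + "Class"
RDFS_DOMAIN = RDFS + "domain"
RDFS_RANGE = RDFS + "range"

# Python value type -> XSD datatype used as the property's range.
XSD_TYPES = {
    bool: XSD + "boolean",
    int: XSD + "integer",
    float: XSD + "double",
    str: XSD + "string",
}


def table_to_triples(table_name, header, rows):
    """Return (subject, predicate, object) tuples for one table."""
    cls = BASE + table_name
    # Each table is a Class.
    triples = [(cls, RDF_TYPE, RDFS_CLASS)]

    # Each table field is a property whose domain is that Class.
    for col, field in enumerate(header):
        prop = f"{cls}#{field}"
        triples.append((prop, RDF_TYPE, RDF_PROPERTY))
        triples.append((prop, RDFS_DOMAIN, cls))
        # The range is guessed from the first row's value type (an assumption).
        if rows:
            datatype = XSD_TYPES.get(type(rows[0][col]), XSD + "string")
            triples.append((prop, RDFS_RANGE, datatype))

    # Each table record is an instance of the Class.
    for number, row in enumerate(rows, start=1):
        subject = f"{cls}/{number}"
        triples.append((subject, RDF_TYPE, cls))
        for field, value in zip(header, row):
            triples.append((subject, f"{cls}#{field}", value))
    return triples
```

Calling `table_to_triples("Results", ["system", "f1"], [["A", 0.81], ["B", 0.77]])` yields one class triple, three triples per field, and three per record. The sketch still presumes that the table has already been recovered as a header plus typed rows, which is the step that needs markup or tooling.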

But to do all this you need to add extra information to the document.  Where 
is the tooling for that?  Do you want to make authors enter the markup by 
hand?  I think that some tooling help is needed here.

>>> 3. If querying a PDF was marginally simple, I would be demonstrating that
>>> using a SPARQL results URL in response to this post
>> I'm not saying that it is so simple.  You do have to find the metadata block
>> in the PDF and then look for the /Title, /Author, ... stuff.
> But it could be simple if PDF didn't have the issues I outlined in regards to
> extraction technology. Funnily enough, there's a massive opportunity for Adobe
> to solve this problem, especially as they've now ventured heavily into cloud
> enabling their technologies. If they provide APIs from the cloud, this problem
> could become much simpler to address in regards to productive solutions where
> PDFs become less of the data silos that they are today.
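
As a rough illustration of "find the metadata block and look for the /Title, /Author, ... stuff", here is a small Python sketch. It assumes the simplest case only: an uncompressed document information dictionary whose values are literal strings in parentheses. Real PDFs may keep this data in compressed streams, as hex or UTF-16 strings, or in XMP, so a proper PDF library would be needed in practice. The function name `extract_info` and the set of keys searched are choices made for this example.

```python
import re

# Matches entries such as "/Title (Some text)" in raw PDF bytes.
# Backslash-escaped characters are accepted inside the string;
# unescaped nested parentheses are not handled (a known limitation).
INFO_ENTRY = re.compile(
    rb"/(Title|Author|Subject|Keywords)\s*\(((?:\\.|[^\\()])*)\)"
)


def extract_info(pdf_bytes):
    """Return a dict of the Info entries found in uncompressed PDF bytes."""
    info = {}
    for key, raw in INFO_ENTRY.findall(pdf_bytes):
        # Undo the \( \) and \\ escapes used in PDF literal strings.
        value = re.sub(rb"\\([()\\])", rb"\1", raw)
        # Keep the first occurrence of each key.
        info.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return info
```

Given bytes containing `<< /Title (Example \(draft\)) /Author (A. Author) >>`, this returns a dictionary with the two entries unescaped. Turning that dictionary into RDF statements is then a separate, simple step.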

I believe that we have a demonstration from Norman Gray that RDF data can be 
entered in LaTeX, put in PDF, and then extracted quite easily.  The use of 
LaTeX, with its macro facilities, makes entering some information painless and 
makes entering other information easy.

[More metadata discussion removed.]

>>> We want to leverage the productivity and simplicity that AWWW brings to data
>>> representation, access, interaction, and integration.
>> Sure, but the additional costs, if any, on paper authors, reviewers, and
>> readers have to be considered.  If these costs are eliminated or at least
>> minimized then this good is much more likely to be realized.
> With some help from Adobe we can have the best of all worlds here. I am going
> to take a look at their latest cloud offerings and associated APIs.

It appears to me that no help is needed from Adobe.

If you want to have documents that are not only reasonable to produce, review, 
and read, but that also have useful RDF information embedded in them, then I 
suggest expanding on the approach just put together by Norman Gray.  His 
approach can be used with LaTeX and PDF, meaning that anyone can start using 
it and their papers would be acceptable at any conference or journal that 
wants PDF (albeit maybe with some fancy footwork for those venues that rerun 
authors' sources through LaTeX).  It should be easy to make this approach work 
with any system that produces HTML from LaTeX---all that is required is to 
augment the HTML generation tools to handle the data insertion primitives.

Received on Tuesday, 7 October 2014 01:59:04 UTC
