- From: Peter F. Patel-Schneider <pfpschneider@gmail.com>
- Date: Mon, 06 Oct 2014 18:58:28 -0700
- To: Kingsley Idehen <kidehen@openlinksw.com>, "semantic-web@w3.org" <semantic-web@w3.org>
On 10/06/2014 06:19 PM, Kingsley Idehen wrote:
> On 10/6/14 2:49 PM, Peter F. Patel-Schneider wrote:
>>
>> On 10/06/2014 11:03 AM, Kingsley Idehen wrote:

[Metadata discussion removed to concentrate on the other part of the discussion.]

>>> 1. The extractors are platform specific -- AWWW is about platform agnosticism
>>> (I don't want to mandate an OS for experiencing the power of Linked Open Data
>>> transformers / rdfizers)
>>
>> Well, the extractors would be specific to PDF, but that's hardly surprising,
>> I think.
>>
>>> 2. It isn't solely about metadata -- we also have raw data inside these
>>> documents confined to tables and paragraphs of sentences
>>
>> Well, sure, but is extracting information directly from the figures or
>> tables or text being considered here? I sure would like this to be
>> possible. How would it work in an HTML context?
>
> Each table is a Class.
> Each table record is an instance of the Class represented by the table.
> Each table field is a property of the Class represented by the table.
> Each table field value's data type can be used to discern the range of each
> Class property.
>
> Depending on what the sentences and paragraphs are about, you can make an RDF
> statement per sentence.

But to do all this you need to add extra information to the document. Where is
the tooling for that? Do you want authors to enter the markup by hand? I think
that some tooling help is needed here.

>>> 3. If querying a PDF were marginally simple, I would be demonstrating that
>>> using a SPARQL results URL in response to this post
>>
>> I'm not saying that it is so simple. You do have to find the metadata block
>> in the PDF and then look for the /Title, /Author, ... stuff.
>
> But it could be simple if PDF didn't have the issues I outlined in regards to
> extraction technology.
> Funnily enough, there's a massive opportunity for Adobe to solve this
> problem, especially as they've now ventured heavily into cloud-enabling
> their technologies. If they provide APIs from the cloud, this problem could
> become much simpler to address in regards to productive solutions where
> PDFs become less of the data silos that they are today.

I believe that we have a demonstration from Norman Gray that RDF data can be
entered in LaTeX, put in PDF, and then extracted quite easily. The use of
LaTeX, with its macro facilities, makes entering some information painless and
entering other information easy.

[More metadata discussion removed.]

>>> We want to leverage the productivity and simplicity that AWWW brings to
>>> data representation, access, interaction, and integration.
>>
>> Sure, but the additional costs, if any, on paper authors, reviewers, and
>> readers have to be considered. If these costs are eliminated, or at least
>> minimized, then this good is much more likely to be realized.
>
> With some help from Adobe we can have the best of all worlds here. I am
> going to take a look at their latest cloud offerings and associated APIs.

It appears to me that no help is needed from Adobe. If you want to have
documents that are not only reasonable to produce, review, and read, but that
also have useful RDF information embedded in them, then I suggest expanding on
the approach just put together by Norman Gray. His approach can be used with
LaTeX and PDF, meaning that anyone can start using it and their papers would
be acceptable at any conference or journal that wants PDF (albeit maybe with
some fancy footwork for those venues that rerun authors' sources through
LaTeX). It should be easy to make this approach work with any system that
produces HTML from LaTeX---all that is required is to augment the HTML
generation tools to handle the data insertion primitives.

peter
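[Editor's note: the table-to-RDF mapping Kingsley describes above (table as
Class, record as instance, field as property, field value type as range) can
be sketched in a few lines of plain Python. The `EX` namespace, the helper
names, and the datatype-guessing rule are all illustrative assumptions, not
part of any existing extractor.]

```python
EX = "http://example.org/"  # hypothetical namespace for generated terms

def infer_range(values):
    """Guess an XSD datatype for a column -- the range of its property.
    Assumption: integers vs. strings is enough for this illustration."""
    try:
        for v in values:
            int(v)
        return "xsd:integer"
    except ValueError:
        return "xsd:string"

def table_to_triples(table_name, header, rows):
    """Table -> Class; record -> instance; field -> property; value type -> range."""
    cls = EX + table_name
    triples = [(cls, "rdf:type", "rdfs:Class")]
    # Each field becomes a property whose domain is the table's Class.
    for i, col in enumerate(header):
        prop = EX + col
        triples.append((prop, "rdf:type", "rdf:Property"))
        triples.append((prop, "rdfs:domain", cls))
        triples.append((prop, "rdfs:range", infer_range([row[i] for row in rows])))
    # Each record becomes an instance of the Class, one triple per field value.
    for n, row in enumerate(rows):
        inst = EX + f"{table_name}/{n}"
        triples.append((inst, "rdf:type", cls))
        for col, val in zip(header, row):
            triples.append((inst, EX + col, val))
    return triples

triples = table_to_triples(
    "Employee",
    ["name", "age"],
    [["Alice", "34"], ["Bob", "29"]],
)
for t in triples:
    print(t)
```

Running this on the small `Employee` table yields one `rdfs:Class`, two
properties with inferred ranges, and one typed instance per row -- exactly
the shape of mapping described in the quoted text.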
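[Editor's note: the step Peter mentions -- "find the metadata block in the
PDF and then look for the /Title, /Author, ... stuff" -- can be illustrated
with a regex over an uncompressed document information dictionary. Real PDFs
often compress or encode this dictionary, so this is only a sketch for the
simplest case; the sample bytes and helper name are made up for the example.]

```python
import re

# A fragment resembling an uncompressed PDF document information dictionary:
pdf_bytes = b"""1 0 obj
<< /Title (Linked Data in PDF) /Author (K. Idehen) /CreationDate (D:20141006) >>
endobj"""

def extract_info(data):
    """Pull /Key (value) pairs out of an uncompressed Info dictionary.
    Does not handle compressed streams, escapes, or UTF-16 strings."""
    info = {}
    for key, value in re.findall(rb"/(\w+)\s*\(([^)]*)\)", data):
        info[key.decode()] = value.decode()
    return info

meta = extract_info(pdf_bytes)
print(meta)
```

The point of the sketch is Peter's: the metadata is findable, but only with
PDF-specific machinery, whereas the same facts in HTML or LaTeX markup are
reachable with generic tooling.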
Received on Tuesday, 7 October 2014 01:59:04 UTC