- From: Kingsley Idehen <kidehen@openlinksw.com>
- Date: Mon, 06 Oct 2014 21:19:56 -0400
- To: "semantic-web@w3.org" <semantic-web@w3.org>
- Message-ID: <54333FBC.2020303@openlinksw.com>
On 10/6/14 2:49 PM, Peter F. Patel-Schneider wrote:
>
> On 10/06/2014 11:03 AM, Kingsley Idehen wrote:
>> On 10/6/14 12:48 PM, Peter F. Patel-Schneider wrote:
>>> It's not hard to query PDFs with SPARQL. All you have to do is
>>> extract the metadata from the document and turn it into RDF, if
>>> needed. Lots of programs extract and display this metadata already.
>>
>> Peter,
>>
>> Having had 200+ {some-non-rdf-doc} to RDF document transformers built
>> under my direct guidance, there are issues with your claim above:
>
> Huh? Every single PDF reader that I use can extract the PDF metadata
> and display it.

Again, this isn't about metadata.

> The metadata that I see in PDF documents uses a core set of properties
> that are easy to transform into RDF.

Metadata isn't the issue at hand.

> Of course, this core set is very small (title, author, and a few other
> things) so you don't get all that much out of the core set.

See my comments above.

>> 1. The extractors are platform specific -- AWWW is about platform
>> agnosticism (I don't want to mandate an OS for experiencing the power
>> of Linked Open Data transformers / rdfizers)
>
> Well, the extractors would be specific to PDF, but that's hardly
> surprising, I think.
>
>> 2. It isn't solely about metadata -- we also have raw data inside
>> these documents confined to Tables, paragraphs of sentences
>
> Well, sure, but is extracting information directly from the figures or
> tables or text being considered here? I sure would like this to be
> possible. How would it work in an HTML context?

Each table is a Class.
Each table record is an instance of the Class represented by the table.
Each table field is a property of the Class represented by the table.
Each table field value's data type can be used to discern the range of
each Class property.

Depending on what the sentences and paragraphs are about, you can make
an RDF statement per sentence.

>> 3. If querying a PDF was marginally simple, I would be demonstrating
>> that using a SPARQL results URL in response to this post
>
> I'm not saying that it is so simple. You do have to find the metadata
> block in the PDF and then look for the /Title, /Author, ... stuff.

But it could be simple if PDF didn't have the issues I outlined in
regards to extraction technology. Funnily enough, there's a massive
opportunity for Adobe to solve this problem, especially as they've now
ventured heavily into cloud enabling their technologies. If they provide
APIs from the cloud, this problem could become much simpler to address
in regards to productive solutions where PDFs become less of the data
silos that they are today.

>> Possible != Simple and Productive.
>
> Yes, but there are lots of tools that display PDF metadata, so there
> are some who believe that the benefit is greater than the cost.

Metadata isn't the fundamental quest here.

>> We want to leverage the productivity and simplicity that AWWW brings
>> to data representation, access, interaction, and integration.
>
> Sure, but the additional costs, if any, on paper authors, reviewers,
> and readers have to be considered. If these costs are eliminated or
> at least minimized then this good is much more likely to be realized.

With some help from Adobe we can have the best of all worlds here. I am
going to take a look at their latest cloud offerings and associated APIs.

> peter

-- 
Regards,

Kingsley Idehen
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog 1: http://kidehen.blogspot.com
Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
Twitter Profile: https://twitter.com/kidehen
Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen
Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this
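
The table-to-RDF mapping described in the reply (table as Class, record as instance, field as property, value data type as range) can be sketched roughly as follows. This is only an illustration, not one of the transformers mentioned in the thread: the `http://example.org/doc/` namespace, the function names, and the choice to infer each range from the first record are all assumptions, and the output is plain N-Triples-style strings built with the Python standard library.

```python
from urllib.parse import quote

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
XSD = "http://www.w3.org/2001/XMLSchema#"
BASE = "http://example.org/doc/"  # hypothetical namespace for the document


def xsd_type(value):
    """Pick an XSD datatype IRI from the Python type of a cell value."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return XSD + "boolean"
    if isinstance(value, int):
        return XSD + "integer"
    if isinstance(value, float):
        return XSD + "double"
    return XSD + "string"


def literal(value):
    """Serialize a cell value as a typed literal."""
    text = str(value).lower() if isinstance(value, bool) else str(value)
    for raw, escaped in (("\\", "\\\\"), ('"', '\\"'), ("\n", "\\n"), ("\r", "\\r")):
        text = text.replace(raw, escaped)
    return f'"{text}"^^<{xsd_type(value)}>'


def table_to_triples(name, header, rows):
    """Map one table to triples: Class, properties, ranges, instances."""
    cls = f"<{BASE}{quote(name)}>"
    triples = [f"{cls} <{RDF}type> <{RDFS}Class> ."]

    # Each table field becomes a property whose domain is the table's Class.
    props = [f"<{BASE}{quote(name)}#{quote(col)}>" for col in header]
    for prop in props:
        triples.append(f"{prop} <{RDF}type> <{RDF}Property> .")
        triples.append(f"{prop} <{RDFS}domain> {cls} .")

    # Assumption: the first record's value types decide each property's range.
    if rows:
        for prop, value in zip(props, rows[0]):
            triples.append(f"{prop} <{RDFS}range> <{xsd_type(value)}> .")

    # Each table record becomes an instance of the Class.
    for i, row in enumerate(rows, start=1):
        subject = f"<{BASE}{quote(name)}/row{i}>"
        triples.append(f"{subject} <{RDF}type> {cls} .")
        for prop, value in zip(props, row):
            triples.append(f"{subject} {prop} {literal(value)} .")
    return triples


if __name__ == "__main__":
    # Hypothetical table, as might be lifted from a paper's results section.
    for triple in table_to_triples(
        "Results", ["system", "score"], [["A", 0.91], ["B", 0.87]]
    ):
        print(triple)
```

The resulting triples could be loaded into any SPARQL-capable store. Getting the table out of a PDF in the first place is the hard part discussed in the thread, and this sketch does not attempt it.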
Attachments
- application/pkcs7-signature attachment: S/MIME Cryptographic Signature
Received on Tuesday, 7 October 2014 01:20:18 UTC