
RE: Introducing Tabula (PDF to CSV conversion tool)

From: Owen Ambur <Owen.Ambur@verizon.net>
Date: Thu, 04 Apr 2013 12:03:50 -0400
To: "'Gannon Dick'" <gannon_dick@yahoo.com>, <paoladimaio10@googlemail.com>, "'John Erickson'" <olyerickson@gmail.com>
Cc: "'Hatem Ben Yacoub'" <hatemben@gmail.com>, "'eGov IG (Public)'" <public-egov-ig@w3.org>
Message-id: <002b01ce314e$00085800$00190800$@Ambur@verizon.net>
Gannon, yes, of course, I am interested in seeing strategic and performance
plans and reports rendered in open, standard, machine-readable StratML
format whenever possible.  It would be much better if the original,
authoritative sources were in StratML (XML) format so that PDF and other
renditions could automatically be rendered therefrom.  However, to the
degree that may not occur, it will be good to see how far tools like this
can take us in automating an otherwise backward process.




From: Gannon Dick [mailto:gannon_dick@yahoo.com] 
Sent: Thursday, April 04, 2013 10:20 AM
To: paoladimaio10@googlemail.com; John Erickson
Cc: Hatem Ben Yacoub; eGov IG (Public)
Subject: Re: Introducing Tabula (PDF to CSV conversion tool)


A StratML wrapper would make a lot of sense too, I think.  The XForms
construction methods are already largely in place.  CSV imports could be
aggregated, marked up, and classified in a more targeted way without
precluding conversion to RDF at a later time.  The Journal Publishing Suite
(NIH) as well as various LOC citation schemes (MADS, MODS, etc.) use this
strategy.  Owen?





From: Paola Di Maio <paola.dimaio@gmail.com>
To: John Erickson <olyerickson@gmail.com> 
Cc: Hatem Ben Yacoub <hatemben@gmail.com>; eGov IG (Public)
Sent: Thursday, April 4, 2013 5:54 AM
Subject: Re: Introducing Tabula (PDF to CSV conversion tool)

Indeed, it looks like a good balance of simplicity and useful functionality. Nice!

It reminds me of the 'tabulator' concept, a bit more trimmed down.

I wonder why there is no conversion to RDF. Could we not also have CSV to RDF?
Would that not make sense?
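
As a rough illustration of what a CSV-to-RDF step might look like, here is a minimal sketch that is not part of Tabula. It assumes one RDF subject per row and one predicate per column header, and writes N-Triples with plain string literals. The function names and the `http://example.org/table/` base IRI are hypothetical, chosen only for the example.

```python
import csv
import io
from urllib.parse import quote


def escape_literal(text):
    """Escape the characters that N-Triples does not allow raw in a literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def csv_to_ntriples(csv_text, base_iri):
    """Emit one triple per non-empty cell: row IRI, column IRI, cell value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row_number, row in enumerate(reader, start=1):
        subject = f"<{base_iri}row/{row_number}>"
        for column, value in row.items():
            # DictReader uses a None key for surplus cells and a None value
            # for missing ones; both are skipped, as are empty cells.
            if column is None or value is None or value == "":
                continue
            predicate = f"<{base_iri}column/{quote(column.strip(), safe='')}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "agency,budget\nEPA,100\nDOE,\n"
    print(csv_to_ntriples(sample, "http://example.org/table/"))
```

A real workflow would likely map columns to an existing vocabulary and add datatypes; the sketch only shows that the mechanical part of the conversion is small once the CSV has reliable headers.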







On Thu, Apr 4, 2013 at 3:48 PM, John Erickson <olyerickson@gmail.com> wrote:

Hatem, this is an extremely interesting tool! Note to everyone: even
though Mozilla was one of the supporters, it works in all browsers.
Or, at least also Chrome ;)

A couple suggestions:
1. In addition to enabling the user to download and copy the selected
table segment, please provide a way (or at least start thinking about
a way) for there to be a permanent/re-usable/reliable URL to the
selected content. The reason is, some of us have RDF conversion
workflows that document the provenance, starting with the download URL
of the source CSV.
2. I can understand how headers present a problem, but it would be
extremely useful to have them working! Maybe you can extract them
first, then associate them with selected table segments on a follow-up
pass. But you'll need to have created a URL for the selected header
cells ;) NOTE: One compromise is to only do COMPLETE tables if the
headers are to be included.
3. Related to the above, you really need to encode provenance (see W3C
PROV) for this to really be useful to people using extracted tabular
data "in anger."
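
To make suggestions 1 and 3 concrete, here is a minimal sketch under stated assumptions; it does not describe how Tabula works. It assumes a stable identifier can be derived by hashing the source PDF URL, the page number, and the selection rectangle, and that provenance is recorded with a handful of W3C PROV-O terms (`prov:Entity`, `prov:Activity`, `prov:used`, `prov:wasGeneratedBy`, `prov:wasDerivedFrom`, `prov:generatedAtTime`). The function names, the base IRI, and the `#extraction` fragment are hypothetical.

```python
import hashlib
from datetime import datetime, timezone

PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def selection_iri(base_iri, pdf_url, page, rect):
    """Derive a repeatable IRI from the source PDF, page, and selection box.

    The same inputs always hash to the same IRI, so a later workflow can
    cite the extracted CSV by a URL that does not change between runs.
    """
    key = f"{pdf_url}|{page}|{','.join(str(n) for n in rect)}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return f"{base_iri}{digest}.csv"


def provenance_ntriples(csv_iri, pdf_url, extracted_at):
    """Describe the extraction as N-Triples using PROV-O terms.

    `extracted_at` should be a timezone-aware datetime; it is written in UTC.
    """
    activity = f"{csv_iri}#extraction"  # hypothetical IRI for the extraction run
    stamp = extracted_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return "\n".join([
        f"<{csv_iri}> <{RDF_TYPE}> <{PROV}Entity> .",
        f"<{pdf_url}> <{RDF_TYPE}> <{PROV}Entity> .",
        f"<{activity}> <{RDF_TYPE}> <{PROV}Activity> .",
        f"<{activity}> <{PROV}used> <{pdf_url}> .",
        f"<{csv_iri}> <{PROV}wasGeneratedBy> <{activity}> .",
        f"<{csv_iri}> <{PROV}wasDerivedFrom> <{pdf_url}> .",
        f'<{csv_iri}> <{PROV}generatedAtTime> "{stamp}"^^<{XSD_DATETIME}> .',
    ])


if __name__ == "__main__":
    pdf = "http://example.org/report.pdf"
    iri = selection_iri("http://example.org/extract/", pdf, 3, (10, 20, 300, 400))
    print(provenance_ntriples(iri, pdf, datetime.now(timezone.utc)))
```

Hashing the selection parameters is only one way to get a reusable URL; a service could equally assign and store identifiers. The point is that the CSV's URL and its link back to the source PDF are both available to a downstream RDF conversion.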

Thanks again for this good work!


On Wed, Apr 3, 2013 at 4:25 PM, Hatem Ben Yacoub <hatemben@gmail.com> wrote:
> Hi all,
> One of the problems that many Open Government data projects face is
> the availability of tons of old documents in PDF format, which is not
> an open and reusable format. Today, Mozilla announced Tabula, a new tool
> to help liberate tables trapped in PDFs.
> The online demo is amazing: http://tabula.nerdpower.org/
> To use it, simply make a rectangular selection over tables on the PDF
> pages. (Avoid headers.)
> Sources: https://github.com/jazzido/tabula
> Official announcement:
> http://source.mozillaopennews.org/en-US/articles/introducing-tabula/
> Best,
> --
> Eng. Hatem Ben Yacoub
> ICT & eGOV Consultant
> http://hbyconsultancy.com
> http://twitter.com/hatem
> http://facebook.com/hatemben

John S. Erickson, Ph.D.
Director, Web Science Operations
Tetherless World Constellation (RPI)
<http://tw.rpi.edu> <olyerickson@gmail.com>
Twitter & Skype: olyerickson


Received on Thursday, 4 April 2013 16:05:05 UTC
