Re: Introducing Tabula (PDF to CSV conversion tool)

Hatem, this is an extremely interesting tool! Note to everyone: even
though Mozilla was one of the supporters, it works in all browsers.
Or at least it also works in Chrome ;)

A couple of suggestions:
1. In addition to enabling the user to download and copy the selected
table segment, please provide a way (or at least start thinking about
a way) for there to be a permanent/re-usable/reliable URL to the
selected content. The reason is, some of us have RDF conversion
workflows that document the provenance, starting with the download URL
of the source CSV.
2. I can understand how headers present a problem, but it would be
extremely useful to have them working! Maybe you can extract them
first, then associate them with selected table segments on a follow-up
pass. But you'll need to have created a URL for the selected header
cells ;) NOTE: One compromise is to only do COMPLETE tables if the
headers are to be included.
3. Related to the above, you need to encode provenance (see W3C
PROV) for this to be really useful to people using extracted tabular
data "in anger."
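
To make suggestions 1 and 3 a bit more concrete, here is a rough Python
sketch of what I have in mind. Everything in it is an assumption on my
part: the base URL, the parameter names (src, page, top, left, bottom,
right) and the function names are hypothetical and are not part of
Tabula. The only borrowed term is prov:wasDerivedFrom, which comes from
W3C PROV. The idea is simply that the same source PDF, page and
selection rectangle should always give the same URL, and that the URL
should carry a pointer back to the PDF it came from.

```python
import hashlib
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration; not a real Tabula URL.
BASE = "http://example.org/tabula/selection"


def selection_url(pdf_url, page, box):
    """Build a deterministic URL for one table selection.

    box is (top, left, bottom, right); the units and parameter
    names are assumptions made for this sketch.
    """
    top, left, bottom, right = box
    query = urlencode([
        ("src", pdf_url),
        ("page", page),
        ("top", top),
        ("left", left),
        ("bottom", bottom),
        ("right", right),
    ])
    # A short digest of the query gives a compact, repeatable identifier.
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()[:12]
    return f"{BASE}/{digest}.csv?{query}"


def provenance(pdf_url, page, box):
    """Return a minimal PROV-style record as a plain dict."""
    return {
        "@id": selection_url(pdf_url, page, box),
        "prov:wasDerivedFrom": pdf_url,
        "page": page,
        "box": list(box),
    }
```

A downstream RDF conversion workflow could then record the value of
"@id" as the download URL of the CSV, and follow prov:wasDerivedFrom
back to the original PDF.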

Thanks again for this good work!

John

On Wed, Apr 3, 2013 at 4:25 PM, Hatem Ben Yacoub <hatemben@gmail.com> wrote:
> Hi all,
>
> One of the problems that many Open Government data projects face is
> the availability of tons of old documents in PDF format, which is not
> an open and reusable format. Today, Mozilla announced Tabula, a new tool
> to help liberate tables trapped in PDFs.
>
> The online demo is amazing : http://tabula.nerdpower.org/
>
> To use it, simply make a rectangular selection over the tables on the
> PDF pages (avoid headers).
>
> Source code: https://github.com/jazzido/tabula
>
> Official announcement :
> http://source.mozillaopennews.org/en-US/articles/introducing-tabula/
>
>
> Best,
> --
> Eng. Hatem Ben Yacoub
> ICT & eGOV Consultant
> http://hbyconsultancy.com
>
> http://twitter.com/hatem
> http://facebook.com/hatemben
>



-- 
John S. Erickson, Ph.D.
Director, Web Science Operations
Tetherless World Constellation (RPI)
<http://tw.rpi.edu> <olyerickson@gmail.com>
Twitter & Skype: olyerickson

Received on Thursday, 4 April 2013 10:18:47 UTC