Re: Introducing Tabula (PDF to CSV conversion tool) from Gannon Dick on 2013-04-04 (public-egov-ig@w3.org from April 2013)

From: Gannon Dick <gannon_dick@yahoo.com>
Date: Thu, 4 Apr 2013 07:19:59 -0700 (PDT)
To: "paoladimaio10@googlemail.com" <paoladimaio10@googlemail.com>, John Erickson <olyerickson@gmail.com>
Cc: Hatem Ben Yacoub <hatemben@gmail.com>, "eGov IG \(Public\)" <public-egov-ig@w3.org>
Message-ID: <1365085199.31469.YahooMailNeo@web122903.mail.ne1.yahoo.com>

A StratML wrapper would make a lot of sense too, I think. The XForms construction methods are already largely in place. CSV imports could be aggregated, marked up, and classified in a more targeted way, without precluding conversion to RDF at a later time. The Journal Publishing Suite (NIH) as well as various LOC citation schemes (MADS, MODS, etc.) use this strategy. Owen?

 




________________________________
From: Paola Di Maio <paola.dimaio@gmail.com>
To: John Erickson <olyerickson@gmail.com> 
Cc: Hatem Ben Yacoub <hatemben@gmail.com>; eGov IG (Public) <public-egov-ig@w3.org> 
Sent: Thursday, April 4, 2013 5:54 AM
Subject: Re: Introducing Tabula (PDF to CSV conversion tool)
 

Indeed, it looks like a good balance of simplicity and useful functionality. Nice.

It reminds me of the 'tabulator' concept, a bit more trimmed down.

I wonder why there is no conversion to RDF. Could we not also have a CSV-to-RDF button?
Would that not make sense?
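
As a rough illustration of what such a CSV-to-RDF step might do (this is not part of Tabula, and nothing in this thread specifies it), here is a minimal sketch in Python using only the standard library. It turns each CSV row into N-Triples and attaches a W3C PROV `wasDerivedFrom` link back to the source CSV URL. The `http://example.org/` base, the `row/` and `col/` URI patterns, and the function names are all assumptions made up for the example.

```python
import csv
import io
from urllib.parse import quote

# Real W3C PROV-O property; everything under example.org is hypothetical.
PROV_DERIVED = "http://www.w3.org/ns/prov#wasDerivedFrom"


def escape_literal(value):
    """Escape backslashes, quotes, and line breaks for an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def csv_to_ntriples(csv_text, source_url, base="http://example.org/"):
    """Return a list of N-Triples lines, one subject per CSV row.

    Each row also gets a prov:wasDerivedFrom triple pointing at source_url,
    so the origin of the data travels with the converted output.
    """
    triples = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for i, row in enumerate(reader, start=1):
        subject = f"<{base}row/{i}>"
        triples.append(f"{subject} <{PROV_DERIVED}> <{source_url}> .")
        for header, value in row.items():
            # Skip ragged rows (extra or missing cells) and empty cells.
            if header is None or value is None or value == "":
                continue
            predicate = f"<{base}col/{quote(header.strip(), safe='')}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


if __name__ == "__main__":
    sample = "city,count\nExampleville,42\n"  # made-up data
    for line in csv_to_ntriples(sample, "http://example.org/source.csv"):
        print(line)
```

All values are emitted as plain string literals; choosing datatypes and a real vocabulary for the column predicates is left open, since that depends on the dataset.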




PDM




On Thu, Apr 4, 2013 at 3:48 PM, John Erickson <olyerickson@gmail.com> wrote:

>Hatem, this is an extremely interesting tool! Note to everyone: even
>though Mozilla was one of the supporters, it works in all browsers.
>Or, at least also Chrome ;)
>
>A couple suggestions:
>1. In addition to enabling the user to download and copy the selected
>table segment, please provide a way (or at least start thinking about
>a way) for there to be a permanent/re-usable/reliable URL to the
>selected content. The reason is, some of us have RDF conversion
>workflows that document the provenance, starting with the download URL
>of the source CSV.
>2. I can understand how headers present a problem, but it would be
>extremely useful to have them working! Maybe you can extract them
>first, then associate them with selected table segments on a follow-up
>pass. But you'll need to have created a URL for the selected header
>cells ;) NOTE: One compromise is to only do COMPLETE tables if the
>headers are to be included.
>3. Related to the above, you really need to encode provenance (see W3C
>PROV) for this to really be useful to people using extracted tabular
>data "in anger."
>
>Thanks again for this good work!
>
>John
>
>
>On Wed, Apr 3, 2013 at 4:25 PM, Hatem Ben Yacoub <hatemben@gmail.com> wrote:
>> Hi all,
>>
>> One of the problems that many Open Government data projects face is
>> the availability of tons of old documents in PDF format, which is not
>> an open and reusable format. Today, Mozilla announced Tabula, a new tool
>> to help liberate tables trapped in PDFs.
>>
>> The online demo is amazing : http://tabula.nerdpower.org/
>>
>> To use it simply make a rectangular selection over tables on the PDF
>> pages. (Avoid headers)
>>
>> Sources https://github.com/jazzido/tabula
>>
>> Official announcement :
>> http://source.mozillaopennews.org/en-US/articles/introducing-tabula/
>>
>>
>> Best,
>> --
>> Eng. Hatem Ben Yacoub
>> ICT & eGOV Consultant
>> http://hbyconsultancy.com
>>
>> http://twitter.com/hatem
>> http://facebook.com/hatemben
>>
>
>
>
>--
>John S. Erickson, Ph.D.
>Director, Web Science Operations
>Tetherless World Constellation (RPI)
><http://tw.rpi.edu> <olyerickson@gmail.com>
>Twitter & Skype: olyerickson
>
>

Received on Thursday, 4 April 2013 14:20:39 UTC