
Re: Introductions

From: James McKinney <james@opennorth.ca>
Date: Tue, 2 Jul 2013 11:26:12 -0400
Cc: public-opengov@w3.org
Message-Id: <08772CDC-69F8-424F-9B6E-82CD7ADDA504@opennorth.ca>
To: Eunjeong Lucy Park <lucypark@popong.com>

Hi Lucy,

Thanks for joining this group!

For handling PDFs, the Sunlight Labs list likely has some PDF-parsing veterans who can give you more advice: https://groups.google.com/forum/#!forum/sunlightlabs In the meantime, I can share a bit of advice from my own experience.

- If the PDF pages are images (for example, if it is a scanned document), then you will have a much harder time, because you will need to OCR the images first. I'm not sure how good OCR is for non-Latin characters, though. Tesseract is one of the best free OCR tools.

- If the PDF pages are not images, I found it easiest to convert the PDF to text, using pdftotext: http://www.foolabs.com/xpdf/download.html I would then use regular expressions to parse the resulting text (not the best approach, but it works most of the time).

- Instead of using regular expressions, you can use a library that reads the instructions in the PDF file. A PDF file is basically made up of instructions like "print this text", "move the cursor to the footer", "print this text", etc. If those instructions print text in a predictable order, this strategy may work better than regular expressions. By using a library to read the PDF instructions, you can also more easily pick out parts of the document with bold or large text, like headings. Most programming languages have at least one PDF reader library.
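
To make the pdftotext-plus-regular-expressions approach concrete, here is a minimal sketch in Python. It assumes the pdftotext command-line tool is installed and on your PATH. The "Bill No. 1234: Title" line format and the function names are hypothetical, made up for illustration; you would need to adapt the pattern to whatever your documents actually look like.

```python
import re
import subprocess

# Hypothetical line format, for illustration only: "Bill No. 1234: Some title".
# Leading spaces or tabs are allowed because "-layout" output is often indented.
BILL_LINE = re.compile(
    r"^[ \t]*Bill No\. (?P<number>\d+): (?P<title>.+)$",
    re.MULTILINE,
)


def pdf_to_text(pdf_path):
    """Run the pdftotext command-line tool and return the extracted text."""
    # "-layout" tries to preserve the physical layout of the page,
    # "-enc UTF-8" sets the output encoding, and "-" writes to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8")


def parse_bills(text):
    """Return (number, title) pairs for every line matching BILL_LINE."""
    return [
        (int(match.group("number")), match.group("title").strip())
        for match in BILL_LINE.finditer(text)
    ]
```

You would call it as parse_bills(pdf_to_text("some-file.pdf")), where the file name is a placeholder. Keeping the extraction and the parsing in separate functions lets you test the regular expressions against saved text output without re-running pdftotext each time.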

If you have any feedback on the Popolo specification, please share your comments with the list. We are eager to improve it and clarify it.

Best,

James

On 2013-07-01, at 11:54 AM, Eunjeong Lucy Park wrote:

> Dear All,
> 
> 
> Hi, I'm Lucy Park, a data analyst at Team POPONG (http://en.popong.com) -- a nonprofit, nonpartisan group in Seoul, which aims to provide Korean legislative data in open formats.
> 
> During the past few months, we've collected data from various government sources (http://en.popong.com/sources), and currently hold data on approx. 12,000 politicians (election candidates from the past 60 years), 46,000 bill texts from the past 20 years, and many more PDF documents (that should probably be OCR-ed).
> We had several difficulties on the way, including:
> 
> 1. Irregular structures and dispersion of data across government websites.
> 2. Machine *un*readable formats: PDF, HWP (a format created by software named "Hangul", which is extensively used by the government), ...
> 3. Different people with the same names: Most Korean names consist of only three syllables, and share family names.
> 4. Multilingual texts: Korean bill texts are a mixture of Korean, English and Chinese characters.
> 5. Encodings: Encodings should be detected and converted to Unicode because otherwise Korean characters cannot be read in many cases.
> 6. Internationalization (i18n): Korean is unreadable to many people outside the country.
> 
> We've tackled or are used to many of these problems (1, 3, 4, 5), but still are having a hard time with some others (6, and especially 2).
> My team would like to communicate with other organizations and individuals regarding such topics at any time.
> 
> We also launched a website ten days ago, named "Pokr" (http://en.pokr.kr/), containing information based on the data above. (Sources: http://github.com/teampopong/pokr)
> While we continuously work on improving the service, we would also like to provide the raw data to the public.
> 
> This is why we are especially interested in global standards for opening government data.
> My team was pointed to Popolo and this mailing list during a chat with Sunlight Foundation, and we were soon awed by this project.
> We hope to exchange many experiences with you, and are interested in helping to build a robust, global government data specification.
> 
> 
> Thanks,
> Lucy Park
> 


Received on Tuesday, 2 July 2013 15:26:47 UTC
