RE: good data Re: [EPFSUG] How to make better diffs and auto-generate the sacred 'four-column working document'?

For further context: Stef, friends and colleagues have sent an Open Letter to Klaus Welle, the Secretary General of the EP:

http://europarl.me/open%20letter%20for%20open%20data%20Welle

Maybe it is time for another.

//Erik
Open Letter for Open Data

Dear Mr Welle,

On behalf of all participants of the EP HACKATHON 2014<http://europarl.me>, I would first of all like to thank you for your email of 21 Jan 2014<http://europarl.me/A2014_247-SG-EN> regarding the functionality of Parltrack and how it makes available data on the work of the European Parliament.

We agree with you that this data is produced on a daily basis by the Parliament itself as a consequence of the work of its committees, its administration and secretariats, and perhaps most importantly, of its voting in plenary.

This rich set of data is of great value, in particular when we analyse, visualise, correlate and systematise it. With modern statistical techniques and visualisation tools we make it possible to "see" democracy at work in ways one couldn't imagine a few years ago.

This "visibility" brings us to the question of the quality of the software, the formats and the standards the Parliament is currently using. As far as we understand, you are personally responsible for ensuring that the Parliament's activities are conducted with the utmost transparency<http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+RULES-EP+20130701+RULE-103+DOC+XML+V0//EN&language=EN&navigationBar=YES>. Therefore we ask you to support our efforts to help the Parliament meet its own statutory requirements in this regard.

In particular, we would be grateful if you could provide full, accurate, timely, openly licensed, open standards-based, machine-readable access to CODICT, ACTES and ITER, and a list of all other databases and their interfaces which you plan to open for public query.

It would already be very helpful if you could, as a first basic step, make data currently locked in PDFs also available in raw text format, or (preferably) XML/Akoma Ntoso<http://www.akomantoso.org>.

We would be delighted to help and advise you further on how to find feasible and realistic solutions to allow the public to profit from the Parliament's data.

We remain at your disposal and thank you for your efforts to make the European Parliament a front-runner in transparency among the EU's institutions!



EP HACKATHON 2014 organizers:

  *   Stefan Marsiske - Parltrack<http://parltrack.euwiki.org/>
  *   Xavier Dutoit - Tech To The People<http://techtothepeople.com/>

EP HACKATHON 2014 participants and supporters:

  *   Joel Purra - DFRI.se<https://dfri.se/>, AT4AM.eu<https://at4am.eu/>
  *   Niels Erik Kaaber Rasmussen - api.epdb.eu<http://api.epdb.eu/>
  *   Matilde Pinamonti
  *   Alexander Mikhailian - EurActiv<http://EurActiv.com/>
  *   Esther Durin
  *   Daniel Lentfer - Democracy International<https://www.democracy-international.org/>
  *   Michal Skop - KohoVolit.eu<http://KohoVolit.eu/>
  *   Martin Virtel - lobbyplag.eu<http://lobbyplag.eu/>, opengov.cat<http://opengov.cat/>
  *   Matti Schneider
  *   Laetitia Veriter
  *   Christian Staat - Université libre de Bruxelles<http://www.ulb.ac.be/>
  *   Adrián Blanco
  *   Juan Elosua
  *   Thomas Bouchet - La Quadrature du Net<https://laquadrature.net/>
  *   Joachim Gola - 4=1 GmbH Hamburg, eu-parlameter.zdf.de<http://eu-parlameter.zdf.de>
  *   Chiara Girardelli
  *   André Rebentisch
  *   Olivier Hoedeman - Corporate Europe Observatory<http://corporateeurope.org/> (CEO)
  *   Mauricio Nascimento - FSFE<https://fsfe.org/> Fellowship Coordinator - BE
  *   Joris Vanhove
  *   Karsten Gerloff - President, Free Software Foundation Europe<https://fsfe.org/>
  *   Geraldine Nethercott - Access Info Europe<http://www.access-info.org/>
  *   Paul Roeland
  *   Thomas Tursics
  *   Friedrich Lindenberg - Open Knowledge Foundation Deutschland e.V.<http://okfn.de/>

Notes:

  *   We're writing a reply to this letter: 2014-01-21 A(2014)247-SG-EN Machine-readable data request<http://europarl.me/A2014_247-SG-EN>
  *   This was the original question: 2013-10-30 machine readable vote-results<http://europarl.me/2013-10-30_mail_Machine_readable_vote-results>


________________________________________
From: epfsug-request@epfsug.eu [epfsug-request@epfsug.eu] on behalf of stef [s@ctrlc.hu]
Sent: Saturday 8 November 2014 21:45
To: epfsug@epfsug.eu
Subject: Re: good data Re: [EPFSUG] How to make better diffs and auto-generate the sacred 'four-column working document'?

On Sat, Nov 08, 2014 at 07:51:59PM +0100, Andreas Kuckartz wrote:
> The automatic generation of the public datasets can and needs to be
> improved so that basic links are contained in them. That only requires
> to use the data already contained in the internal systems.

1. the reuse of the currently published EP data is difficult:
1.1 the data comes in all kinds of formats: pdfs, word docs, xml and recently also
json.
1.2 the data is full of typos and other human errors.
1.3 the schema of the data changes all the time.
1.4 some of the data is bad on purpose (e.g. the rapporteur does not publish
things until the last minute)
1.5 did i mention the fun that 28 different cultures provide with the diversity of
names they allow? (there was a mep referred to only as "the duke of foo" in
official records; can you tell which part is the given name and which the
family name?)
1.6 downloading all the data is very resource-intensive - it takes a few hours a
day on a quite powerful and well-connected host
1.7 you don't want to DoS the EP site, and sadly some parts of it do actually
crash when you scrape them: the java process behind them crashes and you get
503s until it is restarted... (see the sketch after this list)
1.8 the EP is a system run by and for humans.
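
to make 1.7 concrete: a minimal python sketch of the kind of backoff loop a
scraper needs when the backend answers 503. the url, the user agent and the
delays below are made-up assumptions for illustration, not how parltrack
actually does it.

    import time
    import requests  # third-party http client, assumed to be installed

    # hypothetical document url, purely for illustration
    URL = "http://www.europarl.europa.eu/some/page"

    def fetch_politely(url, retries=5, base_delay=30):
        """fetch a page, backing off whenever the server answers 503."""
        for attempt in range(retries):
            resp = requests.get(url, headers={"User-Agent": "research-scraper"})
            if resp.status_code == 503:
                # the backend sometimes crashes and returns 503 until it is
                # restarted, so wait longer each time instead of hammering it
                time.sleep(base_delay * (attempt + 1))
                continue
            resp.raise_for_status()
            return resp.text
        raise RuntimeError("gave up after repeated 503 responses")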

2. there are myriad needs for different formats, from csv and excel to
json and rdf. everyone seems to think his or her use case is the one that
everyone else will see as the ultimate one.
2.1 different data processing/mining/presentation methods require different
formats to operate more efficiently
2.2 it's awesome that there is a presentation of the data available in rdf
2.3 if you want to compute a deviation index, then instead of a format that is
nice for traversing a graph, you want a format that lends itself to
statistical analysis.
2.4 if you want to preserve the data for historical purposes over centuries,
what is the best format?
2.5 so we actually have some kind of imaginary stack:
in) raw data from various sources in various formats
out) various use cases for presenting/processing the data in all kinds of formats.
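
to illustrate the in/out fan-out: one lightweight intermediate record can be
turned into a csv row (for statistics) or an rdf-flavoured triple (for graph
work) with a few lines of python. the record shape and the field names are
invented here, not the actual parltrack schema.

    import csv
    import io
    import json

    # a hypothetical vote record in a lightweight intermediate json format
    record = json.loads('{"mep": "Jane Doe", "dossier": "2013/0000(COD)", "vote": "for"}')

    # out 1: a csv row, handy for statistical tools
    buf = io.StringIO()
    csv.writer(buf).writerow([record["mep"], record["dossier"], record["vote"]])
    print(buf.getvalue().strip())

    # out 2: an rdf-flavoured triple (just the shape, not valid turtle),
    # handy for graph traversal / LoD-style use
    print("<mep:%s> <castVote:%s> <dossier:%s> ." % (
        record["mep"].replace(" ", "_"), record["vote"], record["dossier"]))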

prose:
if you want to connect in to out, that's a bit of a combinatorial explosion.
however, having some lightweight intermediate format until the input is cleaned
up makes a lot of sense economically. if you compare format overhead, parsing
overhead and memory consumption, anything xml-based is disqualified at this
amount of data. parltrack does dump its data in a serially readable json
format and also as native mongodb dumps; both can be readily read, either for
post-processing or directly in a db, so you (as a dev/provider) can focus on
the presentation of that data (even as rdf). this is also why i like the
french senate with their postgres dumps: you get the data and can focus on
your presentation/processing, instead of having to build a full ETL pipeline
before reading the data. of course we could go the xml way, but that would mean
only a few people with sophisticated tools and huge hardware would be able to
play with the data; i rather prefer a simple shell script for some quick
transformations, and i think this would also fit the spirit of EPFSUG better.
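
for example, reading such a serially readable dump is something a short script
(or a shell one-liner) can do without heavy tooling. a minimal sketch, assuming
one json object per line; the file name and field names are assumptions, not
the real dump layout.

    import json

    # minimal sketch: stream a line-per-record json dump and filter it,
    # without ever holding the whole dataset in memory
    with open("ep_dossiers.json") as dump:
        for line in dump:
            record = json.loads(line)
            # hypothetical field names, only to show a quick transformation
            if record.get("committee") == "ITRE":
                print(record.get("title", ""))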

3. parltrack currently publishes the data in a unified json format that
you can load into whatever format you need to present the underlying data.
3.1 parltrack already links up lots of the various databases, based on EP
internal IDs and heuristics (which do fail sometimes, by their and the data's
very nature)
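
to give an idea what such a heuristic looks like (a generic sketch, not
parltrack's actual matching code): normalise names before comparing them, and
accept that this will misfire on unusual ones.

    import unicodedata

    def normalise(name):
        """strip accents, casefold and collapse whitespace so that different
        spellings of the same person have a chance of matching."""
        decomposed = unicodedata.normalize("NFKD", name)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        return " ".join(stripped.casefold().split())

    # two invented spellings of the same (fictional) mep in two databases
    print(normalise("Jàne  DOE") == normalise("Jane Doe"))  # True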

4. there is lots of data that is as yet uneconomic to get (perhaps only through
error-prone scraping)
4.1 there's also lots of data that is as yet completely unavailable (e.g. the
inputs to at4am)
4.2 parltrack will be happy when it finally becomes obsolete because the EP
publishes all the data in a readily reusable format.
4.3 until then parltrack is happy to accept patches for improved data quality
and new data sources.
4.4 also until then parltrack is happy when people reuse the parltrack data
4.5 furthermore parltrack is happy to reuse other (so far non-existent) reliable,
fresh and free datasets from other providers. if these come in some bloated or
less useful format, the ensuing happiness will be inversely proportional, as the
cost of reusing this data approaches that of scraping it again from the EP.
4.6 parltrack has no ambitions to be a contractor of the EP; parltrack wants to be
replaced by it (and the EP web interface is pretty close to achieving that for
certain datasets, but the publication of raw, fresh, open and reusable data
still serves as parltrack's reason for existence)

5. LoD provides no added value, only costs, for erik's wishlist items
that require diffing of documents (the original topic of this thread) or
statistical analysis to identify anomalies and predict trends.
5.1 the added value of LoD is actually the fact that someone links up two or
more datasets. the trivial - and some not so trivial - cases are already
handled by the parltrack data.
5.2 if the linking up is done automatically, it will - due to the bad quality of
the available EP data - either contain errors or require constant human
verification of the automatic linking. thus useful LoD is expensive, unless you
have students and interns, but that is not very sustainable.
5.3 for cases like "agriculture in all carried amendments" LoD is not
necessary, but it can add value when displaying the results of such a query.
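
on the diffing side (the original topic of this thread), even the python
standard library gives a first approximation. a minimal sketch with two
invented text versions; real amendment diffing has to work per word and per
column, but the principle is the same.

    import difflib

    # two invented versions of an amendment paragraph
    old = ["The Commission shall publish the data annually."]
    new = ["The Commission shall publish the data in a machine-readable format every month."]

    for line in difflib.unified_diff(old, new, fromfile="committee text",
                                     tofile="amendment", lineterm=""):
        print(line)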

is there anyone on this list who wants to take up one of the items from erik's
wishlist? i'm happy to provide guidance, much more so than to keep participating
in threads like this.

--
otr fp: https://www.ctrlc.hu/~stef/otr.txt

Received on Tuesday, 11 November 2014 00:01:45 UTC