Re: Defining "Open" Data (was RE: no F2F3 in 2009 -- Re: Agenda, eGov IG Call, 11 Nov 2009) from Chris Beer on 2009-11-24 (public-egov-ig@w3.org from November 2009)

From: Chris Beer <chris-beer@grapevine.net.au>
Date: Tue, 24 Nov 2009 20:12:38 +1100
To: Brian Gryth <briangryth@gmail.com>
CC: eGovIG IG <public-egov-ig@w3.org>, Joe Carmel <joe.carmel@comcast.net>, Daniel Dietrich <daniel@so36.net>, Jonathan Gray <jonathan.gray@okfn.org>, "Emmanouil Batsis (Manos)" <manos@abiss.gr>, Todd Vincent <todd.vincent@xmllegal.org>, Niklas Lindström <lindstream@gmail.com>, "prof. dr. Tom M. van Engers" <vanengers@uva.nl>, peter.krantz@gmail.com, david osimo <david.osimo@gmail.com>, Jose Manuel Alonso <josema.alonso@fundacionctic.org>, washingtona@acm.org
Message-ID: <4B0BA386.5080009@grapevine.net.au>
Hi all

I've been pondering how to respond to this discussion, and Brian has 
managed to put a few of those thoughts into words, enough that I now 
have the time to write the rest down :)

I agree that artificial constructs are probably not the way to go when 
it comes to Open Data policy - policy will always be in the hands of the 
policy makers, and despite the best intentions of all groups, that 
policy is probably already well on its way to being defined, regardless 
of the opinions we have here, or those that the NGO sector and the 
public have. 
Hence I would have to agree that the development of guidelines and case 
studies around best practice - the gentle guiding hand so to speak, is 
probably the way to go as far as the policy side goes. We accept that 
the policy will be created, whether we are involved in any way or not, 
and try and assist in the best practices around implementing said 
policy, of which there will be many worldwide, and in various shades of 
grey.

However - the wonderful world of technology - the Data side of the 
equation if you will, is thankfully fairly black and white. And it is 
here that the artificial constructs ARE going to be of benefit. One can 
always look at some data, some 1's and 0's, and say "that is open and 
accessible to all"  or "that is not open and accessible to all". 
Remembering that the format of the data has nothing to do with if or how 
Governments allow access to it.

On the one side, you might have a Government that converts all datasets 
into an Open Standard format that is a guiding beacon unto all who wish 
to mash it, yet allows no access to it under the clause that "closed is 
just another shade of open". Or at the other end of the spectrum, you 
might have a Government that places all datasets on open accessible 
public servers and cries "Have at it people", without converting any of 
the data into a human or machine readable format that is actually useful 
- leaving it in some proprietary in-house format that only those with 
very expensive pieces of software can utilise. In this scenario, which 
Government has implemented 'Open Data'?

I guess what I'm saying is that we need to firmly separate the two 
worlds. Work on firm standards and hard advice on the technology side, 
while staying focused on education and outreach when it comes to the 
policy side.

I'd like to finish on two points. The first is a thought - I believe 
that there exists an opportunity on some level to conduct both 
education/outreach as well as setting a standard through a means akin to 
http://www.scoreinthebox.com/ or similar, however based around Open Data 
- while probably not best delivered via W3C or .gov.* (perceived bias or 
non-neutrality could be an issue), it would provide a benchmarking 
system, which, experience shows, always works. It's one thing to see a 
message of "your data is/isn't accessible" as a decision maker, and 
quite another when competition/comparison becomes the driver - "Hey - 
why aren't WE ranked higher than THEM in our Open Data score." By making 
it an open benchmark, it also provides citizens/users a yardstick by 
which to measure their own Gov's Open Dataness against others while at 
the same time providing a metric that data.gov.* sites that are just 
coming into being could use as a case study or something to aspire to. 
I think this would best be delivered by an internationally recognised 
and accepted 
site/organisation in the field of reporting or data, so that it could 
not be tarred with the meddlesome NGO tag, or the "what would they know" 
tag.

The second concerns the artificial construct concept. If this turns out 
to be the direction taken, then I would urge those developing it to use 
a matrix rather than a flat scale. By firmly separating the 
openness of the data (technology) from the openness of access (policy) 
(as I've said above) and placing each on one axis, one could develop a 
snapshot rating of overall Openness - from A1 (Open Data and Open 
Access) through to say F6 (Proprietary Format and Closed System).
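
To make the matrix idea concrete, here is a minimal sketch in Python. 
It assumes that the letter grades the format of the data and the number 
grades access to it, which is only one reading of the A1/F6 example 
above and not a settled scheme. The names FORMAT_GRADES, ACCESS_GRADES 
and rate are hypothetical and exist purely for illustration.

```python
# Hypothetical two-axis openness rating (illustration only, not a standard).
# Assumption: letter = openness of the data format (technology axis),
#             number = openness of access (policy axis).
FORMAT_GRADES = "ABCDEF"            # A = open format ... F = proprietary format
ACCESS_GRADES = (1, 2, 3, 4, 5, 6)  # 1 = open access ... 6 = closed system


def rate(format_level: int, access_level: int) -> str:
    """Combine two independent levels (0 = most open, 5 = least open)
    into a single snapshot grade such as 'A1' or 'F6'."""
    if format_level not in range(len(FORMAT_GRADES)):
        raise ValueError("format_level must be between 0 and 5")
    if access_level not in range(len(ACCESS_GRADES)):
        raise ValueError("access_level must be between 0 and 5")
    return f"{FORMAT_GRADES[format_level]}{ACCESS_GRADES[access_level]}"


if __name__ == "__main__":
    # Open Standard format, but no access allowed at all.
    print(rate(0, 5))
    # Everything on public servers, but in a proprietary in-house format.
    print(rate(5, 0))
```

Under these assumptions the two Governments from the scenario earlier 
in this message come out as A6 and F1 respectively. A flat scale would 
have to rank one above the other; the matrix simply shows that each is 
open on one axis and closed on the other.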

Cheers

Chris



Brian Gryth wrote:
> Defining "open data", although an artificial construction, may still 
> be helpful.   It would establish a beginning point or at least context 
> of how this group, Open Knowledge Foundation, or others conceptualize 
> the term/idea.  "Open" is a tricky term just like "transparency" or 
> "public record".  The fact is if a law or executive order is enacted 
> to mandate the release of government data as open data (or insert 
> appropriate term) then that term will have to be defined.  (Or enter 
> the lawyers!)  Is a definition critical?  I am not sure it is a show 
> stopper.
>  
> What is more to the point are guidelines.  As Josh points out, rigid 
> constructs are likely to create artificial conflict that is not the 
> intent.  Joe's work and Anne's suggestion are good, but I take pause 
> calling any guidelines best practices or a scale.  I'd rather leave 
> the good and evil arguments to Nietzsche and other philosophers.  If 
> the structure of the guidelines could be made more neutral, it may be 
> helpful.  Placing the discussion in a non-judgmental light might be 
> more beneficial.  Talking about how "open data" is helpful or advances 
> the agency's mission will go a long way.  Laying out the pros and cons 
> will also help policy makers and people like me who are promoting 
> these concepts within government. 
>  
> Finally, I wanted to share an interesting post I ran across yesterday 
> that addresses some of these issues, 
> http://osrin.net/2009/10/open-government-data-and-the-great-expectation-gap/
>
> On Mon, Nov 16, 2009 at 2:02 PM, Anne L. Washington 
> <washingtona@acm.org <mailto:washingtona@acm.org>> wrote:
>
>     Lots of great ideas ...from specific examples of new scales to solid
>     definitions of open data.
>
>     I agree w Todd Vincent. We can build on existing definitions. The
>     vision is out there and has been done by many committees and
>     organizations. However, I take Josh's cautionary tale to heart. No
>     need running around pointing fingers and making enemies. The
>     public sector doesn't need accusations. They need advice.
>
>     I am arguing for a scale / scorecard of open-ness. There is
>     already a precedent for this type of hierarchy in evaluating
>     businesses. Software capabilities are evaluated this way.  It
>     would be a familiar tool for anyone in management. From the
>     suggestions in the last few days, it seems we are already moving
>     towards several possibilities.
>
>     Let's say we create a 5 point scale.
>
>     For each point on the scale I'd suggest we start with some generic
>     overall descriptions. Someone can see if they are doing ABC, they
>     are only at 2 on the scale, while if they are doing XYZ, they are
>     scale 5. For simplicity's sake so we can get something out the door,
>     we could just list several technologies under each scale.
>
>     Later we could, possibly, dig into the technology by applying that
>     scale to specific technologies like pdfs (as Joe Carmel has done),
>     to data downloads (as the original message did), to searching
>     data, to rights management, to xml, to html pages... whatever. It
>     would be easy to ask for and build use cases underneath each one.
>
>
>     and btw
>     data.gov.uk <http://data.gov.uk/>, with Tim Berners Lee at the
>     helm, has been announced to open in early December 2009.
>     http://news.bbc.co.uk/2/hi/technology/8311627.stm
>
>
>
>     Anne L. Washington
>     Standards work - W3C egov - washingtona@acm.org
>     <mailto:washingtona@acm.org>
>     Academic work - George Washington University
>     http://home.gwu.edu/~annew/ <http://home.gwu.edu/%7Eannew/>
>
>
>     On Fri, 13 Nov 2009, Joe Carmel wrote:
>
>         I also agree that Brian Gryth's "Access, Rights and Formats /
>         Medium"
>
>         breakdown is a good way to clarify and define best practices
>         in this space
>         and Todd has identified a great set of parameters and criteria
>         to more fully
>         explore and define the meaning behind open data.  I think one
>         of the W3C
>         eGov goals was highlighted by Anne Washington's idea that a
>         more detailed
>         description and hierarchy would be very useful.
>
>         Piggy-backing on Anne's idea, I drafted some ideas as a
>         hierarchy based on
>         my experience parsing PDF files to provide sub-document access
>         capability at
>         www.legislink.org <http://www.legislink.org/>.
>
>         http://legislink.wikispaces.com/message/view/home/14870950#toc1
>         (included
>         below)
>
>         This listing is based on a single criterion and different
>         criteria would lead
>         to different conclusions.  For example, if access performance
>         or ADA
>         compliance were used as criteria, different conclusions would
>         likely be
>         drawn.
>
>         In my breakdown, there are three subcategories (Best,
>         Middle-of-the-Road,
>         and Minimal Practices).  Since different file formats offer
>         different
>         capabilities (e.g., anchors, ids, named-dests) that determine
>         re-usability,
>         listing formats alone (e.g., HTML, XML, PDF) is not sufficient
>         to categorize
>         a practice as best, good, or minimal.  For example, if a file
>         is in XML, it
>         still might NOT have metadata, ids, or even internal structure
>         (e.g., a text
>         file surrounded by one set of tags making it equivalent to an
>         ASCII text
>         file).  The use of file format practices makes a format more
>         or less
>         reusable and "open".  Alternatively, worst practices might be
>         considered
>         "open" in the sense that the data is at a minimum available on
>         the web.
>         Publishing scanned historic material is certainly better than
>         not providing
>         the material at all (e.g.,
>         http://clerk.house.gov/member_info/electionInfo/index.html).  If I
>         understand Anne correctly, I think she's trying to get at this
>         sort of
>         hierarchical list and explanation.
>
>         It seems to me that building such hierarchical use case lists
>         would be a
>         good task for the IG or one of the sub-groups we started
>         discussing at the
>         end of our last call.
>
>
>         On a related topic, for those who haven't seen it,
>         http://www.gcn.com/Articles/2009/10/30/Berners-Lee-Semantic-Web.aspx#
>         is an
>         interesting recent article covering an interview with Sir Tim
>         Berners-Lee at
>         the International Semantic Web Conference (Tim Berners-Lee:
>         Machine-readable
>         Web still a ways off).
>
>                "He said that the use of RDF should not require
>         building new
>         systems,
>                or changing the way site administers work, reminiscing
>         about how
>         many
>                of the original Web sites were linked back to legacy
>         mainframe
>         systems.
>                Instead, scripts can be written in Python, Perl or
>         other languages
>         that
>                can convert data in spreadsheets or relational
>         databases into RDF
>         for
>                the end-users. "You will want to leave the social
>         processes in
>         place,
>                leave the technical systems in place," he said. "
>
>         This statement is practical and sounds very much like a call
>         to identify
>         which format conditions are more or less convertible to RDF (e.g.,
>         http://djpowell.net/blog/entries/Atom-RDF.html,
>         http://blogs.sun.com/bblfish/entry/excell_and_rdf).  Maybe
>         this sort of work
>         is already being done as a separate effort at W3C or
>         elsewhere?  If a Python
>         or Perl script could be written to convert formats to RDF, it
>         would be
>         possible to build such conversions as on-the-fly web services
>         or even as a
>         spider.  Given a probable future that includes such services,
>         formats that
>         are currently considered less "open" will likely be viewed
>         differently under
>         at least some circumstances.  I think all of this probably
>         deserves some
>         attention and would go a long way toward helping governments
>         and the public
>         understand the implications of government electronic
>         publication practices
>         for open data.
>
>         Thanks much,
>
>         Joe
>
>         BEST PRACTICES: Direct sub-document access, no file
>         modification needed
>
>         1. XML with ids
>         Direct access to every level is possible (author determined)
>
>         2. well-formed HTML with anchors
>         Direct access to every level is possible (author determined)
>
>         3. PDF with named destinations
>         Direct access to every level is possible (pages are automatic,
>         others are
>         author determined)
>         From
>         http://partners.adobe.com/public/developer/en/acrobat/sdk/pdf/pdf_creation_apis_and_specs/pdfmarkReference.pdf#page=47
>         "Named destinations may be appended to URLs, following a "#"
>         character, as
>         in http://www.adobe.com/test.pdf#nameddest=name. The Acrobat
>         viewer displays
>         the part of the PDF file specified in the named destination"
>
>         4. PDF (non-image based)
>         Page access by default
>
>
>         MIDDLE OF THE ROAD: Direct sub-document access only possible
>         with file
>         modification
>         1. TXT files with consistent formatting
>         Direct access to consistently formatted levels with file
>         modification
>
>         2. XML without ids
>         Direct access to consistently formatted levels with file
>         modification
>         If browsers supported Xpointer within the URL entry, XML and
>         well-formed
>         HTML would be in the "best" category
>         (see http://www.w3schools.com/xlink/xpointer_example.asp#)
>
>         3. HTML without anchors
>         Direct access to consistently formatted levels with file
>         modification
>
>
>         WORST PRACTICES: OCR needed or human readers only
>         1. PDF (image based only)
>         2. TIFF
>         3. Proprietary models where the document cannot be viewed in
>         all browsers.
>
>
>
>
>
>
> -- 
> Brian Peltola Gryth
> 715 Logan street
> Denver, CO 80203
> 303-748-5447
> twitter.com/briangryth <http://twitter.com/briangryth>
Received on Tuesday, 24 November 2009 09:12:57 UTC