RE: Defining "Open" Data (was RE: no F2F3 in 2009 -- Re: Agenda, eGov IG Call, 11 Nov 2009)

I also agree that Brian Gryth’s "Access, Rights and Formats / Medium"
breakdown is a good way to clarify and define best practices in this space
and Todd has identified a great set of parameters and criteria to more fully
explore and define the meaning behind open data.  I think one of the W3C
eGov goals was highlighted by Anne Washington’s idea that a more detailed
description and hierarchy would be very useful.  

Piggy-backing on Anne's idea, I drafted some ideas as a hierarchy based on
my experience parsing PDF files to provide sub-document access capability at
www.legislink.org. 

 http://legislink.wikispaces.com/message/view/home/14870950#toc1 (included
below)

This listing is based on a single criterion and different criteria would lead
to different conclusions.  For example, if access performance or ADA
compliance were used as criteria, different conclusions would likely be
drawn.

In my breakdown, there are three subcategories (Best, Middle-of-the-Road,
and Minimal Practices).  Since different file formats offer different
capabilities (e.g., anchors, ids, named-dests) that determine re-usability,
listing formats alone (e.g., HTML, XML, PDF) is not sufficient to categorize
a practice as best, good, or minimal.  For example, if a file is in XML, it
still might NOT have metadata, ids, or even internal structure (e.g., a text
file surrounded by one set of tags making it equivalent to an ASCII text
file).  How a format's capabilities are actually used determines how
reusable and "open" a file is.  Alternatively, worst practices might be
considered "open" in the sense that the data is at a minimum available on
the web.
Publishing scanned historic material is certainly better than not providing
the material at all (e.g.,
http://clerk.house.gov/member_info/electionInfo/index.html).  If I
understand Anne correctly, I think she's trying to get at this sort of
hierarchical list and explanation.
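
To make the XML point concrete, here is a minimal Python sketch (mine, not part of the legislink.org code) that reports whether an XML file has any internal structure or ids to link to.  The function name describe_xml and the two sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def describe_xml(text):
    """Summarize how addressable an XML document is (illustrative only)."""
    root = ET.fromstring(text)
    elements = list(root.iter())  # root plus every descendant element
    ids = [el.get("id") for el in elements if el.get("id")]
    return {
        "elements": len(elements),
        "ids": ids,
        # A single wrapper element around plain text has no internal structure.
        "has_structure": len(elements) > 1,
    }


# Hypothetical samples: the same text, published two different ways.
flat = "<doc>Section 1. Short title. Section 2. Definitions.</doc>"
structured = (
    '<doc><section id="s1">Short title.</section>'
    '<section id="s2">Definitions.</section></doc>'
)

print(describe_xml(flat))
print(describe_xml(structured))
```

Both samples are "XML", but only the second gives a reader anything below the document level to point at.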

It seems to me that building such hierarchical use case lists would be a
good task for the IG or one of the sub-groups we started discussing at the
end of our last call.


On a related topic, for those who haven't seen it,
http://www.gcn.com/Articles/2009/10/30/Berners-Lee-Semantic-Web.aspx# is an
interesting recent article covering an interview with Sir Tim Berners-Lee at
the International Semantic Web Conference (Tim Berners-Lee: Machine-readable
Web still a ways off).

	"He said that the use of RDF should not require building new systems,
	or changing the way site administers work, reminiscing about how many
	of the original Web sites were linked back to legacy mainframe systems.
	Instead, scripts can be written in Python, Perl or other languages that
	can convert data in spreadsheets or relational databases into RDF for
	the end-users. "You will want to leave the social processes in place,
	leave the technical systems in place," he said."

This statement is practical and sounds very much like a call to identify
which format conditions are more or less convertible to RDF (e.g.,
http://djpowell.net/blog/entries/Atom-RDF.html,
http://blogs.sun.com/bblfish/entry/excell_and_rdf).  Maybe this sort of work
is already being done as a separate effort at W3C or elsewhere?  If a Python
or Perl script could be written to convert formats to RDF, it would be
possible to build such conversions as on-the-fly web services or even as a
spider.  Given a probable future that includes such services, formats that
are currently considered less "open" will likely be viewed differently under
at least some circumstances.  I think all of this probably deserves some
attention and would go a long way toward helping governments and the public
understand the implications of government electronic publication practices
for open data.
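
As a rough idea of what such a script could look like, the sketch below turns spreadsheet-style CSV rows into N-Triples using only the Python standard library.  It is an assumption-laden illustration, not an existing W3C tool: the http://example.org/ namespace, the csv_to_ntriples name, and the sample data are all hypothetical, and every value is emitted as a plain string literal.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, for illustration only


def escape_literal(value):
    """Escape the characters that must be escaped inside an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def csv_to_ntriples(csv_text, key_column):
    """Emit one triple per non-empty cell, using key_column to name the subject."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        subject = f"<{BASE}record/{quote(row[key_column], safe='')}>"
        for column, value in row.items():
            if column == key_column or not value:
                continue  # skip the key itself and empty cells
            predicate = f"<{BASE}prop/{quote(column, safe='')}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return "\n".join(lines)


# Hypothetical spreadsheet export.
sample = "bill,title,year\nHR1,Example Act,2009\n"
print(csv_to_ntriples(sample, "bill"))
```

The same function could sit behind a web service or a spider, leaving the publishing process and the spreadsheet itself untouched.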

Thanks much, 

Joe

BEST PRACTICES: Direct sub-document access, no file modification needed

1. XML with ids
Direct access to every level is possible (author determined)

2. well-formed HTML with anchors
Direct access to every level is possible (author determined)

3. PDF with named destinations
Direct access to every level is possible (pages are automatic, others are
author determined)
From
http://partners.adobe.com/public/developer/en/acrobat/sdk/pdf/pdf_creation_apis_and_specs/pdfmarkReference.pdf#page=47
“Named destinations may be appended to URLs, following a “#” character, as
in http://www.adobe.com/test.pdf#nameddest=name. The Acrobat viewer displays
the part of the PDF file specified in the named destination”

4. PDF (non-image based)
Page access by default
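
The four cases above all come down to appending a fragment to the document URL.  A minimal Python sketch follows, assuming a hypothetical helper named sub_document_url and made-up example.org URLs; the "#nameddest=" and "#page=" forms are the ones shown in the Adobe reference quoted above.

```python
def sub_document_url(base_url, kind, target):
    """Build a direct link into a document (illustrative helper only)."""
    if kind == "id":
        # XML id or HTML anchor: a plain fragment identifier.
        return f"{base_url}#{target}"
    if kind == "nameddest":
        # PDF named destination, defined by the author.
        return f"{base_url}#nameddest={target}"
    if kind == "page":
        # PDF page number, available by default in text-based PDFs.
        return f"{base_url}#page={int(target)}"
    raise ValueError(f"unknown link kind: {kind}")


# Hypothetical documents and targets.
print(sub_document_url("http://example.org/bill.html", "id", "sec2"))
print(sub_document_url("http://example.org/bill.pdf", "nameddest", "sec2"))
print(sub_document_url("http://example.org/bill.pdf", "page", 47))
```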


MIDDLE OF THE ROAD: Direct sub-document access only possible with file
modification
1. TXT files with consistent formatting
Direct access to consistently formatted levels with file modification

2. XML without ids
Direct access to consistently formatted levels with file modification
If browsers supported XPointer within the URL entry, XML and well-formed
HTML would be in the "best" category
(see http://www.w3schools.com/xlink/xpointer_example.asp#)

3. HTML without anchors
Direct access to consistently formatted levels with file modification


WORST PRACTICES: OCR needed or human readers only
1. PDF (image based only)
2. TIFF
3. Proprietary models where the document cannot be viewed in all browsers.

Received on Friday, 13 November 2009 18:46:25 UTC