- From: Joe Carmel <joe.carmel@comcast.net>
- Date: Fri, 13 Nov 2009 13:45:41 -0500
- To: "'Daniel Dietrich'" <daniel@so36.net>, "'Brian Gryth'" <briangryth@gmail.com>
- Cc: "'Jonathan Gray'" <jonathan.gray@okfn.org>, "'eGovIG IG'" <public-egov-ig@w3.org>, "'Emmanouil Batsis \(Manos\)'" <manos@abiss.gr>, "'Todd Vincent'" <todd.vincent@xmllegal.org>, 'Niklas Lindström' <lindstream@gmail.com>, "'prof. dr. Tom M. van Engers'" <vanengers@uva.nl>, <peter.krantz@gmail.com>, "'david osimo'" <david.osimo@gmail.com>, "'Jose Manuel Alonso'" <josema.alonso@fundacionctic.org>, <washingtona@acm.org>
I also agree that Brian Gryth’s "Access, Rights and Formats / Medium" breakdown is a good way to clarify and define best practices in this space, and Todd has identified a great set of parameters and criteria to more fully explore and define the meaning behind open data. I think one of the W3C eGov goals was highlighted by Anne Washington’s idea that a more detailed description and hierarchy would be very useful.

Piggy-backing on Anne's idea, I drafted a hierarchy based on my experience parsing PDF files to provide sub-document access capability at www.legislink.org:

http://legislink.wikispaces.com/message/view/home/14870950#toc1 (included below)

This listing is based on a single criterion (sub-document access), and different criteria would lead to different conclusions. For example, if access performance or ADA compliance were used as the criterion, different conclusions would likely be drawn.

In my breakdown, there are three categories: Best, Middle-of-the-Road, and Minimal Practices (the last is labeled "Worst Practices" in the listing below). Since different file formats offer different capabilities (e.g., anchors, ids, named-dests) that determine re-usability, listing formats alone (e.g., HTML, XML, PDF) is not sufficient to categorize a practice as best, good, or minimal. For example, a file in XML still might NOT have metadata, ids, or even internal structure (e.g., a text file surrounded by one set of tags is equivalent to an ASCII text file). It is the practices followed within a file format that make a file more or less reusable and "open".

Alternatively, worst practices might be considered "open" in the sense that the data is at a minimum available on the web. Publishing scanned historic material is certainly better than not providing the material at all (e.g., http://clerk.house.gov/member_info/electionInfo/index.html).

If I understand Anne correctly, I think she's trying to get at this sort of hierarchical list and explanation.
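As a rough illustration of the XML point above, here is a minimal Python sketch that counts how many elements in a document carry an id. The function name `id_coverage` and both sample documents are made up for illustration, and the plain `id` attribute name is an assumption (a given schema might use `xml:id` or something else).

```python
import xml.etree.ElementTree as ET


def id_coverage(xml_text, id_attr="id"):
    """Return (elements_with_id, total_elements) for an XML string.

    id_attr is an assumed attribute name; adjust it to the schema in use.
    """
    root = ET.fromstring(xml_text)
    elements = list(root.iter())  # includes the root element itself
    with_id = [e for e in elements if e.get(id_attr) is not None]
    return len(with_id), len(elements)


# Hypothetical sample: plain text wrapped in one set of tags.
flat = "<doc>Section 1. Short title. Section 2. Definitions.</doc>"

# Hypothetical sample: the same text with structure and ids.
structured = (
    '<doc id="bill">'
    '<section id="s1">Short title.</section>'
    '<section id="s2">Definitions.</section>'
    "</doc>"
)

print(id_coverage(flat))        # (0, 1)
print(id_coverage(structured))  # (3, 3)
```

A count like this says nothing about metadata or about whether the ids are meaningful, so it is only a first-pass check rather than a measure of openness.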
It seems to me that building such hierarchical use case lists would be a good task for the IG or one of the sub-groups we started discussing at the end of our last call.

On a related topic, for those who haven't seen it, http://www.gcn.com/Articles/2009/10/30/Berners-Lee-Semantic-Web.aspx# is an interesting recent article covering an interview with Sir Tim Berners-Lee at the International Semantic Web Conference (Tim Berners-Lee: Machine-readable Web still a ways off).

"He said that the use of RDF should not require building new systems, or changing the way site administers work, reminiscing about how many of the original Web sites were linked back to legacy mainframe systems. Instead, scripts can be written in Python, Perl or other languages that can convert data in spreadsheets or relational databases into RDF for the end-users. 'You will want to leave the social processes in place, leave the technical systems in place,' he said."

This statement is practical and sounds very much like a call to identify which format conditions are more or less convertible to RDF (e.g., http://djpowell.net/blog/entries/Atom-RDF.html, http://blogs.sun.com/bblfish/entry/excell_and_rdf). Maybe this sort of work is already being done as a separate effort at W3C or elsewhere?

If a Python or Perl script can be written to convert a format to RDF, the conversion could also be offered as an on-the-fly web service or even run by a spider. Given a probable future that includes such services, formats that are currently considered less "open" will likely be viewed differently under at least some circumstances.

I think all of this probably deserves some attention and would go a long way toward helping governments and the public understand the implications of government electronic publication practices for open data.

Thanks much,
Joe

BEST PRACTICES: Direct sub-document access, no file modification needed

1. XML with ids
   Direct access to every level is possible (author determined)

2. Well-formed HTML with anchors
   Direct access to every level is possible (author determined)

3. PDF with named destinations
   Direct access to every level is possible (pages are automatic, others are author determined)

   From http://partners.adobe.com/public/developer/en/acrobat/sdk/pdf/pdf_creation_apis_and_specs/pdfmarkReference.pdf#page=47
   “Named destinations may be appended to URLs, following a “#” character, as in http://www.adobe.com/test.pdf#nameddest=name. The Acrobat viewer displays the part of the PDF file specified in the named destination”

4. PDF (non-image based)
   Page access by default

MIDDLE OF THE ROAD: Direct sub-document access only possible with file modification

1. TXT files with consistent formatting
   Direct access to consistently formatted levels with file modification

2. XML without ids
   Direct access to consistently formatted levels with file modification

   If browsers supported XPointer within the URL entry, XML and well-formed HTML would be in the "best" category (see http://www.w3schools.com/xlink/xpointer_example.asp#)

3. HTML without anchors
   Direct access to consistently formatted levels with file modification

WORST PRACTICES: OCR needed or human readers only

1. PDF (image based only)
2. TIFF
3. Proprietary models where the document cannot be viewed in all browsers.
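
To make the "direct sub-document access" idea in the listing above concrete, here is a minimal Python sketch that builds the kinds of URLs the best practices rely on: a bare fragment for an XML id or HTML anchor, and the `#nameddest=` and `#page=` forms for PDF. The function name `subdocument_url` and the example.org addresses are hypothetical, and whether a given browser or PDF viewer actually honors the fragment is an assumption that depends on the software in use.

```python
from urllib.parse import quote


def subdocument_url(base_url, kind, target):
    """Build a URL that points inside a document.

    kind: "id" for an XML id or HTML anchor,
          "nameddest" or "page" for a PDF open parameter.
    """
    if kind == "id":
        fragment = quote(str(target))
    elif kind in ("nameddest", "page"):
        fragment = f"{kind}={quote(str(target))}"
    else:
        raise ValueError(f"unsupported kind: {kind}")
    return f"{base_url}#{fragment}"


# The named-destination form quoted from the pdfmark reference above.
print(subdocument_url("http://www.adobe.com/test.pdf", "nameddest", "name"))
# http://www.adobe.com/test.pdf#nameddest=name

# Page access, available by default for non-image PDFs (hypothetical URL).
print(subdocument_url("http://example.org/bill.pdf", "page", 47))
# http://example.org/bill.pdf#page=47

# HTML anchor or XML id (hypothetical URL and anchor name).
print(subdocument_url("http://example.org/bill.html", "id", "sec2"))
# http://example.org/bill.html#sec2
```

For the middle-of-the-road formats no such URL can be built until ids or anchors have been added to the file, which is the file modification the listing refers to.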
Received on Friday, 13 November 2009 18:46:25 UTC