Re: Use machine-readable standardized data formats / Use non-proprietary data formats

I think Erik's concern is an important one that we as a group need to address.
It's a crucial part of a broader concern I have with the current BP document in general. My reading of the charter has led me to believe that we have two clear duties that our use-case-focused process has not served well: we need to address the issues that get in the way of sharing data online, for both publishers and re-users of data. For data re-users, the primary roadblock I see is that publishers often put out data that is difficult for re-users to work with, for various reasons; lack of webbiness is one of those reasons. For publishers, the inability to monitor usage is the primary issue. I see the BP document and the DQV as addressing the main issue for data re-users, and the DUV as addressing the main issue for publishers. Of course, all the docs are potentially useful for all parties, but I think we need to better address the prime motivators.

My concern with our process is that it hasn't exposed issues like the one about data cleanliness, which I think is fundamental to re-use. I find it concerning that at this point in our work we don't already share an understanding of what that issue is. (I mentioned it back in January, but somehow we let it go--myself included, so mea culpa as well.)

As a developer whose job involves creating analysis tools for data on the web, I can tell you that dirty data is a key source of frustration for people like me. By cleanliness, I mean the extent to which I can use data without having to alter it to make it amenable to processing. When making a visualization from existing data, the first step is cleaning it: finding missing values, inconsistencies, and values coded in misleading ways--say, 99 for "not available" or zero for "no response"--that, if plotted at face value, make nonsense of the visualization. I am much, much more likely to re-use a dataset that is reasonably clean than one that is not, simply because it requires much less work. A clean dataset also encourages me to trust the publisher; it gives me confidence that I won't discover halfway through a project that the data I took to be clean is actually a mess.
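To make that concrete, here's a minimal sketch of the kind of pre-processing I mean, in Python with pandas. The file name, column names, and sentinel codes are all made-up examples; the point is only how much work stands between downloading a dataset and plotting it.

    # Minimal cleanup sketch; the file, columns, and codes are hypothetical.
    import numpy as np
    import pandas as pd

    df = pd.read_csv("survey.csv")

    # Sentinel codes that make nonsense of a plot if taken at face value:
    # 99 standing in for "not available", 0 for "no response".
    df["age"] = df["age"].replace({99: np.nan})
    df["income"] = df["income"].replace({0: np.nan})

    # The same category spelled inconsistently across rows.
    df["state"] = df["state"].str.strip().str.upper()

    # Only now is the data safe to summarize or plot.
    clean = df.dropna(subset=["age", "income"])
    print(clean.describe())

None of this is hard, but multiplied across every dataset a re-user touches it adds up--and it's exactly the work a publisher could do once, for everyone.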

Use cases don't reveal these sorts of issues very well, because they are about determining requirements--what is needed, rather than how it is best provided. Fortunately, the members of this group know enough about publishing on the web to have extracted some useful ideas despite that approach, so what we have is still somewhat helpful. I think we could be more helpful still by considering how to address more directly what makes data sharing on the web difficult for our readers. Maybe we should be meeting with developer groups, establishing a more formal presence at relevant conferences, or running surveys. What else could we be doing to make the BP document more insightful?
-Annette

On Aug 15, 2015, at 10:45 AM, Erik Wilde <dret@berkeley.edu> wrote:

> portals like this exist in many countries now, and many of them are rather un-webby. giving them a focused and reasonable checklist of things they should do to become more webby, and explaining why, would be great guidance coming from the W3C and being applicable to many of the e-government activities going on nowadays.
> 
> btw, that's exactly how web data started: for document-driven domains, telling people to publish their data in RDF is nonsensical. but that's what linked data tells them to do. web data is an attempt to encourage people to be webby, without prescribing the metamodel they have to use. it's about how to be webby without having to be semwebby.
