Re: Use machine-readable standardized data formats / Use non-proprietary data formats from Annette Greiner on 2015-08-16 (public-dwbp-wg@w3.org from August 2015)

From: Annette Greiner <amgreiner@lbl.gov>
Date: Sun, 16 Aug 2015 16:38:17 -0700
To: Eric Stephan <ericphb@gmail.com>
Cc: DWBP WG <public-dwbp-wg@w3.org>, Erik Wilde <dret@berkeley.edu>
Message-Id: <6589D241-16DC-4F24-A740-1916524D0676@lbl.gov>
Just to be clear, I am not suggesting that the BP document's audience should be data re-users. I think it should be read by data publishers so that they can publish data that is not problematic for data re-users. So the BP document addresses the main issue that re-users have by speaking to publishers directly.
I fully agree that we should be thinking beyond the semantic web community. Linked Open Data already addresses that. I think our group was created to be more aware of what people really do in this space, so that it would not be dismissed as being pie-in-the-sky.
-Annette

On Aug 16, 2015, at 6:38 AM, Eric Stephan <ericphb@gmail.com> wrote:

> Great thoughts Annette,
> 
> Comments below....
> 
> Eric S.
> 
> On Sat, Aug 15, 2015 at 6:25 PM, Annette Greiner <amgreiner@lbl.gov> wrote:
> I think Erik's concern is an important one that we as a group need to address.
> It's a crucial part of a broader concern that I have with the current BP document in general. My reading of the charter has led me to believe that we have two clear duties that our use-case-focused process has not served well in addressing.
> 
> Agreed, part of the constraint about developing use cases is knowing what granularity is really needed. The Resource Discovery for Extreme Scale Collaboration (rdesc.org) use case (http://www.w3.org/TR/dwbp-ucr/#UC-RDESC) doesn't go into hellish detail about how difficult it was to extract metadata from over 770K datasets to make them searchable on the web.
> 
> From a practical standpoint, I think there was a point of saturation and a cutoff on the number of core requirements we felt we could handle as a working group. While I would have preferred more detail, this is what we decided as a working group to go with. At one point Deirdre provided a matrix of candidate expanded requirements; the call was made to go with a core group of requirements that were visible across a number of different use cases.
> 
> We need to address the issues that get in the way of sharing data online, for both publishers and reusers of data. For data re-users, the primary roadblock that I see is that publishers often put out data that is difficult for them to work with for various reasons.
> 
> Several in the working group including myself argued this point, but the majority of the working group felt BP should be oriented to publishers (I believe this was winter 2015).   From what I recall, the consensus came when it was agreed that the BP document should be for publishers of the data and that any discussion of consumption should be handled in the vocabulary documents.
> 
> Perhaps there is an opportunity to fortify both vocabulary documents to meet some shortcomings on reuse, but there is a question of how the BP document refers to the vocabularies to make publishing and reuse more cohesive. Is this a possible F2F topic on the better integration of BP and the vocabularies?
>  
> Lack of webbiness is one of those reasons. For publishers, the inability to monitor usage is the primary issue. I see the BP document and the DQV as addressing the main issue for data re-users and the DUV as addressing the main issue for publishers. Of course, all the docs are potentially useful for all parties, but I think we need to better address the prime motivators.
> 
> Again it was previously decided that the BP document is largely data publisher focused and that consumption should be handled elsewhere.  What aspects of the DQV and DUV aren't handling usage monitoring?  What requirements are missing?  
>  
> My concern with our process is that it hasn't exposed issues like the one about data cleanliness, which I think is fundamental to re-use. I find it concerning that at this point in our work we don't already all have an understanding of what that issue is. (I mentioned it back in January, but somehow we let it go--myself included, so mea culpa as well.) As a developer whose job involves creating analysis tools for data on the web, I can tell you that dirty data is a key source of frustration for people like me. By cleanliness, I mean the extent to which I can use data without having to alter it to make it amenable to processing. When making a visualization from existing data, the first step is to clean it--to find missing values, inconsistencies, and values coded in misleading ways, such as 99 instead of "not available" or zero for "no response" that, when plotted, will make nonsense of the visualization. I am much, much more likely to re-use a dataset that is reasonably clean than one that is not, because it simply requires much less work. A clean dataset encourages me to trust the publisher, giving me confidence that I won't end up working with a messy dataset that I thought would be a clean one.
> 
> 
> Perhaps this is a follow-on working group activity, or it could be handled by one or more "devil in the details" data community groups? From the discovery perspective, projects like RDESC have different kinds of needs.
>  
> Use cases don't reveal these sorts of issues very well, because they are about determining requirements, what is needed rather than how it should best be provided. Fortunately, the members of this group know enough about publishing on the web to have extracted some useful ideas despite the approach, so we have something that is a little helpful. I think we could be even more helpful by considering how we might more directly address what makes data sharing on the web difficult for our readers. Maybe we should be meeting with developer groups, having a more formal presence at relevant conferences, or doing surveys. What other things could we be doing to make the BP document more insightful?
> 
> Because our working group is within the Data Activity, I have always felt open data is more about broadening the appeal of data on the web, thinking beyond the linked data (semantic web) community. Imo BP would be a success if it provided guidance to data publishers not in the linked data community who could adopt similar principles.
> 
> I don't disagree with any of the valid points on data published and re-used on the web; I am wondering how at some point these points could be tackled in the vocabularies.
> 
> 
>  
> -Annette
> 
> On Aug 15, 2015, at 10:45 AM, Erik Wilde <dret@berkeley.edu> wrote:
> 
> > portals like this exist in many countries now, and many of them are rather un-webby. giving them a focused and reasonable checklist of things they should do to become more webby, and explaining why, would be great guidance coming from the W3C and being applicable to many of the e-government activities going on nowadays.
> >
> > btw, that's exactly how web data started: for document-driven domains, telling people to publish their data in RDF is nonsensical. but that's what linked data tells them to do. web data is an attempt to encourage people to be webby, without prescribing the metamodel they have to use. it's about how to be webby without having to be semwebby.
> 
> 
>
Received on Sunday, 16 August 2015 23:38:54 UTC