Re: Use machine-readable standardized data formats / Use non-proprietary data formats from Eric Stephan on 2015-08-16 (public-dwbp-wg@w3.org from August 2015)

From: Eric Stephan <ericphb@gmail.com>
Date: Sun, 16 Aug 2015 06:38:55 -0700
To: Annette Greiner <amgreiner@lbl.gov>
Cc: DWBP WG <public-dwbp-wg@w3.org>, Erik Wilde <dret@berkeley.edu>
Message-ID: <CAMFz4jisLM=hd+QhhMUAvuHzTo0ip8nKHb+h4_qiO-yWfzs7nQ@mail.gmail.com>

Great thoughts, Annette,

Comments below....

Eric S.

On Sat, Aug 15, 2015 at 6:25 PM, Annette Greiner <amgreiner@lbl.gov> wrote:

> I think Erik's concern is an important one that we as a group need to
> address.
> It's a crucial part of a broader concern that I have with the current BP
> document in general. My reading of the charter has led me to believe that
> we have two clear duties that our use-case-focused process has not served
> well in addressing.


Agreed. Part of the constraint in developing use cases is knowing what
granularity is really needed.  The Resource Discovery for Extreme Scale
Collaboration (rdesc.org) use case, http://www.w3.org/TR/dwbp-ucr/#UC-RDESC,
doesn't go into hellish detail about how difficult it was to extract
metadata from over 770K datasets to make them searchable on the web.

From a practical standpoint, I think there was a point of saturation and a
cutoff on the number of core requirements we felt we could handle as a
working group.  While I would have preferred more detail, this is what we
decided as a working group to go with.  At one point Deirdre provided a
matrix of candidate expanded requirements; the call was made to go with a
core group of requirements that were visible across a number of different
use cases.

> We need to address the issues that get in the way of sharing data online,
> for both publishers and reusers of data. For data re-users, the primary
> roadblock that I see is that publishers often put out data that is
> difficult for them to work with for various reasons.


Several in the working group, including myself, argued this point, but the
majority of the working group felt the BP should be oriented to publishers
(I believe this was winter 2015).  From what I recall, consensus came when
it was agreed that the BP document should be for publishers of the data and
that any discussion of consumption should be handled in the vocabulary
documents.

Perhaps there is an opportunity to fortify both vocabulary documents to
address some shortcomings on reuse, but there is a question of how the BP
document refers to the vocabularies to make publishing and reuse more
cohesive.  Is better integration of the BP and the vocabularies a possible
F2F topic?


> Lack of webbiness is one of those reasons. For publishers, the inability
> to monitor usage is the primary issue. I see the BP document and the DQV as
> addressing the main issue for data re-users and the DUV as addressing the
> main issue for publishers. Of course, all the docs are potentially useful
> for all parties, but I think we need to better address the prime motivators.
>

Again, it was previously decided that the BP document is largely
data-publisher focused and that consumption should be handled elsewhere.
Which aspects of usage monitoring aren't the DQV and DUV handling?  What
requirements are missing?


> My concern with our process is that it hasn't exposed issues like the one
> about data cleanliness, which I think is fundamental to re-use. I find it
> concerning that at this point in our work we don't already all have an
> understanding of what that issue is. (I mentioned it back in January, but
> somehow we let it go--myself included, so mea culpa as well.) As a
> developer whose job involves creating analysis tools for data on the web, I
> can tell you that dirty data is a key source of frustration for people like
> me. By cleanliness, I mean the extent to which I can use data without
> having to alter it to make it amenable to processing. When making a
> visualization from existing data, the first step is to clean it--to find
> missing values, inconsistencies, and values coded in misleading ways, such
> as 99 instead of "not available" or zero for "no response" that, when
> plotted, will make nonsense of the visualization. I am much, much more
> likely to re-use a dataset that is reasonably clean than one that is not,
> because it simply requires much less work. A clean dataset encourages me to
> trust the publisher, giving me confidence that I won't end up working with
> a messy dataset that I thought would be a clean one.
>
>
Perhaps this is a follow-on working group activity, or could be handled by
one or more "devil in the details" data community groups?  From the
discovery perspective, projects like RDESC have different kinds of needs.


> Use cases don't reveal these sorts of issues very well, because they are
> about determining requirements, what is needed rather than how it should
> best be provided. Fortunately, the members of this group know enough about
> publishing on the web to have extracted some useful ideas despite the
> approach, so we have something that is a little helpful. I think we could
> be even more helpful by considering how we might more directly address what
> makes data sharing on the web difficult for our readers. Maybe we should be
> meeting with developer groups, having a more formal presence at relevant
> conferences, or doing surveys. What other things could we be doing to make
> the BP document more insightful?
>

Because our working group is within the Data Activity, I have always felt
open data is more about broadening the appeal of data on the web, thinking
beyond the linked data (semantic web) community.  IMO the BP would be a
success if it provided guidance to data publishers outside the linked data
community who could adopt similar principles.

I don't disagree with any of the valid points on data published and re-used
on the web.  I am wondering how some of these points could be tackled in
the vocabularies.




> -Annette
>
> On Aug 15, 2015, at 10:45 AM, Erik Wilde <dret@berkeley.edu> wrote:
>
> > portals like this exist in many countries now, and many of them are
> > rather un-webby. giving them a focused and reasonable checklist of things
> > they should do to become more webby, and explaining why, would be great
> > guidance coming from the W3C and being applicable to many of the
> > e-government activities going on nowadays.
> >
> > btw, that's exactly how web data started: for document-driven domains,
> > telling people to publish their data in RDF is non-sensical. but that's
> > what linked data tells them to do. web data is an attempt to encourage
> > people to be webby, without prescribing the metamodel they have to use.
> > it's about how to be webby without having to be semwebby.
>
>
>
Received on Sunday, 16 August 2015 13:39:24 UTC