Re: Use machine-readable standardized data formats / Use non-proprietary data formats from Bernadette Farias Lóscio on 2015-08-20 (public-dwbp-wg@w3.org from August 2015)

From: Bernadette Farias Lóscio <bfl@cin.ufpe.br>
Date: Thu, 20 Aug 2015 19:24:00 -0300
To: Annette Greiner <amgreiner@lbl.gov>
Cc: Eric Stephan <ericphb@gmail.com>, DWBP WG <public-dwbp-wg@w3.org>, Erik Wilde <dret@berkeley.edu>
Message-ID: <CANx1PzwuUtcuYTbxGvjJCFrx5GtgW-LrvjbkSvj3y9wKJLk69Q@mail.gmail.com>

Hi all,

This has been a very interesting discussion!

Concerning the scope, I'd like to remind everyone that some decisions were made
about the scope of our BP and are available on the wiki [1]. Besides, we
already agreed to use the dataset definition of DCAT in our document. I
think this decision was made during the F2F at Santa Clara (I'm not sure).

In the 2nd draft, we also included a section [2] to explain the meaning of
"Publishing Data on the Web". IMO this section may be improved with some of
the thoughts of this thread. So, I suggest that we work together to have
something more "concrete" to include in the document. In order to
facilitate our work, I created a wiki page [3] with some initial thoughts.

It is really important that we finish this section before our next
F2F.

Please feel free to improve the text as well as to include your thoughts
about this. We are looking forward to your contributions.

Thanks!
Bernadette

[1] https://www.w3.org/2013/dwbp/wiki/Scope#Proposed_Scope_April_2015
[2] http://www.w3.org/TR/dwbp/#context
[3] https://www.w3.org/2013/dwbp/wiki/Context

2015-08-16 20:38 GMT-03:00 Annette Greiner <amgreiner@lbl.gov>:

> Just to be clear, I am not suggesting that the BP document's audience
> should be data re-users. I think it should be read by data publishers so
> that they can publish data that is not problematic for data re-users. So
> the BP document addresses the main issue that re-users have by speaking to
> publishers directly.
> I fully agree that we should be thinking beyond the semantic web
> community. Linked Open Data already addresses that. I think our group was
> created to be more aware of what people really do in this space, so that it
> would not be dismissed as being pie-in-the-sky.
> -Annette
>
> On Aug 16, 2015, at 6:38 AM, Eric Stephan <ericphb@gmail.com> wrote:
>
> Great thoughts Annette,
>
> Comments below....
>
> Eric S.
>
> On Sat, Aug 15, 2015 at 6:25 PM, Annette Greiner <amgreiner@lbl.gov>
> wrote:
>
>> I think Erik's concern is an important one that we as a group need to
>> address.
>> It's a crucial part of a broader concern that I have with the current BP
>> document in general. My reading of the charter has led me to believe that
>> we have two clear duties that our use-case-focused process has not served
>> well in addressing.
>
>
> Agreed, part of the constraint about developing use cases is knowing what
> granularity is really needed.  The Resource Discovery for Extreme Scale
> Collaboration (rdesc.org) use case
> (http://www.w3.org/TR/dwbp-ucr/#UC-RDESC) doesn't go into hellish detail
> about how difficult it was extracting metadata from over 770K datasets to
> make them searchable on the web.
>
> From a practical standpoint I think there was a point of saturation and a
> cutoff on the number of core requirements we felt we could handle as a
> working group.  While I would have preferred more detail, this is what we
> decided as a working group to go with.  At one point Deirdre provided a
> matrix on candidate expanded requirements; the call was made to go with a
> core group of requirements that were visible across a number of different
> use cases.
>
>> We need to address the issues that get in the way of sharing data online,
>> for both publishers and reusers of data. For data re-users, the primary
>> roadblock that I see is that publishers often put out data that is
>> difficult for them to work with for various reasons.
>
>
> Several in the working group including myself argued this point, but the
> majority of the working group felt BP should be oriented to publishers (I
> believe this was winter 2015).   From what I recall, the consensus came
> when it was agreed that the BP document should be for publishers of the
> data and that any discussion of consumption should be handled in the
> vocabulary documents.
>
> Perhaps there is an opportunity to fortify both vocabulary documents to meet
> some shortcomings on reuse, but there is a question of how the BP document
> refers to the vocabularies to make publishing and reuse more cohesive.  Is
> this a possible F2F topic on the better integration of BP and the
> vocabularies?
>
>
>> Lack of webbiness is one of those reasons. For publishers, the inability
>> to monitor usage is the primary issue. I see the BP document and the DQV as
>> addressing the main issue for data re-users and the DUV as addressing the
>> main issue for publishers. Of course, all the docs are potentially useful
>> for all parties, but I think we need to better address the prime motivators.
>>
> Again, it was previously decided that the BP document is largely data
> publisher focused and that consumption should be handled elsewhere.  What
> aspects of the DQV and DUV aren't handling usage monitoring?  What
> requirements are missing?
>
>
>> My concern with our process is that it hasn't exposed issues like the one
>> about data cleanliness, which I think is fundamental to re-use. I find it
>> concerning that at this point in our work we don't already all have an
>> understanding of what that issue is. (I mentioned it back in January, but
>> somehow we let it go--myself included, so mea culpa as well.) As a
>> developer whose job involves creating analysis tools for data on the web, I
>> can tell you that dirty data is a key source of frustration for people like
>> me. By cleanliness, I mean the extent to which I can use data without
>> having to alter it to make it amenable to processing. When making a
>> visualization from existing data, the first step is to clean it--to find
>> missing values, inconsistencies, and values coded in misleading ways, such
>> as 99 instead of "not available" or zero for "no response" that, when
>> plotted, will make nonsense of the visualization. I am much, much more
>> likely to re-use a dataset that is reasonably clean than one that is not,
>> because it simply requires much less work. A clean dataset encourages me to
>> trust the publisher, giving me confidence that I won't end up working with
>> a messy dataset that I thought would be a clean one.
>>
>>
> Perhaps this is a follow on working group activity or could be handled by
> one or more "devil in the details" data community groups?  From the
> discovery perspective projects like RDESC have different kinds of needs.
>
>
>> Use cases don't reveal these sorts of issues very well, because they are
>> about determining requirements, what is needed rather than how it should
>> best be provided. Fortunately, the members of this group know enough about
>> publishing on the web to have extracted some useful ideas despite the
>> approach, so we have something that is a little helpful. I think we could
>> be even more helpful by considering how we might more directly address what
>> makes data sharing on the web difficult for our readers. Maybe we should be
>> meeting with developer groups, having a more formal presence at relevant
>> conferences, or doing surveys. What other things could we be doing to make
>> the BP document more insightful?
>>
>
> Because our working group is within the Data Activity, I have always felt
> open data is more about broadening the appeal of data on the web thinking
> beyond the linked data (semantic web) community.  Imo BP would be a success
> if it provided guidance to data publishers not in the linked data community
> who could adopt similar principles.
>
> I don't disagree with any of the valid points on data published and
> re-used on the web; I am wondering how some of these points could be
> tackled in the vocabularies.
>
>
>
>
>> -Annette
>>
>> On Aug 15, 2015, at 10:45 AM, Erik Wilde <dret@berkeley.edu> wrote:
>>
>> > portals like this exist in many countries now, and many of them are
>> > rather un-webby. giving them a focused and reasonable checklist of things
>> > they should do to become more webby, and explaining why, would be great
>> > guidance coming from the W3C and being applicable to many of the
>> > e-government activities going on nowadays.
>> >
>> > btw, that's exactly how web data started: for document-driven domains,
>> > telling people to publish their data in RDF is non-sensical. but that's
>> > what linked data tells them to do. web data is an attempt to encourage
>> > people to be webby, without prescribing the metamodel they have to use.
>> > it's about how to be webby without having to be semwebby.
>>
>>
>>
>
>


-- 
Bernadette Farias Lóscio
Centro de Informática
Universidade Federal de Pernambuco - UFPE, Brazil
----------------------------------------------------------------------------
Received on Thursday, 20 August 2015 22:24:49 UTC