Re: Enrichment document

Wagner, everyone,

I've been reading through the Enrichment doc this morning and have some 
comments/suggestions aimed at making use of the obvious expertise at 
InWeb within our current framework.

The sections on categorisation and segmentation seem, in my reading, to 
be about specialisations of the more general topic of 'providing 
metadata.' Would it be possible to look at the BPs in the metadata 
section (http://www.w3.org/TR/dwbp/#metadata) and add to them?

Whether we're talking about machine readable formats or human readable 
stuff, metadata is important of course and, yes, machines are getting 
better at extracting useful information from all kinds of sources.

I find the section on Imputation particularly interesting. The details 
of the techniques for doing this are outwith the scope of this WG, but 
the fact that imputation techniques have been used is, I think, 
something to record using the DQV?

Entity Recognition, Data Disambiguation and Fusion - I think we could 
derive discrete BPs about these, but I'd phrase it as "re-use other 
people's identifiers," e.g. if you're providing data about a chemical 
compound, use the same ID as everyone else (actually there are several 
competing ID sets for chemical compounds). That helps disambiguation and 
fusion.
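
To illustrate why shared identifiers help fusion, here is a minimal 
sketch. Everything in it is an assumption for illustration: the two 
datasets, the fuse() helper and the identifier "cmpd:0001" are made up 
and do not come from any real registry or from the enrichment doc.

```python
# Two hypothetical publishers describing the same compound.
# "cmpd:0001" is an invented identifier, not a real registry ID.
melting_points = {"cmpd:0001": {"melting_point_c": 0.0}}
densities = {"cmpd:0001": {"density_g_cm3": 1.0}}


def fuse(*datasets):
    """Merge records from several datasets that use the same identifier."""
    fused = {}
    for dataset in datasets:
        for identifier, record in dataset.items():
            # Records sharing an identifier collapse into one combined record.
            fused.setdefault(identifier, {}).update(record)
    return fused


combined = fuse(melting_points, densities)
print(combined)
```

Because both publishers used the same key, the records merge with no 
matching or disambiguation step. Had each used its own local ID, the 
result would hold two unrelated entries for the same compound.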

NB. Freebase is being shut down by Google. I'll forward a separate mail 
that got into my inbox this week.

Another BP (I don't think we've covered this yet) is 'be consistent' in 
your naming. So, if you refer to a country by its name, use the same 
spelling and capitalisation throughout: Brazil, Brasil, brasil, BR etc. 
are all different.
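
As a rough sketch of what 'be consistent' could look like in practice, 
a publisher might map every variant to one canonical label before 
publishing. The alias table and the normalise_country() function below 
are hypothetical, purely for illustration.

```python
# Hypothetical alias table: every known variant maps to one canonical label.
CANONICAL_COUNTRY = {
    "brazil": "Brazil",
    "brasil": "Brazil",
    "br": "Brazil",
}


def normalise_country(raw: str) -> str:
    """Return the canonical label for a country name, or raise if unknown."""
    key = raw.strip().casefold()  # ignore surrounding whitespace and case
    try:
        return CANONICAL_COUNTRY[key]
    except KeyError:
        # Fail loudly rather than let a new variant slip into the data.
        raise ValueError(f"unrecognised country label: {raw!r}") from None


rows = ["Brazil", "Brasil", "brasil", "BR"]
distinct = {normalise_country(r) for r in rows}
print(distinct)
```

Raising on unknown labels is a deliberate choice in this sketch: a new 
variant then gets reviewed and added to the table instead of quietly 
becoming a fifth spelling.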

Might that sort of approach work? i.e. enriching the current BPs (pardon 
the pun).

Phil.





On 25/06/2015 18:41, Annette Greiner wrote:
> +1 for refining this. -1 for keeping it focused on text.
> I think you need to consider the purpose. If the purpose is to define data enrichment in a way that helps people understand it and publish better data, then I think you must include other types of data beyond textual. There is absolutely no question whether the practices generalize to other types of data. In the sciences, enrichment of image data is a huge part of visualization practice (segmentation, adding analyses). If you need a use case, see the one on mass spectrometry imaging. I can offer many others. If the BP document seems particularly focused on textual data, I think that is a problem. (Note that by “textual data” I don’t mean any data that can be represented by ascii text. What this piece tends to wander into is the use of text as a corpus for natural language processing.) I think this piece can be helpful, but it should be less about the details of machine learning and more about the concepts behind data enrichment that will help people publish better data.
> -Annette
>
> --
> Annette Greiner
> NERSC Data and Analytics Services
> Lawrence Berkeley National Laboratory
> 510-495-2935
>
> On Jun 24, 2015, at 2:58 PM, Wagner Meira Jr. <meira@dcc.ufmg.br> wrote:
>
>> Hi all,
>>
>> We, from InWeb, agree with the various observations and believe that
>> refining it is the way to achieve something relevant wrt data enrichment
>> and DWBP.
>>
>> First of all, it was never our intention to disrupt the usual process for
>> generating and publishing W3C documents. We presented a very first
>> version of the document in the last F2F meeting and since then
>> have been extending it. We were not sure about the post presentation
>> process and are sorry for the confusion.
>>
>> Second, we agree that it may not be a good idea to use "big data",
>> even as a background motivation for the document, as it is. The
>> practices reported are applicable to any kind and volume of data.
>>
>> Third, extending the discussion beyond textual data, although
>> relevant, may represent a much broader scope. We suggest
>> to first focus on textual data and later evaluate whether the practices
>> generalize to other types of data.
>>
>> Fourth, our rationale in building the document was not really to
>> provide an exhaustive overview of the methods and techniques for
>> data enrichment, but what criteria any such method should satisfy.
>> We understand that this strategy is more compatible with the current
>> DWBP draft.
>>
>> Finally, we have discussed quite a lot about whether data enrichment
>> makes sense in the DWBP draft, and so far, the conclusion has been
>> that it fits there, but such an outcome may change as we deepen the
>> discussion in the group. We suggest that you take the current draft just
>> as a suggestion on how to approach data enrichment in the context of
>> DWBP.
>>
>> Thus, how do you guys want to evolve on this issue?
>>
>> Best,
>>
>> Wagner
>>
>> On 24-06-2015 11:08, yaso@nic.br wrote:
>>> Hi everyone,
>>>
>>> I agree with both Annette and Antoine, but still think it is an important
>>> issue to be discussed by the group. I understand your fears of returning
>>> to our endless discussion of scope, but there are specific points,
>>> like the one that Antoine raised on enriching linked data, for example,
>>> that are very useful for us considering the scenario of the web nowadays.
>>>
>>> Keep in mind that our scope is already well defined:
>>>
>>> "This document is concerned solely with best practices that:
>>>
>>> 1. are specifically relevant to data published on the Web;
>>> 2. encourage publication or re-use of data on the Web;
>>> 3. can be tested by machines, humans or a combination of the two.
>>>
>>> As noted above, whether a best practice has or has not been followed
>>> should be judged against the intended outcome, not the possible approach
>>> to implementation which is offered as guidance.
>>>
>>> A best practice is always subject to improvement as we learn and evolve
>>> the Web together."
>>>
>>> Maybe this note on data enrichment [1] can turn into a use case. I
>>> think that reviewing it with an eye for the challenges that we might
>>> raise from the InWeb work can be a good idea, since we already went thru
>>> this process for the other Best Practices.
>>>
>>>
>>>
>>> yaso
>>>
>>>
>>>
>>> [1] https://w3c.github.io/dwbp/enrichment.html
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 06/23/2015 05:17 PM, Antoine Isaac wrote:
>>>> Hi,
>>>>
>>>> I fully support Annette's point about the fear of including
>>>> recommendations about everything, even when not really specific to the web.
>>>>
>>>> As far as the content of the document is concerned, I must confess I've
>>>> never looked at it. And even though I've missed a couple of calls
>>>> lately, I don't remember any formal request for review ever having been
>>>> made...
>>>>
>>>> It's a pity, because the document may contain some very good stuff. But
>>>> it may also be very shaky on other points. For instance, I have the
>>>> feeling it ignores many things done for enriching linked data, and work
>>>> on evaluating the results. It's actually confusing to find that a fairly
>>>> long document on data enrichment would only have three occurrences of
>>>> 'quality' in it. It will probably be good to discuss this also in the
>>>> coming calls.
>>>>
>>>> Finally, there are quite big typos, even in the header. For example,
>>>> "Desirible".
>>>>
>>>> Best,
>>>>
>>>> Antoine
>>>>
>>>> On 6/23/15 8:52 PM, Annette Greiner wrote:
>>>>> Hi Steve,
>>>>> I think you're right that "big data" gets used to mean just plain
>>>>> data. If the distinction between the meaning of "big data" and "data"
>>>>> is becoming an academic one, isn't that even more reason to avoid
>>>>> trying to make the distinction in our own work? Let's just call data
>>>>> data. If we want to talk about the Vs, we can use the V words.  I
>>>>> actually work in academia, in the data science program at Berkeley,
>>>>> and the consensus even there about the term is that it is not very
>>>>> helpful. It is perceived as shallow and attention-seeking.
>>>>>
>>>>> Re the scoping issue, you misunderstand me, and looking back at the
>>>>> placement of my last sentence, I can see why. (Sorry.) I don't think
>>>>> that addressing the full meaning of data enrichment would throw this
>>>>> piece out of scope. If anything, it would bring it back in. I think
>>>>> its current failure to address the broader meaning of enrichment is a
>>>>> serious problem, separate from the scoping issue. In addition, I
>>>>> think that we should always ask ourselves whether what we are writing
>>>>> is relevant in particular to data on the web. I don't think there is
>>>>> anything particularly web-based about machine learning.
>>>>>
>>>>> I worry that we are slowly trying to write something about every
>>>>> aspect of the data lifecycle. It's difficult enough for me to accept
>>>>> the extra BPs about how to create a vocabulary, and I worry about the
>>>>> data preservation BPs on similar grounds. Machine learning strikes me
>>>>> as further afield than either of those. Should we also write notes
>>>>> about hadoop, database administration, data visualization, and survey
>>>>> design? If we define our scope this broadly, what would we rule out?
>>>>> -Annette
>>>>>
>>>>> On Jun 23, 2015, at 10:32 AM, Steven Adler <adler1@us.ibm.com> wrote:
>>>>>
>>>>>> Annette,
>>>>>>
>>>>>> At first I agreed but then I have to say that I don't...  because
>>>>>> "Big Data" is over-used and somewhat amorphous it is becoming a term
>>>>>> used by everyone for much of what we might also narrowly define as
>>>>>> "just Data."  ie, the distinction is increasingly academic.
>>>>>>
>>>>>> Also, I think we did discuss in the past that unstructured text,
>>>>>> image, audio, and other multi-media types are also data on the web
>>>>>> that is published in open formats.
>>>>>>
>>>>>> So really, I don't see the harm in the inclusion on the basis of
>>>>>> those objections because I hope that additional data types are not
>>>>>> tangential to our standards.
>>>>>>
>>>>>> Best Regards,
>>>>>>
>>>>>> Steve
>>>>>>
>>>>>> Motto: "Do First, Think, Do it Again"
>>>>>>
>>>>>> From: Annette Greiner <amgreiner@lbl.gov>
>>>>>> To: Phil Archer <phila@w3.org>
>>>>>> Cc: Public DWBP WG <public-dwbp-wg@w3.org>, Bernadette Farias Lóscio
>>>>>> <bfl@cin.ufpe.br>, Caroline Burle <cburle@nic.br>, Newton Calegari
>>>>>> <newton@nic.br>, "glpappa@dcc.ufmg.br" <glpappa@dcc.ufmg.br>
>>>>>> Date: 06/23/2015 12:30 PM
>>>>>> Subject: Re: Enrichment document
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hm, I had never seen that enrichment document and didn't even realize
>>>>>> it was in development. It gives a nice review of machine learning
>>>>>> techniques with a focus on text analysis. Very interesting stuff, but
>>>>>> I have a few concerns. My primary concern is that it defines data
>>>>>> enrichment much too narrowly. Data enrichment is helpful for all
>>>>>> kinds of data, not just "big data" (a term I would encourage us to
>>>>>> avoid, as it is overused and highly ambiguous). It is useful in image
>>>>>> data as well as text, and in structured as well as unstructured data.
>>>>>> I think we need to beware of putting out content that is tangential
>>>>>> to the subject of publishing data on the web.
>>>>>> -Annette
>>>>>>
>>>>>> Sent from a keyboard-challenged device
>>>>>>
>>>>>>> On Jun 23, 2015, at 7:00 AM, Phil Archer <phila@w3.org> wrote:
>>>>>>>
>>>>>>> I'm putting the DWBP doc through pubrules and, forgive me, I've just
>>>>>>> noticed that it links to the enrichment document.
>>>>>>>
>>>>>>> For those unfamiliar with this, see
>>>>>>> http://w3c.github.io/dwbp/enrichment.html
>>>>>>>
>>>>>>> The WG may well decide to publish this, and it certainly deserves
>>>>>>> attention. However, we can't just include
>>>>>>> it as a separate Note without going through the usual process
>>>>>>> followed by other documents in the WG.
>>>>>>>
>>>>>>> For this week's publication I have therefore removed "... according
>>>>>>> to the suggestions described in Data Enrichment Technical Note" from
>>>>>>> the BP doc and the link to the enrichment doc.
>>>>>>>
>>>>>>> Let's put this on the agenda for a near future call.
>>>>>>>
>>>>>>> Phil.
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>>
>>>>>>> Phil Archer
>>>>>>> W3C Data Activity Lead
>>>>>>> http://www.w3.org/2013/data/
>>>>>>>
>>>>>>> http://philarcher.org
>>>>>>> +44 (0)7887 767755
>>>>>>> @philarcher1
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>
>>
>
>
>

-- 


Phil Archer
W3C Data Activity Lead
http://www.w3.org/2013/data/

http://philarcher.org
+44 (0)7887 767755
@philarcher1

Received on Friday, 7 August 2015 12:59:46 UTC