data quality from Paola Di Maio on 2010-04-19 (semantic-web@w3.org from April 2010)

From: Paola Di Maio <paola.dimaio@gmail.com>
Date: Mon, 19 Apr 2010 09:51:14 +0000
To: adasal <adam.saltiel@gmail.com>, uk-government-data-developers@googlegroups.com
Cc: Semantic Web <semantic-web@w3.org>
Message-ID: <i2o4a4804721004190251kc8d5923ej69a3a3a7af25bbb8@mail.gmail.com>
Something else I wanted to add but forgot, as it was a late post:


One of the issues coming up in relation to the discussion below is the
quality of data
(which came up on the gov data list a while back, hence the cc).

A question then is: why (in some cases) is the data 'not fit for purpose'?

Again, several possible hypotheses may need to be tested in each case.

Is the data inconsistent because the real world is inconsistent (the world
seems to hang together even when it does not make sense to us,
while data models don't)? In that case maybe there is not much that we can
do, other than continue to attempt to create plausible
models of the world.

Is the data of any use before it is opened and RDFized? Or does something
happen in the RDFization process?


Let's not forget that to obtain meaningful outputs from databases, a lot of work
needs to go into them: I am thinking of normalisation of schemas,
but also of data cleaning, which accounts for the majority of the effort in data
mining.
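
As a rough illustration of the kind of cleaning work meant here, a minimal
Python sketch follows. The rows, the field names (council, spend_gbp, date)
and the assumption that slash-separated dates are day/month/year are all
invented for the example and do not come from any real dataset.

```python
import re

# Hypothetical raw rows, invented for illustration only.
RAW_ROWS = [
    {"council": " Leeds ", "spend_gbp": "1,200.50", "date": "2010-04-01"},
    {"council": "leeds", "spend_gbp": "n/a", "date": "01/04/2010"},
    {"council": "LEEDS", "spend_gbp": "300", "date": "2010-04-02"},
]


def clean_row(row):
    """Return (cleaned_row, problems) for one raw record."""
    problems = []
    # Normalise whitespace and capitalisation of the name.
    cleaned = {"council": row["council"].strip().title()}

    # Amounts: drop thousands separators, record anything non-numeric.
    amount = row["spend_gbp"].replace(",", "").strip()
    try:
        cleaned["spend_gbp"] = float(amount)
    except ValueError:
        cleaned["spend_gbp"] = None
        problems.append("spend_gbp is not a number: %r" % row["spend_gbp"])

    # Dates: rewrite dd/mm/yyyy as yyyy-mm-dd (assumed day-first).
    date = row["date"].strip()
    match = re.fullmatch(r"(\d{2})/(\d{2})/(\d{4})", date)
    if match:
        day, month, year = match.groups()
        date = "%s-%s-%s" % (year, month, day)
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", date):
        problems.append("unrecognised date: %r" % row["date"])
    cleaned["date"] = date
    return cleaned, problems


def clean_all(rows):
    """Clean every row and collect all problems in one list."""
    results = [clean_row(r) for r in rows]
    cleaned = [c for c, _ in results]
    problems = [p for _, ps in results for p in ps]
    return cleaned, problems


if __name__ == "__main__":
    cleaned, problems = clean_all(RAW_ROWS)
    for row in cleaned:
        print(row)
    for problem in problems:
        print("PROBLEM:", problem)
```

Keeping the list of problems, rather than silently dropping bad values,
is a deliberate choice in this sketch: it leaves a record of what was
wrong with the data.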

I don't think the fact that data is expressed in RDF automatically makes
it good.
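
A small sketch of why that might be so, under stated assumptions: a naive
conversion that mints one subject per raw string carries every
inconsistency straight into the triples. The example.org namespace, the
rows and the function names are hypothetical, and the output is only
N-Triples-like text, not checked by any RDF library.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # placeholder namespace, not a real vocabulary

# Hypothetical raw rows, invented for illustration only.
ROWS = [
    {"council": " Leeds ", "spend_gbp": "1,200.50"},
    {"council": "leeds", "spend_gbp": "n/a"},
    {"council": "LEEDS", "spend_gbp": "300"},
]


def to_triples(rows):
    """Naively turn each row into one (subject, predicate, object) triple."""
    triples = []
    for row in rows:
        # The subject is built from the raw, uncleaned name string.
        subject = "<%scouncil/%s>" % (BASE, quote(row["council"]))
        predicate = "<%sspend_gbp>" % BASE
        obj = '"%s"' % row["spend_gbp"]  # plain literal, still uncleaned
        triples.append((subject, predicate, obj))
    return triples


def distinct_subjects(triples):
    """Return the sorted set of subject identifiers in the triples."""
    return sorted({s for s, _, _ in triples})


if __name__ == "__main__":
    triples = to_triples(ROWS)
    for s, p, o in triples:
        print(s, p, o, ".")
    print(len(distinct_subjects(triples)), "subjects for what is one council")
```

The three spellings of the same council end up as three different
subjects, and the "n/a" value survives as a literal, so the quality
problems are still there after the conversion.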


Again, a good dig through a significant set of examples of when 'data is not
fit for purpose' could yield some clues as to what kind of work needs to be
done.

So when something doesn't work, I would be inclined not just to throw it away,
but to study it systematically.


After all, most of what we know in medicine has come from dissecting corpses.


PDM






On Sun, Apr 18, 2010 at 11:16 PM, Paola Di Maio <paola.dimaio@gmail.com> wrote:

> In this thread, and the parallel ones, I see different problem spaces;
> it's a complex issue that should be broken down
>
> one is the query composition
> another is the availability of data
> and then the ease of use/utility of the tools
> (probably more)
>
> then there are some conflicts: for example, on the one hand the W3C
> produces standards (RDF, OWL); on the other hand there are
> the tools and platforms that implement them without necessarily making
> the user tasks intuitive enough
>
> having had random conversations with platform developers, it looks like they want
> to monetize their work, and are not in a hurry
> to achieve any results until their finances are secured
>
> perhaps the W3C could act more as 'the customer', and promote the adoption
> of usability standards alongside technical ones (an argument that I
> occasionally try to make)
>
> on the query composition front, what about a tool that would facilitate
> the generation of RDF data
> when it's not available?
>
> assuming Danny eventually works out the optimal query, for example to
> include specific data in relation to his side of
> the valley, where humidity, wind, sun exposure, soil composition and other
> local properties make up a microclimate,
> shouldn't there be a place (any place) where this data could be entered so that the
> query can be performed?
>
> (I still think that for some answers the web may not be the best place,
> but let's think hypothetically)
>
> the nearest thing I have seen to the SW was a demo of
> a semantic wiki (Denny Vrandečić):
> one could enter new data on the fly, and the table would update. I thought,
> that's cool, I could probably work with this
> (I may have to go through the examples a few times, but it looked doable
> to me when I saw the demo)
>
> surely query manipulation can be made foolproof; after all, forms were
> invented for that purpose, if I remember (an interface that would allow
> adding more fields to the query)
>
> last I heard of Semantic MediaWiki, 'there were issues'
>
> I would have thought that's a good place to start. Does anyone know what happened,
> and whether there are any test implementations or tutorials?
> What can be so wrong with it?
>
> Once tasks are defined, the data is reliable and good enough, datasets can
> be added on the fly as needed, the tools are
> straightforward and the tasks (say querying and manipulating queries) are
> made more intuitive, then I am sure it's all about setting up
> good enough pilot studies from different fields of application with, for
> each, enough people and community involvement
>
>
> since everybody is already more or less working on different aspects of the
> above, I am sure that some magic
> can be done simply with a bit more coordination of the different efforts
>
>
> the cost/benefit issue is also complex,
> depending what is calculated as cost and what as benefit, as there are
> different classes of both,
>
> IMHO, to society at large and to the public pocket, the last ten years of
> publicly funded research have been a relatively quantifiable cost
> (one can work out some ballpark figure by looking at the SW research
> expenditure, but I am afraid to do it).
> Among the benefits have been lots of PhDs, salaries, some careers, some new
> knowledge and innovation,
> but
> some (including myself) argue
> that visible 'public' benefits are not (yet) adequate to the public
> costs, which remain IMHO not fully justified
>
> In my analysis this has turned out to be a problem with our research
> industry (a very generalised statement), where research expenditure
> is often in a grey policy area, with no clear enough demarcation of what public
> benefits should derive from research, which is another
> can of worms altogether
>
> to an average organisation that is confronted with the option to invest in
> SW technologies today, it may just be too early: unquantifiable costs and
> risks, but also limited business/revenue models etc
>
> (how is giving my website users some SW functionality going to provide
> my customers with more value?)
>
> I think I have heard of some benefits being reaped in the non-public
> domain, but because of that, we don't know for sure
> what happens behind firewalls
>
> Assuming some cohesion of purpose can be arranged, and that research can
> provide a wide enough range of more real-world, well-defined
> pilot schemes (where the cost/benefit analysis of each pilot project is clear
> upfront, and utility metrics are adopted, for example)
> with a sufficiently healthy stakeholder base not too easily alienated, I am
> sure it would be possible to make at least some sense
> of the work done so far
>
> anything anywhere near the direction above can probably only be achieved
> by a community, which it looks like is trying to pull itself
> together here?
>
> :-)
>
>
> PDM
>
> On Sun, Apr 18, 2010 at 10:22 PM, adasal <adam.saltiel@gmail.com> wrote:
>
>> Agriculture oriented data spaces (ontology and instance data)
>>>
>> How could that ever be automatic?
>>
>> Agriculture oriented data spaces (ontology and instance data)
>>>
>> Cannot anticipate every possible query, or even broad area of interest, in
>> DBpedia.
>> There must be an impulse to make a query of some sort. The issue is how
>> complex that query must be.
>> Isn't the implicit question why cannot some small query be enough to draw
>> out the information I want?
>> Here the query terms should be enough to form a coherent query. In this
>> example they should translate into a sparql query. But that is not enough,
>> because DBpedia needs a schema and some instance data too. erm.
>> Or perhaps it could be semi-automatic?
>> Imagine that there is a repository with sample kinds of data in it. I
>> think this would be easy to use.
>> I want to build up a query about tomato seeds, planting, region, time of
>> year. So some general data is classified along those lines. That would be
>> combined into a schema. Maybe some of it would be a subset of other schemas,
>> so in my making the choice further useful suggestions could be made. I would
>> then be asked to refine the parameters of the query by actual region, etc.
>> I am assuming that interested parties would make available basic meta data
>> sets with human understandable sample data.
>>
>> Am I making any sort of sensible suggestion here? Is this different to
>> what already exists as available triples? I am unsure. There is something
>> circular here.
>>
>> Even so we are still left with that data that has not been classified
>> because there is no interested party to do so, or because the type of
>> classification is new, complex or transient.
>>
>> Adam
>>
>> On 18 April 2010 21:56, Danny Ayers <danny.ayers@gmail.com> wrote:
>>
>>> Thanks Kingsley
>>>
>>> still not automatic though, is it?
>>>
>>> On 18 April 2010 22:38, Kingsley Idehen <kidehen@openlinksw.com> wrote:
>>> > Danny Ayers wrote:
>>> >>
>>> >> Kingsley, how do I find out when to plant tomatoes here?
>>> >>
>>> >
>>> > And you find the answer to that in Wikipedia via
>>> > <http://en.wikipedia.org/wiki/Tomato>? Of course not.
>>> >
>>> > Re. DBpedia, if you have a Agriculture oriented data spaces (ontology and
>>> > instance data) that references DBpedia (via linkbase) then you will have a
>>> > better chance of an answer since we would have temporal properties and
>>> > associated values in the Linked Data Space (one that we can mesh with
>>> > DBpedia even via SPARQL).
>>> >
>>> > Kingsley
>>> >>
>>> >> On 17 April 2010 19:36, Kingsley Idehen <kidehen@openlinksw.com> wrote:
>>> >>
>>> >>>
>>> >>> Danny Ayers wrote:
>>> >>>
>>> >>>>
>>> >>>> On 16 April 2010 19:29, greg masley <roxymuzick@yahoo.com> wrote:
>>> >>>>
>>> >>>>
>>> >>>>>
>>> >>>>> What I want to know is does anybody have a method yet to successfully
>>> >>>>> extract data from Wikipedia using dbpedia? If so please email the
>>> >>>>> procedure to greg@masleyassociates.com
>>> >>>>>
>>> >>>>>
>>> >>>>
>>> >>>> That is an easy one, the URIs are similar - you can get the pointer
>>> >>>> from db and get into wikipedia. Then you do your stuff.
>>> >>>>
>>> >>>> I'll let Kingsley explain.
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>
>>> >>> Greg,
>>> >>>
>>> >>> Please add some clarity to your quest.
>>> >>>
>>> >>> DBpedia the project is comprised of:
>>> >>>
>>> >>> 1. Extractors for converting Wikipedia content into Structured Data
>>> >>> represented in a variety of RDF based data representation formats
>>> >>> 2. Live instance with the extracts from #1 loaded into a DBMS that
>>> >>> exposes a SPARQL endpoint (which lets you query over the wire using
>>> >>> the SPARQL query language).
>>> >>>
>>> >>> There is a little more, but I need additional clarification from you.
>>> >>>
>>> >>>
>>> >>> --
>>> >>>
>>> >>> Regards,
>>> >>>
>>> >>> Kingsley Idehen
>>> >>> President & CEO, OpenLink Software
>>> >>> Web: http://www.openlinksw.com
>>> >>> Weblog: http://www.openlinksw.com/blog/~kidehen
>>> >>> Twitter/Identi.ca: kidehen
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> http://danny.ayers.name
>>>
>>>
>>
>
>
> --
> Paola Di Maio
> **************************************************
> “Logic will get you from A to B. Imagination will take you everywhere.”
> Albert Einstein
> **************************************************
>
>


-- 
Paola Di Maio
**************************************************
“Logic will get you from A to B. Imagination will take you everywhere.”
Albert Einstein
**************************************************
Received on Monday, 19 April 2010 09:51:48 UTC