
Re: [uk-government-data-developers] Nice Data Cleansing Tool Demo

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Sun, 28 Mar 2010 16:03:09 -0400
Message-ID: <4BAFB5FD.5060407@openlinksw.com>
To: Leigh Dodds <leigh.dodds@talis.com>
CC: "public-lod@w3.org" <public-lod@w3.org>
Leigh Dodds wrote:
> Hi,
>
> On Sunday, March 28, 2010, Kingsley Idehen <kidehen@openlinksw.com> wrote:
>   
>> All,
>>
>> A very nice data cleansing tool from David and Co. at Freebase.
>>     
>
> Yes, it looks very nice. Am looking forward to working with it.
>
>   
>> CSVs are clearly the dominant data format in the structured open data
>> realm. This tool deals with ETL very well. Of course, for those who
>> appreciate OWL, a lot of what's demonstrated in this demo is also
>> achievable via "context rules".
>>     
>
> Can you (or others) expand on that?
>
> Much of the power in the demo seemed to me to be in the facetting,
> scripting of cleansing, analysis of value spaces, etc.
>
> I'd be interested to know how OWL could be applied here.
>
> Cheers,
>
> L.
>
>   
Leigh,

OWL comes in after the data has been loaded into the Quad Store (clean
or dirty). Note that this demo is about cleansing Literal values. Once
data object identifiers are in play, you aren't confined to joining data
via Literal values (a key difference between the RDBMS realm and the RDF
and other Graph Model realms).

1. Co-reference - via owl:sameAs assertions
2. Dirty Data - use of procedure functions and inverse functional 
properties
3. Units of Measurement - leveraging HTTP's ability to identify the
locale of user agents, combined with TCN QoS algorithms (which can be
part of SPARQL, as we've done in Virtuoso)
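
To make item 1 concrete, here is a minimal Python sketch of how
owl:sameAs assertions can be collapsed into co-reference sets. It is an
illustration only, not how Virtuoso implements it: the triples are plain
tuples, and the function name same_as_clusters and the sample
identifiers are hypothetical.

```python
from collections import defaultdict

OWL_SAME_AS = "owl:sameAs"


def same_as_clusters(triples):
    """Group identifiers linked by owl:sameAs into co-reference sets.

    `triples` is an iterable of (subject, predicate, object) strings.
    Returns a dict mapping a canonical identifier (the lexicographically
    smallest member) to the set of all identifiers in that cluster.
    """
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller identifier as root so the result is stable.
            parent[max(ra, rb)] = min(ra, rb)

    for s, p, o in triples:
        if p == OWL_SAME_AS:
            union(s, o)

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return dict(clusters)


if __name__ == "__main__":
    # Hypothetical sample data; only owl:sameAs triples affect the result.
    data = [
        ("ex:timbl", OWL_SAME_AS, "dbpedia:Tim_Berners-Lee"),
        ("dbpedia:Tim_Berners-Lee", OWL_SAME_AS, "fb:timbl"),
        ("ex:timbl", "foaf:name", "Tim Berners-Lee"),
    ]
    for canonical, members in same_as_clusters(data).items():
        print(canonical, sorted(members))
```

Once identifiers are grouped this way, a join can go through any member
of a set rather than through a matching Literal value.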

You can make rules that incorporate all of the above; you can even do so
with SPARQL (plus function/magic predicates) as the Rules Language for
constrained forward-chaining in more extreme cases.
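
As a rough illustration of such a rule, the Python sketch below applies
one forward-chaining step: a function cleans a dirty Literal value, and
subjects that then share the same value for an inverse functional
property are asserted to be owl:sameAs. This is a sketch under
assumptions, not Virtuoso's rule machinery: foaf:mbox is simply chosen
as an example inverse functional property, and normalize_mbox and
infer_same_as are hypothetical names.

```python
from itertools import combinations


def normalize_mbox(value):
    """Stand-in for a cleansing function: trim whitespace, lowercase."""
    return value.strip().lower()


def infer_same_as(triples, ifp="foaf:mbox", normalize=normalize_mbox):
    """Derive owl:sameAs triples from a shared inverse functional property.

    Rule: if two subjects have the same (normalized) value for `ifp`,
    they denote the same entity.
    """
    subjects_by_value = {}
    for s, p, o in triples:
        if p == ifp:
            subjects_by_value.setdefault(normalize(o), set()).add(s)

    inferred = set()
    for subjects in subjects_by_value.values():
        # Emit one triple per unordered pair, in sorted order.
        for a, b in combinations(sorted(subjects), 2):
            inferred.add((a, "owl:sameAs", b))
    return inferred


if __name__ == "__main__":
    # Hypothetical dirty rows, e.g. as loaded from a CSV file.
    rows = [
        ("ex:p1", "foaf:mbox", " mailto:TimBL@w3.org"),
        ("ex:p2", "foaf:mbox", "mailto:timbl@w3.org "),
        ("ex:p3", "foaf:mbox", "mailto:other@example.org"),
    ]
    print(infer_same_as(rows))
```

The dirty values stay in the store untouched; the rule only adds
assertions on top of them.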

I can load a dirty CSV file into Virtuoso, and leverage OWL, SPARQL, and
Function/Magic Predicates en route to handling:

1. Semantic Disparity
2. Structural Disparity
3. Entity Co-References.

Naturally, someone could, and eventually would, write a data 
reconciliation tool that looked like Microsoft Access and basically 
delivered on the above, while simply riding on Virtuoso engines 
(ditto any other Quad Store with similar capabilities). It's all going to 
happen quicker than most will expect, especially now that OData is part 
of the mix re. granular structured linked data, and the universal nature 
of the Entity-Attribute-Value model is getting clearer to broader 
audiences by the second :-)

Links:

1. http://bit.ly/csFCqC -- Data Reconciliation using TimBL as subject 
(note the co-reference and indirect co-reference tab data, which offers a 
teaser).

-- 

Regards,

Kingsley Idehen	      
President & CEO 
OpenLink Software     
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen 
Received on Sunday, 28 March 2010 20:03:38 UTC
