Re: Observations about facts in genomics

Well, I don't quite know what to say. I feel a bit like a designer of cheap, workable, everyday town cars, and I have a customer who wants a Ferrari. 

I agree, Jeremy, you have a hard problem here. It sounds like you need statistical or probabilistic methods to keep track of these small likelihoods of your data being wrong in various interconnected ways. And you might need to have some quite elaborate distinctions between hypotheses in order to apply them properly, and keep your hypothetical information separate from the actual raw data. All of which goes beyond anything that poor little RDF was designed to be able to handle. It still might be worth using RDF syntaxes and data structures, of course, but you should expect to be quite often operating outside the normative model that RDF reasoners are supposed to rely on.

From my current location on the RDF WG, I see your problem as way out in left field, and the 'linked data' people using things like JSON and making mashups of geo/social data as way out in right field, with the need to respect legacy implementations keeping me firmly in the middle, in some places underground. I would love to keep everyone happy with a single spec, but I don't think it's going to be possible.

Pat

PS. I don't think it is as bad as quodlibet. Reasoners usually flag inconsistencies as soon as they find them, and if you have any kind of provenance machinery in place you can unwind any bad conclusions you might have made too greedily. It's really just dynamic garbage collection, actually. If you are computing probabilities, this is the limiting/boundary case of what you will already be doing, when pr(P)=0.
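
To make the "unwinding" idea concrete, here is a minimal sketch in Python. It is not any real reasoner's API: the ProvenanceStore class, its method names, and the fact strings are all hypothetical, and it assumes the simplest possible provenance model, in which each derived fact records the single set of premises it was derived from. Retracting a fact then removes everything that transitively depended on it, which is the garbage-collection flavour mentioned above.

```python
class ProvenanceStore:
    """Toy fact store (hypothetical): each fact remembers its premises."""

    def __init__(self):
        # fact -> frozenset of premises; empty set means a raw assertion
        self.support = {}

    def assert_raw(self, fact):
        """Record a fact taken directly from the data."""
        self.support[fact] = frozenset()

    def derive(self, fact, premises):
        """Record a conclusion together with the facts it was drawn from."""
        missing = [p for p in premises if p not in self.support]
        if missing:
            raise KeyError(f"unknown premises: {missing}")
        self.support[fact] = frozenset(premises)

    def retract(self, fact):
        """Remove a fact and everything that transitively depended on it."""
        removed = set()
        stack = [fact]
        while stack:
            current = stack.pop()
            if current not in self.support:
                continue  # already removed, or never present
            del self.support[current]
            removed.add(current)
            # Anything that used `current` as a premise must go too.
            stack.extend(
                f for f, premises in self.support.items() if current in premises
            )
        return removed


store = ProvenanceStore()
store.assert_raw("call: chrM 942 is G")
store.assert_raw("call: chrM 1000 is T")
store.derive("variant at chrM 942", ["call: chrM 942 is G"])
store.derive("report mentions chrM 942", ["variant at chrM 942"])

# The call at 942 turns out to be wrong: unwind what was built on it.
unwound = store.retract("call: chrM 942 is G")
```

After the retraction only the unrelated call at position 1000 is left in the store. A real system would need to handle facts with several independent justifications, which this sketch deliberately ignores.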

On Mar 20, 2013, at 11:04 AM, Jeremy J Carroll wrote:

> Pat Hayes wrote:
> 
> "[RDF] is intended for recording data, and most data is pretty mundane stuff about which there is not a lot of factual disagreement."
> 
> One of the things I am learning about genetic sequencing is that this process, which is meant to tell you about the patient's DNA, is in fact somewhat problematic, resulting in facts which are disputable.
> 
> So, a data file that I am trying to get my head around at the moment contains a line like:
> 
> chrM	942	rs28579222	A	G	.	.	ASP;HD;OTHERKG;RSPOS=942;SAO=0;SF=0;SSR=0;VC=SNV;VP=050000000005000402000100;WGT=1;dbSNPBuildID=125
> 
> 
> So far, I have understood the first five fields as saying that in a particular position in the DNA (the 942nd base in the mitochondrial DNA, aka rs28579222), when one might have expected to see an A, a sample had a G.
> But that last part "a sample had a G" is in fact open to doubt … There is a complex piece of chemistry, physics and computing that guesses that there is a G in that position. It is possible to see some of the less processed data that fed into that guess, and to see levels of confidence that the different algorithms had with the results; but it is not a slam dunk by any means. So, some more skeptical people want to be able to see the 'raw read data' prior to the decision that this is a G. Usually one would expect to see some of the raw read data agree with the G, and some disagree.
> 
> 
> Since this assertion (that this position is a G) is made with a few million similar assertions, all of which have some element of doubt - it would be highly surprising if every single call were correct: yet within the logic of RDF we probably end up asserting the truth of the whole graph … which leads us onto the dangerous path of ex contradictione quodlibet.
> 
> 
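
For readers who want to follow the data line quoted above, here is a minimal Python sketch that splits it into fields. It assumes the line is a tab-separated record in the VCF style, with columns in the conventional order CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO; only the first five are explained in the quoted message, so the names given to the last three are an assumption. The function name parse_variant_line is hypothetical.

```python
def parse_variant_line(line):
    """Split one tab-separated variant record into named fields.

    Column names follow the VCF convention (an assumption; the quoted
    message only explains the first five columns).
    """
    chrom, pos, variant_id, ref, alt, qual, filt, info = (
        line.rstrip("\n").split("\t")[:8]
    )

    # The last column is a ';'-separated list of KEY=VALUE pairs and bare flags.
    info_fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        info_fields[key] = value if sep else True  # bare flag -> True

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,   # the base one might have expected
        "alt": alt,   # the base the sample was called as having
        "qual": qual,
        "filter": filt,
        "info": info_fields,
    }


line = (
    "chrM\t942\trs28579222\tA\tG\t.\t.\t"
    "ASP;HD;OTHERKG;RSPOS=942;SAO=0;SF=0;SSR=0;VC=SNV;"
    "VP=050000000005000402000100;WGT=1;dbSNPBuildID=125"
)
record = parse_variant_line(line)
```

Note that nothing in the parsed record carries the doubt described in the quoted message: the "G" comes out as a plain value, with no trace of the raw reads or confidence levels behind the call.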

------------------------------------------------------------
IHMC                                     (850)434 8903 or (650)494 3973   
40 South Alcaniz St.           (850)202 4416   office
Pensacola                            (850)202 4440   fax
FL 32502                              (850)291 0667   mobile
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes

Received on Thursday, 21 March 2013 17:32:04 UTC