Re: Observations about facts in genomics from Graham Klyne on 2013-03-20 (public-semweb-lifesci@w3.org from March 2013)

From: Graham Klyne <graham.klyne@zoo.ox.ac.uk>
Date: Wed, 20 Mar 2013 20:23:46 +0000
To: Jeremy J Carroll <jjc@syapse.com>
CC: w3c semweb HCLS <public-semweb-lifesci@w3.org>, Pat Hayes <phayes@ihmc.us>
Message-ID: <514A1AD2.1020504@zoo.ox.ac.uk>

Hi Jeremy,

On 20/03/2013 16:04, Jeremy J Carroll wrote:
> One of the things I am learning about genetic sequencing is this process,
> which is meant to tell you about the patient's DNA, is in fact somewhat
> problematic, resulting in facts which are disputable.
>

It gets worse... the association between sequence fragments and genes changes 
over time as knowledge improves, and I understand this happens in ways that 
aren't always reflected in published information.  GMOD/CHADO 
(http://gmod.org/wiki/Introduction_to_Chado) keeps all the concepts very 
separate to allow for this, but the translation to RDF can get very convoluted 
(Al Miles did some work on a mapping, a few years ago).

I also understand that there's emerging research that shows that non-coding 
regions, which were previously thought to be meaningless/irrelevant, do actually 
have relevant roles in the overall genetic machinery (something to do with 
regulation?).

One of the many reasons I'd like RDF to have some flexibility to deal with 
contexts, or differing worldviews, is to allow representation of evolving 
information without having to make explicit all those things that researchers 
sometimes don't bother to make explicit (e.g. genes vs proteins, sequence vs 
gene, etc.).  And then there's all the stuff we don't yet know to make explicit. 
("frame problem", anyone?)

#g
--

On 20/03/2013 16:04, Jeremy J Carroll wrote:
> Pat Hayes wrote:
>
> "[RDF] is intended for recording data, and most data is pretty mundane stuff about which there is not a lot of factual disagreement."
>
> One of the things I am learning about genetic sequencing is this process, which is meant to tell you about the patient's DNA, is in fact somewhat problematic, resulting in facts which are disputable.
>
> So, a data file that I am trying to get my head around at the moment contains a line like:
>
> chrM	942	rs28579222	A	G	.	.	ASP;HD;OTHERKG;RSPOS=942;SAO=0;SF=0;SSR=0;VC=SNV;VP=050000000005000402000100;WGT=1;dbSNPBuildID=125
>
>
> So far, I have understood the first five fields, as saying that in a particular position in the DNA (the 942nd base in the mitochondrial DNA, aka rs28579222), when one might have expected to see an A a sample had a G.
> But that last part "a sample had a G" is in fact open to doubt … There is a complex piece of chemistry, physics and computing that guesses that there is a G in that position. It is possible to see some of the less processed data that fed into that guess, and to see levels of confidence that the different algorithms had with the results; but it is not a slam dunk by any means. So, some more skeptical people want to be able to see the 'raw read data' prior to the decision that this is a G. Usually one would expect to see some of the raw read data agree with the G, and some disagree.
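
For readers unfamiliar with the format, here is a minimal sketch of how the line quoted above might be read. It assumes the line is a tab-separated VCF data line with the eight fixed columns in the usual order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO); the function name and the dictionary layout are hypothetical, chosen only for illustration. Note that nothing in these columns carries the read-level evidence behind the call.

```python
# Sketch only: split one VCF data line into its eight fixed columns.
# parse_vcf_line and the returned dict layout are illustrative assumptions,
# not part of any particular library.

def parse_vcf_line(line):
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, flt, info = fields[:8]

    # INFO is a ';'-separated list of KEY=VALUE pairs and bare flags.
    info_fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        info_fields[key] = value if sep else True

    def missing(value):
        # "." marks a missing value in VCF.
        return None if value == "." else value

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": missing(var_id),
        "ref": ref,
        "alt": alt,
        "qual": missing(qual),
        "filter": missing(flt),
        "info": info_fields,
    }


line = (
    "chrM\t942\trs28579222\tA\tG\t.\t.\t"
    "ASP;HD;OTHERKG;RSPOS=942;SAO=0;SF=0;SSR=0;VC=SNV;"
    "VP=050000000005000402000100;WGT=1;dbSNPBuildID=125"
)
record = parse_vcf_line(line)
print(record["chrom"], record["pos"], record["id"], record["ref"], record["alt"])
```

The QUAL and FILTER columns are both "." here, so this particular line says nothing about how confident the call was.
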
>
>
> Since this assertion (that this position is a G) is made with a few million similar assertions, all of which have some element of doubt - it would be highly surprising if every single call were correct: yet within the logic of RDF we probably end up asserting the truth of the whole graph … which leads us onto the dangerous path of ex contradictione quodlibet
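
To put a rough number on "highly surprising", here is a back-of-envelope sketch. The figures are assumptions made up for illustration and do not come from this thread: 3 million calls, each wrong with probability 1 in 10,000, and errors treated as independent (which real sequencing errors need not be).

```python
import math

# Illustrative assumptions only; neither figure is taken from real data.
n_calls = 3_000_000   # number of asserted base calls in the graph
p_error = 1e-4        # assumed chance that any single call is wrong

# Expected number of wrong calls under these assumptions.
expected_errors = n_calls * p_error

# Probability that every call is correct, assuming independent errors.
# Computed via logs to keep the arithmetic numerically stable.
log_p_all_correct = n_calls * math.log1p(-p_error)
p_all_correct = math.exp(log_p_all_correct)

print(f"expected wrong calls: {expected_errors:.0f}")
print(f"P(all calls correct): {p_all_correct:.3g}")
```

Under those assumptions one would expect some hundreds of wrong calls, and the chance that the whole graph is true is vanishingly small.
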

Received on Wednesday, 20 March 2013 20:41:51 UTC