Re: Observations about facts in genomics from Jerven Bolleman on 2013-03-21 (public-semweb-lifesci@w3.org from March 2013)

From: Jerven Bolleman <me@jerven.eu>
Date: Thu, 21 Mar 2013 15:42:12 +0100
To: Phillip Lord <phillip.lord@newcastle.ac.uk>
Cc: Graham Klyne <graham.klyne@zoo.ox.ac.uk>, Jeremy J Carroll <jjc@syapse.com>, w3c semweb HCLS <public-semweb-lifesci@w3.org>, Pat Hayes <phayes@ihmc.us>
Message-ID: <CAHM_hUMkCYKXHgdu3q2E=bwHgF7Oy2MwFwzJNUrh7Fx5O5ejAw@mail.gmail.com>
On Thu, Mar 21, 2013 at 2:55 PM, Phillip Lord
<phillip.lord@newcastle.ac.uk> wrote:
> This is a broken definition of "good" to my mind. It suggests that we
> should make all the distinctions that we can make, all the time.
> Unfortunately, this means that everyone bears the cost of the complexity
> all the time also.
True, but the other option is the current situation, where we all bear
the complexity of not knowing what someone is really talking about.
That leads to merging of information that should never have been merged
and to conclusions that are not worth the pixels they are displayed on.
Sure, there is a cost to ever more complex representations of
information to match reality, and this is not what I am advocating. I
am advocating giving reality a different IRI than the model.

In this case, instead of saying chrM, it would have been solved by
using http://my.lab.org/confidential/patientXXYYZZ/genome/sampleXX/ChrM/assemblyTTv43/VariantCalls5
Sure, it adds extra complexity in the naming, but it also allows a lot
of access points for your LIMS to add essential information. It's not
the data model that changed; it's what the model is identified as that
has changed. That makes the subtle difference between the concept and
the model of the concept explicit.
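
As a rough sketch of the naming idea above (not taken from any real
LIMS): the snippet below takes the first five fields of the VCF line
quoted further down in this thread and emits N-Triples whose subject is
a variant-call IRI rather than a bare "chrM". The base IRI is the
placeholder from this message; the example.org vocabulary, the function
names, and the "/pos942" suffix are assumptions made up for
illustration.

```python
# Sketch: identify the variant-call model, not the chromosome concept.
# BASE is the placeholder IRI from this message; the vocabulary under
# VOCAB and the "/pos<N>" suffix are hypothetical.
BASE = ("http://my.lab.org/confidential/patientXXYYZZ/genome/sampleXX/"
        "ChrM/assemblyTTv43/VariantCalls5")
VOCAB = "http://example.org/vocab#"
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"


def parse_vcf_fields(line):
    """Return the first five whitespace-separated VCF fields as a dict."""
    chrom, pos, rsid, ref, alt = line.split()[:5]
    return {"chrom": chrom, "pos": int(pos), "id": rsid, "ref": ref, "alt": alt}


def to_ntriples(record, base=BASE):
    """Emit N-Triples about one call within one variant-call set."""
    subject = f"<{base}/pos{record['pos']}>"
    return [
        f"{subject} <{VOCAB}partOfCallSet> <{base}> .",
        f"{subject} <{VOCAB}position> \"{record['pos']}\"^^<{XSD_INT}> .",
        f"{subject} <{VOCAB}dbSnpId> \"{record['id']}\" .",
        f"{subject} <{VOCAB}referenceBase> \"{record['ref']}\" .",
        f"{subject} <{VOCAB}calledBase> \"{record['alt']}\" .",
    ]


vcf_line = "chrM    942     rs28579222      A       G       .       ."
triples = to_ntriples(parse_vcf_fields(vcf_line))
print("\n".join(triples))
```

Every statement is about the call set for one sample and one assembly,
so a LIMS can attach provenance to that IRI without claiming anything
about the patient's chromosome itself.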

>
> A good data model should be an accurate reflection of biology.
A good data model should be an accurate reflection of the data first.
Your data, once collected, does not change; your concepts of the biology
do (or should) change over time.

> But it
> should also be a convenient model of biology. And the distinction that
> you are making is relevant to only a subset of use cases.
Sure, but it has the advantage of also distinguishing between the ChrM
of patientA and patientB by virtue of having a unique ID. Mixed-up
samples and records are a potential issue in all labs.
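
To illustrate the point about mixed-up samples, a toy sketch: when
observations are keyed by the bare label "chrM", calls from two patients
collapse into one entry; keyed by a per-sample IRI they stay apart. The
patient IRIs and the called bases here are invented for the example.

```python
from collections import defaultdict

# Invented observations:
# (label, hypothetical per-sample IRI, position, called base)
observations = [
    ("chrM", "http://my.lab.org/patientA/sample1/ChrM/calls", 942, "G"),
    ("chrM", "http://my.lab.org/patientB/sample1/ChrM/calls", 942, "A"),
]


def group_calls(obs, key_index):
    """Group (position, base) pairs by the chosen identifier column."""
    grouped = defaultdict(set)
    for row in obs:
        grouped[row[key_index]].add((row[2], row[3]))
    return dict(grouped)


by_label = group_calls(observations, 0)  # keyed by the label "chrM"
by_iri = group_calls(observations, 1)    # keyed by the per-sample IRI

# Under the shared label, position 942 appears to be both G and A.
print(len(by_label), sorted(by_label["chrM"]))
# Under distinct IRIs, each patient keeps its own single call.
print(len(by_iri))
```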
> Make the
> distinctions you care about. Let others make the distinctions that they
> care about.
Agreed, just identify your distinction! Also, the data format is not the
user interface (or at least it should not be in an ideal world).

Jerven
>
> Phil
>
>
>
>
> Jerven Bolleman <me@jerven.eu> writes:
>> This is fine in RDF, the important thing to separate is the concept of
>> a Chromosome/Patient sequence and a set of observations and hypotheses
>> about that Chromosome sequence.
>>
>> So instead of chromosome M you are really talking about assembly X of
>> a set of reads R mapped via some (variant calling) processes to
>> reference chromosome C that is also really an assembly of a different
>> set of reads. Subtly different and not always made explicit in
>> conversation, but in good RDF representations it should be.
>>
>> In RDF you need to be careful about what you are identifying. As
>> long as you are correct about what you identified (in this case a
>> variant-called, mapped assembly) instead of what you are discussing in
>> English (a patient's chromosome), you will end up fine. If you do this
>> you don't need anything as exotic as frames etc...
>>
>> Regards,
>> Jerven
>>
>> On Wed, Mar 20, 2013 at 9:23 PM, Graham Klyne <graham.klyne@zoo.ox.ac.uk> wrote:
>>> Hi Jeremy,
>>>
>>>
>>> On 20/03/2013 16:04, Jeremy J Carroll wrote:
>>>> One of the things I am learning about genetic sequencing is this process,
>>>> which is meant to tell you about the patient's DNA, is in fact somewhat
>>>> problematic, resulting in facts which are disputable.
>>>>
>>>
>>> It gets worse... the association between sequence fragments and genes
>>> changes over time as knowledge is improved, I understand in ways that aren't
>>> always reflected in published information.  GMOD/CHADO
>>> (http://gmod.org/wiki/Introduction_to_Chado) keeps all the concepts very
>>> separate to allow for this, but the translation to RDF can get very
>>> convoluted (Al Miles did some work on a mapping, a few years ago).
>>>
>>> I also understand that there's emerging research that shows that non-coding
>>> regions, which were previously thought to be meaningless/irrelevant, do
>>> actually have relevant roles in the overall genetic machinery (something to
>>> do with regulation?).
>>>
>>> One of the many reasons I'd like RDF to have some flexibility to deal with
>>> contexts, or differing worldviews, is to allow representation of evolving
>>> information without having to make explicit all those things that
>>> researchers sometimes don't bother to make explicit (e.g. genes vs proteins,
>>> sequence vs gene, etc.).  And then there's all the stuff we don't yet know to
>>> make explicit. ("frame problem", anyone?)
>>>
>>> #g
>>> --
>>>
>>>
>>>
>>> On 20/03/2013 16:04, Jeremy J Carroll wrote:
>>>>
>>>> Pat Hayes wrote:
>>>>
>>>> "[RDF] is intended for recording data, and most data is pretty mundane
>>>> stuff about which there is not a lot of factual disagreement."
>>>>
>>>> One of the things I am learning about genetic sequencing is this process,
>>>> which is meant to tell you about the patient's DNA, is in fact somewhat
>>>> problematic, resulting in facts which are disputable.
>>>>
>>>> So, a data file that I am trying to get my head around at the moment
>>>> contains a line like:
>>>>
>>>> chrM    942     rs28579222      A       G       .       .
>>>> ASP;HD;OTHERKG;RSPOS=942;SAO=0;SF=0;SSR=0;VC=SNV;VP=050000000005000402000100;WGT=1;dbSNPBuildID=125
>>>>
>>>>
>>>> So far, I have understood the first five fields, as saying that in a
>>>> particular position in the DNA (the 942nd base in the mitochondrial DNA, aka
>>>> rs28579222), when one might have expected to see an A a sample had a G.
>>>> But that last part "a sample had a G" is in fact open to doubt … There is
>>>> a complex piece of chemistry, physics and computing that guesses that there
>>>> is a G in that position. It is possible to see some of the less processed
>>>> data that fed into that guess, and to see levels of confidence that the
>>>> different algorithms had with the results; but it is not a slam dunk by any
>>>> means. So, some more skeptical people want to be able to see the 'raw read
>>>> data' prior to the decision that this is a G. Usually one would expect to
>>>> see some of the raw read data agree with the G, and some disagree.
>>>>
>>>>
>>>> Since this assertion (that this position is a G) is made with a few
>>>> million similar assertions, all of which have some element of doubt - it
>>>> would be highly surprising if every single call were correct: yet within the
>>>> logic of RDF we probably end up asserting the truth of the whole graph …
>>>> which leads us onto the dangerous path of ex contradictione quodlibet.
>>>>
>>>
>>>
>
> --
> Phillip Lord,                           Phone: +44 (0) 191 222 7827
> Lecturer in Bioinformatics,             Email: phillip.lord@newcastle.ac.uk
> School of Computing Science,            http://homepages.cs.ncl.ac.uk/phillip.lord
> Room 914 Claremont Tower,               skype: russet_apples
> Newcastle University,                   twitter: phillord
> NE1 7RU



-- 
Jerven Bolleman
me@jerven.eu
Received on Thursday, 21 March 2013 14:42:43 UTC