RE: 'Variants' and Chromosome Modelling from Freimuth, Robert R., Ph.D. on 2013-03-26 (public-semweb-lifesci@w3.org from March 2013)

From: Freimuth, Robert R., Ph.D. <Freimuth.Robert@mayo.edu>
Date: Tue, 26 Mar 2013 17:58:08 +0000
To: Jeremy J Carroll <jjc@syapse.com>, w3c semweb HCLS <public-semweb-lifesci@w3.org>
Message-ID: <76A706C559A90249BA321EE35470B85722EC5CBE@MSGPEXCEI12A.mfad.mfroot.org>
> Hmmmm, there are a lot of modeling questions in there.

The adage "all models are wrong, but some are useful" comes to mind.  To answer these questions, you need to define your use cases.  What are you trying to model?  Why?  How is the data going to be used?

Are you trying to model the sequencing and primary data analysis steps?  Or metadata about the sequencing technology and instrument/platform?  Or a physical piece of DNA?  Or the myriad annotations that can be associated with a region of DNA sequence (each with their own provenance)?  Or the current state of our collective knowledge of molecular and cellular biology related to a given DNA sequence?  Or the clinical phenotype (e.g., disease) and/or treatment options that might be related to a particular variant?  Or...

It is possible to create a very intricate model that represents all of this (and more).  However, it is likely unnecessary (unless you've got some monster use cases).

The mol bio/genetics community has spent decades refining ways to express genetic data.  One could create a model for a VCF file, but I'm not sure it would be all that useful.  VCF was developed (in part) to be a compact file format for representing a list of genetic variants.  By definition, it includes only the differences from some reference sequence.  It was not intended to be an accurate model of biology.

I've spent some time thinking about and exploring ways to express genetic data in RDF.  I have yet to find a compelling example where the RDF representation has a significant advantage (and in most cases the opposite is true).  That said, it is quite possible that someone more proficient in RDF will succeed where I have not, and I look forward to the day if/when that occurs.

> All my ideas seem at least a little awkward.

Indeed.

It will certainly be important to track metadata about the sequence analysis method, etc.  In some cases it will be important to have information about the confidence or quality score for a base call at a given position (most likely to aid reconciliation efforts, when multiple sequences for the same sample are obtained).  Haplotype phasing will also start to become an issue as techniques are developed to determine it experimentally (as opposed to the statistical approaches that are currently used).  I suppose all this could be expressed in RDF, provided there is a use case driving the effort.

In my opinion, the real potential of using semweb technologies with genetic data is in the layers of interpretation that are built from the genetic sequences.  While the underlying genetic sequences can be rebuilt and refined over time, there are plenty of existing tools that can manage this process very efficiently.  Our collective knowledge about those sequences, however, advances continuously.  Changes in our understanding in one area might cascade into others.  We need a way to dynamically update the interpretations and discover novel relationships.  Genetic data (in any format) + biological knowledge (in RDF?) + reasoners could be a powerful combination.

What is the impact of a genetic variant at a given location?  This is a hot field of study within genetic/bioinformatic research, and solutions to this problem will be critical for clinical personalized medicine programs.

Bob


________________________________
From: public-semweb-lifesci-request@listhub.w3.org [mailto:public-semweb-lifesci-request@listhub.w3.org] On Behalf Of Jeremy J Carroll
Sent: Thursday, March 21, 2013 5:01 PM
To: w3c semweb HCLS
Subject: 'Variants' and Chromosome Modelling


Jerven suggests:

"instead of saying chrM it would have been solved by
using http://my.lab.org/confidential/patientXXYYZZ/genome/sampleXX/ChrM/assemblyTTv43/VariantCalls5"

rather than continuing the philosophical/theological threads ....
I am interested in this practical question.


chrM as an address

I want to represent bases on chrM; how should I do this?

My current intent is to continue with the model and the ontology implicit in the VCF format (1000 genomes) and make it somewhat more explicit.

In this model "chrM: 5000 - 5003" identifies 4 bases (inclusive end point) in the mitochondrial DNA in some reference assembly .... if I have understood correctly, and the items of interest to be modeled are variations against that reference assembly. In this model, we may choose to use an address like "chrM: 5000 - 5003" to identify some part of a reference assembly from which the current experimental assembly differs.
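
A minimal Python sketch of the inclusive addressing described above. The names Region and parse_region are hypothetical (they come from no VCF library), and the accepted text format is assumed to be exactly the "chrM: 5000 - 5003" style used here.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    """An address on a reference assembly; both end points are inclusive."""
    chromosome: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive

    def length(self) -> int:
        # Inclusive end point, so 5000-5003 covers 4 bases, not 3.
        return self.end - self.start + 1


# Assumed input style: "<chromosome>: <start> - <end>"
_REGION = re.compile(r"^\s*(\w+)\s*:\s*(\d+)\s*-\s*(\d+)\s*$")


def parse_region(text: str) -> Region:
    """Parse an address such as 'chrM: 5000 - 5003' (hypothetical helper)."""
    match = _REGION.match(text)
    if match is None:
        raise ValueError(f"not a region address: {text!r}")
    start, end = int(match.group(2)), int(match.group(3))
    if end < start:
        raise ValueError(f"end before start in {text!r}")
    return Region(match.group(1), start, end)


region = parse_region("chrM: 5000 - 5003")
print(region.chromosome, region.length())  # chrM 4
```

Note that the address says nothing about which reference assembly it belongs to; that would have to be carried separately.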

In this way of thinking, I am not really interested in an assembly of ChrM for patient XXYYZZ's sampleXX, and so Jerven's URI to refer to that is not so useful.
I guess I am surprised to see Jerven suggesting a URI in which the assembly is part of the ChrM rather than the other way round.


variants, defaults, non-monotonic reasoning

Part of my problem here is to do with defaults and diffs and knowledge and modeling ....
In general, the smart money likes monotonic reasoning as opposed to non-monotonic reasoning, because of reasoning tractability issues. Defaults, diffs, variants, all tend to non-monotonic reasoning, or closed world assumptions or ... since if I have not been told that a particular sample's assembly has a variant from the reference assembly at a particular position then I effectively assume that the base in my sample's assembly is the same as the base in the reference assembly. In practice this is then an issue when the quality of the non-variant call is questionable. (see
https://sites.google.com/site/gvcftools/home/about-gvcf
concerning non-variant sites)
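
The default described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: base_at is a hypothetical helper, variants are restricted to single-base substitutions, and the "confirmed" set stands in for the kind of explicit non-variant call that a gVCF-style file records.

```python
from typing import Dict, Optional, Set, Tuple


def base_at(
    position: int,
    reference: str,
    variants: Dict[int, str],
    confirmed: Optional[Set[int]] = None,
) -> Tuple[str, str]:
    """Return (base, status) for a 1-based position in a sample.

    Closed-world default: a position with no recorded variant is taken
    to match the reference. The status records whether that match was
    actually called or merely assumed.
    """
    if not 1 <= position <= len(reference):
        raise IndexError(f"position {position} is outside the reference")
    if position in variants:
        return variants[position], "variant"
    ref_base = reference[position - 1]
    if confirmed is not None and position in confirmed:
        return ref_base, "reference (confirmed)"
    return ref_base, "reference (assumed)"


reference = "ACGT"         # toy reference sequence
variants = {2: "T"}        # sample differs from the reference at position 2
confirmed = {1}            # position 1 has an explicit non-variant call

for pos in (1, 2, 3):
    print(pos, base_at(pos, reference, variants, confirmed))
```

Adding a variant at position 3 later would change the answer for position 3, which is exactly the non-monotonic behaviour in question.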

My gut feel is that these concerns, while theoretically well-founded, are practically irrelevant - we simply need to engineer our knowledge systems so that we do have 'complete' variant information, and some awareness that any individual call (either variant or non-variant) may be wrong. 'complete' may have a rather parochial system-specific meaning ...

Without the defaults, and the diffs and all the rest, the storage and query tractability issues appear overwhelming .... and so there isn't really any practical choice here.

phases of analysis

Analysis of the raw experiments in sequencing machines takes place in phases; and each phase does in practice need to assume the results of the previous phase; with some awareness of the shades of grey in such assumptions. Each phase essentially passes only 'output' to the next stage, and we cannot, in practice, forever return to the raw data to justify every step at every stage.



practical ? proposal for representing an assembly of a patient's sample


_:sample eg:sampleFromFile   <ftp://example.org/mypatientsample.vcf> .

# metadata headers from VCF file, cleaned up somewhat
<ftp://example.org/mypatientsample.vcf> vcf:reference <http://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.17/> .
<ftp://example.org/mypatientsample.vcf> vcf:fileDate "2012-06-26"^^xsd:date .

# each row of the data of the VCF file becomes something like

_:sample eg:hasVariant [

   eg:aboutGenomePosition [
# we use a restricted vocabulary of chromosome names
       eg:chromosome eg:chrM ;
       eg:startPosition "5000"^^xsd:int ;
       eg:endPosition "5003"^^xsd:int ;
       eg:referenceSequence _:ref5000 ;
       eg:alternateSequence _:alt5000
 # more stuff from ID, ALT, QUAL, FILTER and INFO fields of VCF
   ] ;

# some mapping of the per-sample field
# e.g. in 1000 genomes data FORMAT=GT:DS:GL 1|0:1.000:-1.69,-0.01,-5.00
# the 1|0 is a phased genotype call
   eg:GT [
      eg:phase _:p1 ;
      eg:gtCall _:alt5000 ;
   ] ;
   eg:GT [
      eg:phase _:p2 ;
      eg:gtCall _:ref5000 ;
   ]
].
_:ref5000 eg:sequence "ACTG" .
_:alt5000 eg:sequence "A" .



Hmmmm, there are a lot of modeling questions in there. The VCF file format has some answers, but not very good ones, partly because the questions do not appear to have been asked as modeling questions.
It seems pretty unclear to me how to include the GL (Genotype Likelihood) values in there. I think these are used to help make the genotype call; and then kept around in case you don't like the call.
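
One way to keep the GL values next to the call rather than discard them is sketched below in Python, using the per-sample example from the comment above. The helper names are hypothetical. It assumes GL holds log10-scaled likelihoods listed in the usual VCF genotype order (0/0, 0/1, 1/1 for a biallelic site) and treats DS as a plain number.

```python
from typing import Any, Dict, List


def parse_sample_field(format_spec: str, sample_value: str) -> Dict[str, Any]:
    """Pair FORMAT keys with one sample's colon-separated values."""
    keys = format_spec.split(":")
    values = sample_value.split(":")
    if len(keys) != len(values):
        raise ValueError("FORMAT and sample value have different lengths")
    record: Dict[str, Any] = dict(zip(keys, values))
    if "DS" in record:
        record["DS"] = float(record["DS"])
    if "GL" in record:
        # One log10-scaled likelihood per possible genotype.
        record["GL"] = [float(x) for x in record["GL"].split(",")]
    return record


def most_likely_genotype_index(gl: List[float]) -> int:
    """Index of the largest (least negative) log10 likelihood."""
    return max(range(len(gl)), key=gl.__getitem__)


record = parse_sample_field("GT:DS:GL", "1|0:1.000:-1.69,-0.01,-5.00")
print(record["GT"], record["GL"])
print(most_likely_genotype_index(record["GL"]))  # 1, i.e. heterozygous
```

Here the likelihoods agree with the heterozygous 1|0 call; when they do not, having both on the same record is what lets a later phase revisit the call.
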
The phasing also seems problematic: knowing which strand each allele was seen on is generally useful (for example, for haplotype identification), but in practice we can't trace a strand all the way through a chromosome.

Further, the genotype call may be phased (ordered with respect to genotype calls at one or more other positions), or unphased (i.e. an unordered pair); and the two values may be the same or different - the best way to model that is ??? All my ideas seem at least a little awkward.
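
One possible shape for this, sketched in Python rather than RDF, and certainly not the only one: keep the alleles as a tuple, and normalise the order only when the call is unphased, so that equality behaves as an unordered pair in that case. Genotype and parse_gt are hypothetical names, and missing alleles (".") are not handled.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Genotype:
    alleles: Tuple[int, ...]  # allele indices: 0 = reference, 1 = first ALT, ...
    phased: bool

    def __post_init__(self) -> None:
        if not self.phased:
            # Unphased calls carry no order, so store them sorted;
            # 0/1 and 1/0 then compare equal.
            object.__setattr__(self, "alleles", tuple(sorted(self.alleles)))

    @property
    def is_homozygous(self) -> bool:
        return len(set(self.alleles)) == 1


def parse_gt(text: str) -> Genotype:
    """Parse a GT value such as '1|0' (phased) or '0/1' (unphased)."""
    phased = "|" in text
    separator = "|" if phased else "/"
    return Genotype(tuple(int(a) for a in text.split(separator)), phased)


print(parse_gt("0/1") == parse_gt("1/0"))  # True: unordered pair
print(parse_gt("0|1") == parse_gt("1|0"))  # False: order is meaningful
print(parse_gt("1|1").is_homozygous)       # True
```

This still leaves open what a phased call is ordered with respect to, since the order is only meaningful relative to other phased positions.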

Or would it be better just to dump this stuff in an RDB, and be done with it?

Jeremy
Received on Tuesday, 26 March 2013 18:00:30 UTC