'Variants' and Chromosome Modelling from Jeremy J Carroll on 2013-03-21 (public-semweb-lifesci@w3.org from March 2013)

From: Jeremy J Carroll <jjc@syapse.com>
Date: Thu, 21 Mar 2013 15:00:42 -0700
To: w3c semweb HCLS <public-semweb-lifesci@w3.org>
Message-Id: <288EDA40-5274-4725-9E74-FC9EFF726008@syapse.com>

Jerven suggests:

"instead of saying chrM it would have been solved by
using http://my.lab.org/confidential/patientXXYYZZ/genome/sampleXX/ChrM/assemblyTTv43/VariantCalls5"

rather than continuing the philosophical/theological threads ….
I am interested in this practical question.

chrM as an address

I am wanting to represent bases on chrM, how should do I do this?

My current intent is to continue with the model and the ontology implicit in the VCF format (1000 genomes) and make somewhat more explicit.

In this model "chrM: 5000 - 5003" identifies 4 bases (inclusive end point) in the mitochondrial DNA in some reference assembly …. if I have understood correctly, and the items of interest to be modeled are variations against that reference assembly. In this model, we may choose to use an address like "chrM: 5000 - 5003" to identify some part of a reference assembly from which the current experimental assembly differs.

In this way of thinking, I am not really interested in an assembly of ChrM for patient XXYZZ's sampleXX, and so Jerven URI to refer to that is not so useful.
I guess I am surprised to see Jerven suggesting a URI in which the assembly is part of the ChrM rather than the other way round.

variants, defaults, non-monotonic reasoning

Part of my problem here is to do with defaults and diffs and knowledge and modeling ….
In general, the smart money likes monotonic reasoning as opposed to non-monotonic reasoning; because of reasoning tractability issues. Defaults, diffs, variants, all tend to non monotonic reasoning, or closed world assumptions or … since if I have not been told that a particular sample's assembly has a variant from the reference assembly at a particular position then I effectively assume that the base in my sample's assembly is the same as the base in the reference assembly. In practice this is then an issue when the quality of the non-variant call is questionable. (see
https://sites.google.com/site/gvcftools/home/about-gvcf
concerning non-variant sites)

My gut feel is that these concerns, while theoretically well-founded, are practically irrelevant - we simply need to engineer our knowledge systems so that we do have 'complete' variant information, and some awareness that any individual call (either variant or non-variant) may be wrong. 'complete' may have a rather parochial system-specific meaning ...

Without the defaults, and the diffs and all the rest, the storage and query tractability issues appear overwhelming …. and so there isn't really any practical choice here.

phases of analysis

Analysis of the raw experiments in sequencing machines takes place in phases; and each phase does in practice need to assume the results of the previous phase; with some awareness of the shades of grey in such assumptions. Each phase essentially passes only 'output' to the next stage, and we cannot, in practice, forever return to the raw data to justify every step at every stage.

practical ? proposal for representing an assembly of a patient's sample

_:sample eg:sampleFromFile <ftp://example.org/mypatientsample.vcf> .

# metadata headers from VCF file, cleaned up somewhat
<ftp://example.org/mypatientsample.vcf> vcf:reference <http://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.17/>
<ftp://example.org/mypatientsample.vcf> vcf:fileDate "2012-06-26"^^xsd:date .

# each row of the data of the VCF file becomes something like

_:sample eg:hasVariant [

eg:aboutGenomePosition [
# we use a restricted vocabulary of chromosome names
eg:chromosome eg:chrM ;
eg:startPosition "5000"^^xsd:int ;
eg:endPosition "5003"^^xsd:int ;
eg:referenceSequence _:ref5000 ;
eg:alternateSequence _:alt5000
# more stuff from ID, ALT, QUAL, FILTER and INFO fields of VCF
]

# some mapping of the per-sample field
# e.g. in 1000 genomes data FORMAT=GT:DS:GL 1|0:1.000:-1.69,-0.01,-5.00
# the 1|0 is a phased genotype call
eg:GT [
eg:phase _:p1 ;
eg:gtCall _:alt5000 ;
]
eg:GT [
eg:phase _:p2 ;
eg:gtCall _:ref5000 ;
]
].
_:ref5000 eg:sequence "ACTG" .
_:alt5000 eg:sequence "A" .

Hmmmm, there are a lot of modeling questions in there. The VCF file format has some answers, but not very good ones, partly because the questions do not appear to have been asked as modeling questions.
It seems pretty unclear to me how to include the GL (Genotype Likelihood) values in there. I think these are used to help make the genotype call; and then kept around in case you don't like the call.
The phasing also seems problematic, since it seems that it is generally useful information as to which strand which allele was seen on, (for example for hapliotype identification) but in practice we can't trace a strand all the way through a chromosome.

Further the genotype call may be phased (ordered with respect to genotype calls at at least one other position), or unphased (i.e. an unordered pair); and the two values may be the same or different - the best way to model that is ??? All my ideas seem at least a little awkward.

Or would it be better just to dump this stuff in an RDB, and be done with it.

Jeremy

Received on Thursday, 21 March 2013 22:01:21 UTC