Re: 'Variants' and Chromosome Modelling from Jeremy J Carroll on 2013-03-26 (public-semweb-lifesci@w3.org from March 2013)

From: Jeremy J Carroll <jjc@syapse.com>
Date: Tue, 26 Mar 2013 13:30:06 -0700
To: "Freimuth, Robert R., Ph.D." <Freimuth.Robert@mayo.edu>
Cc: w3c semweb HCLS <public-semweb-lifesci@w3.org>
Message-Id: <71C208E6-9897-4B20-B0C6-0F35704B99F2@syapse.com>
Hi Bob

In this message I reply to

> It will certainly be important to track metadata about the sequence analysis method, etc.

but not to

> In my opinion, the real potential of using semweb technologies with genetic data is in the layers of interpretation that are built from the genetic sequences.  


While I know very little about sequencing, from a knowledge engineering perspective I am cautious about your claim that it is "important to track metadata about the sequence analysis method, etc. "., particularly that this is more important than working out how to make use of the contents of the VCF file.

The reason for this, is that the formats already used are motivated by some interoperability requirements, that may not have been articulated.
I see files with such metadata in, and jt may be lines like:

##source_20121101.1=vcf-merge(r731) Broad.LUAD.exome.cleaned.somatic.sorted.vcf.corrected.vcf.gz Broad.LUSC_Paper_v8.aggregated.tcga.somatic.sorted.vcf.corrected.vcf.gz genome.wustl.edu_BRCA.IlluminaGA_DNASeq.Level_2.5.1.0.somatic.sorted.vcf.corrected.vcf.gz
##sourceFiles_20121101.1=0:Broad.LUAD.exome.cleaned.somatic.sorted.vcf.corrected.vcf.gz,1:Broad.LUSC_Paper_v8.aggregated.tcga.somatic.sorted.vcf.corrected.vcf.gz,2:genome.wustl.edu_BRCA.IlluminaGA_DNASeq.Level_2.5.1.0.somatic.sorted.vcf.corrected.vcf.gz


or like

##FilterLiftedVariants="analysis_type=FilterLiftedVariants input_file=[] read_buffer_size=null phone_home=STANDARD read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL reference_sequence=/humgen/1kg/reference/human_g1k_v37.fasta rodBind=[] nonDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false defaultBaseQualities=-1 validation_strictness=SILENT unsafe=null num_threads=1 read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false disable_experimental_low_memory_sharding=false logging_level=INFO log_to_file=null help=false variant=(RodBinding name=variant source=./0.451323408008651.sorted.vcf) out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEADER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub filter_mismatching_base_and_quals=false"

now, obviously somebody, somewhere upstream from me, thought it useful to add this information; but the community has not (yet) found this information sufficiently useful across organizations, to document what any of this means. While looking at it, I can guess, I am not in a position to write any code that can operate on such metadata in a systematic fashion.

My conclusion is that this metadata has not yet proved important.

OTOH, the bread and butter of the VCF file is the variant and sample data. While this is not the most straightforward of formats, there has been community effort to document it, which indicates that it is probably genuinely important to more than one person that I can find the genotype likelihood for  a particular call for a particular sample. For example, by a process of reading the manual, I can find out that 
in ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.chr7.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz
on the 675th column of the 488th line
the 1|0:1.000:-4.3974,-0.00013033,-3.58503 means that 
the most likely call for the 61657th position of chromosome 7 is the ordered pair (A,G), 
where the ordering is in phase with other entries in the file, 
and the likelihood of that call is -0.00013033 as opposed to -4.3974 for (G,G) and -3.58503 for (A,A)
and the MaCH/Thunder dosage is 1 as described here
http://www.1000genomes.org/faq/what-does-genotype-dosage-mean-phase1-integrated-call-set

My point is that the bulk of the VCF file is usable to the extent that I can write a program that uses it according to the specifications, whereas the metadata about the process that created the VCF file is not; it is not much more than some scribbles on the back-of-an-envelope. If I wish to write a program to use these scribbles, I need to find the person who wrote the scribbles and ask them for help.

A very simple example, is that the reference field is documented, but the legal values are not (or the documentation is not followed) and so a whole range of possibilities comes up - without fixing the reference the file is actually pretty meaningless, and I have to use real-world knowledge and heuristics to know what any one particular system means.

My conclusions are:
- providing systematic support for tracking the metadata about the sequence analysis method is possibly an easy win (current solutions do not seem very good)
- providing easier methods for navigating around the contents of the VCF file is a difficult, but possibly more important, win (current solutions appear to work, but may be limited)

Jeremy



==
Jeremy J Carroll
Principal Architect
Syapse, Inc.



On Mar 26, 2013, at 10:58 AM, "Freimuth, Robert R., Ph.D." <Freimuth.Robert@mayo.edu> wrote:

> > Hmmmm, there are a lot of modeling questions in there.
>  
> The adage "all models are wrong, but some are useful" comes to mind.  To answer these questions, you need to define your use cases.  What are you trying to model?  Why?  How is the data going to be used?
>  
> Are you trying to model the sequencing and primary data analysis steps?  Or metadata about the sequencing technology and instrument/platform?  Or a physical piece of DNA?  Or the myriad annotations that can be associated with a region of DNA sequence (each with their own provenance)?  Or the current state of our collective knowledge of molecular and cellular biology related to a given DNA sequence?  Or the clinical phenotype (e.g., disease) and/or treatment options that might be related to a particular variant?  Or...
>  
> It is possible to create a very intricate model that represents all of this (and more).  However, it is likely unnecessary (unless you've got some monster use cases).
>  
> The mol bio/genetics community has spent decades refining ways to express genetic data.  One could create a model for a VCF file, but I'm not sure it would be all that useful.  VCF was developed (in part) to be a compact file format for representing a list of genetic variants.  By definition, it includes only the differences from some reference sequence.  It was not intended to be an accurate model of biology.
>  
> I've spent some time thinking about and exploring ways to express genetic data in RDF.  I have yet to find a compelling example where the RDF representation has a significant advantage (and in most cases the opposite is true).  That said, it is quite possible that someone more proficient in RDF will succeed where I have not, and I look forward to the day if/when that occurs.
>  
> > All my ideas seem at least a little awkward.
>  
> Indeed.
>  
> It will certainly be important to track metadata about the sequence analysis method, etc.  In some cases it will be important to have information about the confidence or quality score for a base call at a given position (most likely to aid reconciliation efforts, when multiple sequences for the same sample are obtained).  Haplotype phasing will also start to become an issue as techniques are developed to determine it experimentally (as opposed to the statistical approaches that are currently used).  I suppose all this could be expressed in RDF, provided there is a use case driving the effort.
>  
> In my opinion, the real potential of using semweb technologies with genetic data is in the layers of interpretation that are built from the genetic sequences.  While the underlying genetic sequences can be rebuilt and refined over time, there are plenty of existing tools that can manage this process very efficiently.  Our collective knowledge about those sequences, however, advances continuously.  Changes in our understanding in one area might cascade into others.  We need a way to dynamically update the interpretations and discover novel relationships.  Genetic data (in any format) + biological knowledge (in RDF?) + reasoners could be a powerful combination.
>  
> What is the impact of a genetic variant at a given location?  This is a hot field of study within genetic/bioinformatic research, and solutions to this problem will be critical for clinical personalized medicine programs.
> 
> Bob
>  
> 
> From: public-semweb-lifesci-request@listhub.w3.org [mailto:public-semweb-lifesci-request@listhub.w3.org] On Behalf Of Jeremy J Carroll
> Sent: Thursday, March 21, 2013 5:01 PM
> To: w3c semweb HCLS
> Subject: 'Variants' and Chromosome Modelling
> 
> 
> Jerven suggests:
> 
> "instead of saying chrM it would have been solved by
> using http://my.lab.org/confidential/patientXXYYZZ/genome/sampleXX/ChrM/assemblyTTv43/VariantCalls5"
> 
> rather than continuing the philosophical/theological threads ….
> I am interested in this practical question.
> 
> 
> chrM as an address
> 
> I am wanting to represent bases on chrM, how should do I do this?
> 
> My current intent is to continue with the model and the ontology implicit in the VCF format (1000 genomes) and make somewhat more explicit.
> 
> In this model "chrM: 5000 - 5003" identifies 4 bases (inclusive end point) in the mitochondrial DNA in some reference assembly …. if I have understood correctly, and the items of interest to be modeled are variations against that reference assembly. In this model, we may choose to use an address like "chrM: 5000 - 5003" to identify some part of a reference assembly from which the current experimental assembly differs.
> 
> In this way of thinking, I am not really interested in an assembly of ChrM for patient XXYZZ's sampleXX, and so Jerven URI to refer to that is not so useful.
> I guess I am surprised to see Jerven suggesting a URI in which the assembly is part of the ChrM rather than the other way round.
> 
> 
> variants, defaults, non-monotonic reasoning
> 
> Part of my problem here is to do with defaults and diffs and knowledge and modeling ….
> In general, the smart money likes monotonic reasoning as opposed to non-monotonic reasoning; because of reasoning tractability issues. Defaults, diffs, variants, all tend to non monotonic reasoning, or closed world assumptions or … since if I have not been told that a particular sample's assembly has a variant from the reference assembly at a particular position then I effectively assume that the base in my sample's assembly is the same as the base in the reference assembly. In practice this is then an issue when the quality of the non-variant call is questionable. (see
> https://sites.google.com/site/gvcftools/home/about-gvcf
> concerning non-variant sites)
> 
> My gut feel is that these concerns, while theoretically well-founded, are practically irrelevant - we simply need to engineer our knowledge systems so that we do have 'complete' variant information, and some awareness that any individual call (either variant or non-variant) may be wrong. 'complete' may have a rather parochial system-specific meaning ...
> 
> Without the defaults, and the diffs and all the rest, the storage and query tractability issues appear overwhelming …. and so there isn't really any practical choice here.
> 
> phases of analysis
> 
> Analysis of the raw experiments in sequencing machines takes place in phases; and each phase does in practice need to assume the results of the previous phase; with some awareness of the shades of grey in such assumptions. Each phase essentially passes only 'output' to the next stage, and we cannot, in practice, forever return to the raw data to justify every step at every stage.
> 
> 
> 
> practical ? proposal for representing an assembly of a patient's sample
> 
> 
> _:sample eg:sampleFromFile   <ftp://example.org/mypatientsample.vcf> .
> 
> # metadata headers from VCF file, cleaned up somewhat
> <ftp://example.org/mypatientsample.vcf> vcf:reference <http://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.17/>
> <ftp://example.org/mypatientsample.vcf> vcf:fileDate "2012-06-26"^^xsd:date .
> 
> # each row of the data of the VCF file becomes something like
> 
> _:sample eg:hasVariant [
> 
>    eg:aboutGenomePosition [ 
> # we use a restricted vocabulary of chromosome names
>        eg:chromosome eg:chrM ;
>        eg:startPosition "5000"^^xsd:int ;
>        eg:endPosition "5003"^^xsd:int ;
>        eg:referenceSequence _:ref5000 ;
>        eg:alternateSequence _:alt5000
>  # more stuff from ID, ALT, QUAL, FILTER and INFO fields of VCF
>    ]
> 
> # some mapping of the per-sample field
> # e.g. in 1000 genomes data FORMAT=GT:DS:GL 1|0:1.000:-1.69,-0.01,-5.00
> # the 1|0 is a phased genotype call
>    eg:GT [
>       eg:phase _:p1 ;
>       eg:gtCall _:alt5000 ;
>    ]
>    eg:GT [
>       eg:phase _:p2 ;
>       eg:gtCall _:ref5000 ;
>    ]
> ]. 
> _:ref5000 eg:sequence "ACTG" .
> _:alt5000 eg:sequence "A" .
> 
> 
> 
> Hmmmm, there are a lot of modeling questions in there. The VCF file format has some answers, but not very good ones, partly because the questions do not appear to have been asked as modeling questions.
> It seems pretty unclear to me how to include the GL (Genotype Likelihood) values in there. I think these are used to help make the genotype call; and then kept around in case you don't like the call.
> The phasing also seems problematic, since it seems that it is generally useful information as to which strand which allele was seen on, (for example for hapliotype identification) but in practice we can't trace a strand all the way through a chromosome.
> 
> Further the genotype call may be phased (ordered with respect to genotype calls at at least one other position), or unphased (i.e. an unordered pair); and the two values may be the same or different - the best way to model that is ??? All my ideas seem at least a little awkward.
> 
> Or would it be better just to dump this stuff in an RDB, and be done with it.
> 
> Jeremy
> 
> 
> 
> 
>
Received on Tuesday, 26 March 2013 20:30:36 UTC