RE: 'Variants' and Chromosome Modelling from Freimuth, Robert R., Ph.D. on 2013-03-26 (public-semweb-lifesci@w3.org from March 2013)

From: Freimuth, Robert R., Ph.D. <Freimuth.Robert@mayo.edu>
Date: Tue, 26 Mar 2013 21:43:59 +0000
To: Jeremy J Carroll <jjc@syapse.com>
CC: w3c semweb HCLS <public-semweb-lifesci@w3.org>
Message-ID: <76A706C559A90249BA321EE35470B85722EC60C7@MSGPEXCEI12A.mfad.mfroot.org>
Hi Jeremy,

The examples included below provide some of the metadata that describe how the data was generated.  For example, it includes parameters for the sequence analysis steps (including the reference sequence and alignment parameters) and hints at the platform that was used to do the sequencing.

> the community has not (yet) found this information sufficiently useful across organizations

Yes and no.  The metadata is probably not critical when a given VCF file is examined in isolation.  However, anyone that wants to integrate genomic data sets will need this information to ensure they are comparing apples to apples.  If, as is often the case, an orange is thrown in (e.g., a set of data from a different platform), then that may be important to know for the downstream analysis.

> My conclusion is that this metadata has not yet proved important

It will be very important when the data is re-interpreted using an updated reference sequence or set of annotations, for example.  In addition, platforms have differing capabilities to detect certain types of variations (e.g., indels) and may perform better or worse in some situations (e.g., duplicated regions).  If a genome is sequenced on multiple platforms and the data is to be integrated, knowing which calls were from which platforms (with which analysis parameters) may be invaluable for the reconciliation process.

Disclaimer:  The comments above may not apply to your use cases.  If these situations are outside of your requirements, then obviously they don't need to impact your design.

> My point is that the bulk of the VCF file is usable to the extent that I can write a program that uses it according to the specifications, whereas the metadata about the process that created the VCF file is not; it is not much more than some scribbles on the back-of-an-envelope. If I wish to write a program to use these scribbles, I need to find the person who wrote the scribbles and ask them for help.

You are correct, and I sympathize with your struggles.  It certainly is easier to parse the structured portions of the file (there are tools to do this, such as [1]) than the free-text, user-defined bits.  While it may be straightforward to parse out the values, determining the semantic meaning of those values probably requires at least some human interaction.  I'm afraid I don't have a good solution for you, unless you're in a position to influence the generation of the VCF files themselves.  My comments were intended to point out the potential utility of the information only.

Ideally, each data set would come with a rigorously-defined description of its provenance.  Some workflow tools provide this capability, but I doubt you'll find that attached to many VCF files out in the wild.  Perhaps you can work with the producer of the data to agree on an internal standard?  (If you're using 1000 genomes data, it may be possible to do this given the relatively small size of the data set.  I am sure that the community would appreciate the information in a structured format if it were available.)

I'll come back to one of your original questions:

> would it be better just to dump this stuff in an RDB

I would encourage you to think about what portions of the data set need to be expressed in RDF and which do not (as dictated by your use cases).  The selection of any back-end storage system should be guided by existing and anticipated use cases.  If you don't foresee the need to have the metadata in RDF, it may not be a good use of resources to do the conversion and semantic annotation.  If that's the case, a low-effort solution might be to do a high-level parse of the metadata and throw it into an RDB for future use - at least you'll have easy access to it should you need it.  That would also allow you to generate some reports to show how much variability there is in the keys/values, which may help guide any normalization efforts with the data owners.

Hope this helps,
Bob

[1] http://vcftools.sourceforge.net/links.html


________________________________
From: Jeremy J Carroll [mailto:jjc@syapse.com]
Sent: Tuesday, March 26, 2013 3:30 PM
To: Freimuth, Robert R., Ph.D.
Cc: w3c semweb HCLS
Subject: Re: 'Variants' and Chromosome Modelling


Hi Bob

In this message I reply to

It will certainly be important to track metadata about the sequence analysis method, etc.

but not to

In my opinion, the real potential of using semweb technologies with genetic data is in the layers of interpretation that are built from the genetic sequences.

While I know very little about sequencing, from a knowledge engineering perspective I am cautious about your claim that it is "important to track metadata about the sequence analysis method, etc. "., particularly that this is more important than working out how to make use of the contents of the VCF file.

The reason for this, is that the formats already used are motivated by some interoperability requirements, that may not have been articulated.
I see files with such metadata in, and jt may be lines like:

##source_20121101.1=vcf-merge(r731) Broad.LUAD.exome.cleaned.somatic.sorted.vcf.corrected.vcf.gz Broad.LUSC_Paper_v8.aggregated.tcga.somatic.sorted.vcf.corrected.vcf.gz genome.wustl.edu<http://genome.wustl.edu>_BRCA.IlluminaGA_DNASeq.Level_2.5.1.0.somatic.sorted.vcf.corrected.vcf.gz
##sourceFiles_20121101.1=0:Broad.LUAD.exome.cleaned.somatic.sorted.vcf.corrected.vcf.gz,1:Broad.LUSC_Paper_v8.aggregated.tcga.somatic.sorted.vcf.corrected.vcf.gz,2:genome.wustl.edu<http://genome.wustl.edu>_BRCA.IlluminaGA_DNASeq.Level_2.5.1.0.somatic.sorted.vcf.corrected.vcf.gz


or like

##FilterLiftedVariants="analysis_type=FilterLiftedVariants input_file=[] read_buffer_size=null phone_home=STANDARD read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL reference_sequence=/humgen/1kg/reference/human_g1k_v37.fasta rodBind=[] nonDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false defaultBaseQualities=-1 validation_strictness=SILENT unsafe=null num_threads=1 read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false disable_experimental_low_memory_sharding=false logging_level=INFO log_to_file=null help=false variant=(RodBinding name=variant source=./0.451323408008651.sorted.vcf) out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEADER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub filter_mismatching_base_and_quals=false"

now, obviously somebody, somewhere upstream from me, thought it useful to add this information; but the community has not (yet) found this information sufficiently useful across organizations, to document what any of this means. While looking at it, I can guess, I am not in a position to write any code that can operate on such metadata in a systematic fashion.

My conclusion is that this metadata has not yet proved important.

OTOH, the bread and butter of the VCF file is the variant and sample data. While this is not the most straightforward of formats, there has been community effort to document it, which indicates that it is probably genuinely important to more than one person that I can find the genotype likelihood for  a particular call for a particular sample. For example, by a process of reading the manual, I can find out that
in ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.chr7.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz
on the 675th column of the 488th line
the 1|0:1.000:-4.3974,-0.00013033,-3.58503 means that

  *   the most likely call for the 61657th position of chromosome 7 is the ordered pair (A,G),
  *   where the ordering is in phase with other entries in the file,
  *   and the likelihood of that call is -0.00013033 as opposed to -4.3974 for (G,G) and -3.58503 for (A,A)
  *   and the MaCH/Thunder dosage is 1 as described here

http://www.1000genomes.org/faq/what-does-genotype-dosage-mean-phase1-integrated-call-set

My point is that the bulk of the VCF file is usable to the extent that I can write a program that uses it according to the specifications, whereas the metadata about the process that created the VCF file is not; it is not much more than some scribbles on the back-of-an-envelope. If I wish to write a program to use these scribbles, I need to find the person who wrote the scribbles and ask them for help.

A very simple example, is that the reference field is documented, but the legal values are not (or the documentation is not followed) and so a whole range of possibilities comes up - without fixing the reference the file is actually pretty meaningless, and I have to use real-world knowledge and heuristics to know what any one particular system means.

My conclusions are:
- providing systematic support for tracking the metadata about the sequence analysis method is possibly an easy win (current solutions do not seem very good)
- providing easier methods for navigating around the contents of the VCF file is a difficult, but possibly more important, win (current solutions appear to work, but may be limited)

Jeremy



==
Jeremy J Carroll
Principal Architect
Syapse, Inc.



On Mar 26, 2013, at 10:58 AM, "Freimuth, Robert R., Ph.D." <Freimuth.Robert@mayo.edu<mailto:Freimuth.Robert@mayo.edu>> wrote:

> Hmmmm, there are a lot of modeling questions in there.

The adage "all models are wrong, but some are useful" comes to mind.  To answer these questions, you need to define your use cases.  What are you trying to model?  Why?  How is the data going to be used?

Are you trying to model the sequencing and primary data analysis steps?  Or metadata about the sequencing technology and instrument/platform?  Or a physical piece of DNA?  Or the myriad annotations that can be associated with a region of DNA sequence (each with their own provenance)?  Or the current state of our collective knowledge of molecular and cellular biology related to a given DNA sequence?  Or the clinical phenotype (e.g., disease) and/or treatment options that might be related to a particular variant?  Or...

It is possible to create a very intricate model that represents all of this (and more).  However, it is likely unnecessary (unless you've got some monster use cases).

The mol bio/genetics community has spent decades refining ways to express genetic data.  One could create a model for a VCF file, but I'm not sure it would be all that useful.  VCF was developed (in part) to be a compact file format for representing a list of genetic variants.  By definition, it includes only the differences from some reference sequence.  It was not intended to be an accurate model of biology.

I've spent some time thinking about and exploring ways to express genetic data in RDF.  I have yet to find a compelling example where the RDF representation has a significant advantage (and in most cases the opposite is true)..  That said, it is quite possible that someone more proficient in RDF will succeed where I have not, and I look forward to the day if/when that occurs.

> All my ideas seem at least a little awkward.

Indeed.

It will certainly be important to track metadata about the sequence analysis method, etc.  In some cases it will be important to have information about the confidence or quality score for a base call at a given position (most likely to aid reconciliation efforts, when multiple sequences for the same sample are obtained).  Haplotype phasing will also start to become an issue as techniques are developed to determine it experimentally (as opposed to the statistical approaches that are currently used).  I suppose all this could be expressed in RDF, provided there is a use case driving the effort.

In my opinion, the real potential of using semweb technologies with genetic data is in the layers of interpretation that are built from the genetic sequences.  While the underlying genetic sequences can be rebuilt and refined over time, there are plenty of existing tools that can manage this process very efficiently.  Our collective knowledge about those sequences, however, advances continuously.  Changes in our understanding in one area might cascade into others.  We need a way to dynamically update the interpretations and discover novel relationships.  Genetic data (in any format) + biological knowledge (in RDF?) + reasoners could be a powerful combination.

What is the impact of a genetic variant at a given location?  This is a hot field of study within genetic/bioinformatic research, and solutions to this problem will be critical for clinical personalized medicine programs.

Bob


________________________________
From: public-semweb-lifesci-request@listhub.w3.org<mailto:public-semweb-lifesci-request@listhub.w3.org> [mailto:public-semweb-lifesci-request@listhub.w3.org<mailto:semweb-lifesci-request@listhub.w3.org>] On Behalf Of Jeremy J Carroll
Sent: Thursday, March 21, 2013 5:01 PM
To: w3c semweb HCLS
Subject: 'Variants' and Chromosome Modelling


Jerven suggests:

"instead of saying chrM it would have been solved by
using http://my.lab.org/confidential/patientXXYYZZ/genome/sampleXX/ChrM/assemblyTTv43/VariantCalls5"

rather than continuing the philosophical/theological threads ....
I am interested in this practical question.


chrM as an address

I am wanting to represent bases on chrM, how should do I do this?

My current intent is to continue with the model and the ontology implicit in the VCF format (1000 genomes) and make somewhat more explicit.

In this model "chrM: 5000 - 5003" identifies 4 bases (inclusive end point) in the mitochondrial DNA in some reference assembly .... if I have understood correctly, and the items of interest to be modeled are variations against that reference assembly. In this model, we may choose to use an address like "chrM: 5000 - 5003" to identify some part of a reference assembly from which the current experimental assembly differs.

In this way of thinking, I am not really interested in an assembly of ChrM for patient XXYZZ's sampleXX, and so Jerven URI to refer to that is not so useful.
I guess I am surprised to see Jerven suggesting a URI in which the assembly is part of the ChrM rather than the other way round.


variants, defaults, non-monotonic reasoning

Part of my problem here is to do with defaults and diffs and knowledge and modeling ....
In general, the smart money likes monotonic reasoning as opposed to non-monotonic reasoning; because of reasoning tractability issues. Defaults, diffs, variants, all tend to non monotonic reasoning, or closed world assumptions or ... since if I have not been told that a particular sample's assembly has a variant from the reference assembly at a particular position then I effectively assume that the base in my sample's assembly is the same as the base in the reference assembly. In practice this is then an issue when the quality of the non-variant call is questionable. (see
https://sites.google.com/site/gvcftools/home/about-gvcf
concerning non-variant sites)

My gut feel is that these concerns, while theoretically well-founded, are practically irrelevant - we simply need to engineer our knowledge systems so that we do have 'complete' variant information, and some awareness that any individual call (either variant or non-variant) may be wrong. 'complete' may have a rather parochial system-specific meaning ...

Without the defaults, and the diffs and all the rest, the storage and query tractability issues appear overwhelming .... and so there isn't really any practical choice here.

phases of analysis

Analysis of the raw experiments in sequencing machines takes place in phases; and each phase does in practice need to assume the results of the previous phase; with some awareness of the shades of grey in such assumptions. Each phase essentially passes only 'output' to the next stage, and we cannot, in practice, forever return to the raw data to justify every step at every stage.



practical ? proposal for representing an assembly of a patient's sample


_:sample eg:sampleFromFile   <ftp://example.org/mypatientsample.vcf> .

# metadata headers from VCF file, cleaned up somewhat
<ftp://example.org/mypatientsample.vcf> vcf:reference <http://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.17/>
<ftp://example.org/mypatientsample.vcf> vcf:fileDate "2012-06-26"^^xsd:date .

# each row of the data of the VCF file becomes something like

_:sample eg:hasVariant [

   eg:aboutGenomePosition [
# we use a restricted vocabulary of chromosome names
       eg:chromosome eg:chrM ;
       eg:startPosition "5000"^^xsd:int ;
       eg:endPosition "5003"^^xsd:int ;
       eg:referenceSequence _:ref5000 ;
       eg:alternateSequence _:alt5000
 # more stuff from ID, ALT, QUAL, FILTER and INFO fields of VCF
   ]

# some mapping of the per-sample field
# e.g. in 1000 genomes data FORMAT=GT:DS:GL 1|0:1.000:-1.69,-0.01,-5.00
# the 1|0 is a phased genotype call
   eg:GT [
      eg:phase _:p1 ;
      eg:gtCall _:alt5000 ;
   ]
   eg:GT [
      eg:phase _:p2 ;
      eg:gtCall _:ref5000 ;
   ]
].
_:ref5000 eg:sequence "ACTG" .
_:alt5000 eg:sequence "A" .



Hmmmm, there are a lot of modeling questions in there. The VCF file format has some answers, but not very good ones, partly because the questions do not appear to have been asked as modeling questions.
It seems pretty unclear to me how to include the GL (Genotype Likelihood) values in there. I think these are used to help make the genotype call; and then kept around in case you don't like the call.
The phasing also seems problematic, since it seems that it is generally useful information as to which strand which allele was seen on, (for example for hapliotype identification) but in practice we can't trace a strand all the way through a chromosome.

Further the genotype call may be phased (ordered with respect to genotype calls at at least one other position), or unphased (i.e. an unordered pair); and the two values may be the same or different - the best way to model that is ??? All my ideas seem at least a little awkward.

Or would it be better just to dump this stuff in an RDB, and be done with it.

Jeremy
Received on Tuesday, 26 March 2013 21:45:08 UTC