Re: 'Variants' and Chromosome Modelling

On Thu, Mar 21, 2013 at 6:00 PM, Jeremy J Carroll <jjc@syapse.com> wrote:

>
> Jerven suggests:
>
> "instead of saying chrM it would have been solved by
> using
> http://my.lab.org/confidential/patientXXYYZZ/genome/sampleXX/ChrM/assemblyTTv43/VariantCalls5
> "
>
> rather than continuing the philosophical/theological threads ….
> I am interested in this practical question.
>
>
> *chrM as an address*
>
> I am wanting to represent bases on chrM, how should do I do this?
>
> My current intent is to continue with the model and the ontology implicit
> in the VCF format (1000 genomes) and make somewhat more explicit.
>
> In this model "chrM: 5000 - 5003" identifies 4 bases (inclusive end point)
> in the mitochondrial DNA in some reference assembly …. if I have understood
> correctly, and the items of interest to be modeled are variations against
> that reference assembly. In this model, we may choose to use an address
> like "chrM: 5000 - 5003" to identify some part of a reference assembly from
> which the current experimental assembly differs.
>
> In this way of thinking, I am not really interested in an assembly of ChrM
> for patient XXYZZ's sampleXX, and so Jerven URI to refer to that is not so
> useful.
> I guess I am surprised to see Jerven suggesting a URI in which the
> assembly is part of the ChrM rather than the other way round.
>
>
> *variants, defaults, non-monotonic reasoning*
>
> Part of my problem here is to do with defaults and diffs and knowledge and
> modeling ….
> In general, the smart money likes monotonic reasoning as opposed to
> non-monotonic reasoning; because of reasoning tractability issues.
> Defaults, diffs, variants, all tend to non monotonic reasoning, or closed
> world assumptions or … since if I have not been told that a particular
> sample's assembly has a variant from the reference assembly at a particular
> position then I effectively assume that the base in my sample's assembly is
> the same as the base in the reference assembly. In practice this is then an
> issue when the quality of the non-variant call is questionable. (see
> https://sites.google.com/site/gvcftools/home/about-gvcf
> concerning non-variant sites)
>
> My gut feel is that these concerns, while theoretically well-founded, are
> practically irrelevant - we simply need to engineer our knowledge systems
> so that we do have 'complete' variant information, and some awareness that
> any individual call (either variant or non-variant) may be wrong.
> 'complete' may have a rather parochial system-specific meaning ...
>
> Without the defaults, and the diffs and all the rest, the storage and
> query tractability issues appear overwhelming …. and so there isn't really
> any practical choice here.
>
> *phases of analysis*
>
> Analysis of the raw experiments in sequencing machines takes place in
> phases; and each phase does in practice need to assume the results of the
> previous phase; with some awareness of the shades of grey in such
> assumptions. Each phase essentially passes only 'output' to the next stage,
> and we cannot, in practice, forever return to the raw data to justify every
> step at every stage.
>
>
>
> *practical ? proposal for representing an assembly of a patient's sample*
>
>
> _:sample eg:sampleFromFile   <ftp://example.org/mypatientsample.vcf> .
>
> # metadata headers from VCF file, cleaned up somewhat
> <ftp://example.org/mypatientsample.vcf> vcf:reference <
> http://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.17/>
> <ftp://example.org/mypatientsample.vcf> vcf:fileDate "2012-06-26"^^xsd:date
> .
>
> # each row of the data of the VCF file becomes something like
>
> _:sample eg:hasVariant [
>
> // linked data principle: no blank nodes

 a :Variant;  // type your data


>    eg:aboutGenomePosition [
>
why add this relation? why not link the data to the :Variant object
directly?

# we use a restricted vocabulary of chromosome names
>        eg:chromosome eg:chrM ;
>        eg:startPosition "5000"^^xsd:int ;
>        eg:endPosition "5003"^^xsd:int ;
>        eg:referenceSequence _:ref5000 ;
>        eg:alternateSequence _:alt5000
>
fine.


>  # more stuff from ID, ALT, QUAL, FILTER and INFO fields of VCF
>
ok, fill in.


>    ]
>
> # some mapping of the per-sample field
> # e.g. in 1000 genomes data FORMAT=GT:DS:GL 1|0:1.000:-1.69,-0.01,-5.00
> # the 1|0 is a phased genotype call
>    eg:GT [
>
      eg:phase _:p1 ;
>       eg:gtCall _:alt5000 ;
>    ]
>    eg:GT [
>       eg:phase _:p2 ;
>       eg:gtCall _:ref5000 ;
>    ]
> ].
> _:ref5000 eg:sequence "ACTG" .
> _:alt5000 eg:sequence "A" .
>
>
>
why not put the genotype call data on the sequence data?

:ref5000
  a :Sequence;
  rdf:value "ACTG";
  :genotype-call :callx .

:callx
  a :Genotype-Call;
  :phase :p2;

..

m.



>
> Hmmmm, there are a lot of modeling questions in there. The VCF file format
> has some answers, but not very good ones, partly because the questions do
> not appear to have been asked as modeling questions.
> It seems pretty unclear to me how to include the GL (Genotype Likelihood)
> values in there. I think these are used to help make the genotype call; and
> then kept around in case you don't like the call.
> The phasing also seems problematic, since it seems that it is generally
> useful information as to which strand which allele was seen on, (for
> example for hapliotype identification) but in practice we can't trace a
> strand all the way through a chromosome.
>
> Further the genotype call may be phased (ordered with respect to genotype
> calls at at least one other position), or unphased (i.e. an unordered
> pair); and the two values may be the same or different - the best way to
> model that is ??? All my ideas seem at least a little awkward.
>
> Or would it be better just to dump this stuff in an RDB, and be done with
> it.
>
> Jeremy
>
>
>
>
>
>


-- 
Michel Dumontier
Associate Professor of Bioinformatics, Carleton University
Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
http://dumontierlab.com

Received on Thursday, 21 March 2013 22:32:21 UTC