- From: Michel Dumontier <michel.dumontier@gmail.com>
- Date: Thu, 21 Mar 2013 18:31:30 -0400
- To: Jeremy J Carroll <jjc@syapse.com>
- Cc: w3c semweb HCLS <public-semweb-lifesci@w3.org>
- Message-ID: <CALcEXf7WZ1=h9DTHaB_G-caxW-mUwbPF5UEL8xvybHUuvgP=PA@mail.gmail.com>
On Thu, Mar 21, 2013 at 6:00 PM, Jeremy J Carroll <jjc@syapse.com> wrote:

> Jerven suggests:
>
> "instead of saying chrM it would have been solved by using
> http://my.lab.org/confidential/patientXXYYZZ/genome/sampleXX/ChrM/assemblyTTv43/VariantCalls5"
>
> rather than continuing the philosophical/theological threads ….
> I am interested in this practical question.
>
> *chrM as an address*
>
> I am wanting to represent bases on chrM, how should I do this?
>
> My current intent is to continue with the model and the ontology implicit
> in the VCF format (1000 genomes) and make them somewhat more explicit.
>
> In this model "chrM: 5000 - 5003" identifies 4 bases (inclusive end point)
> in the mitochondrial DNA in some reference assembly …. if I have understood
> correctly, and the items of interest to be modeled are variations against
> that reference assembly. In this model, we may choose to use an address
> like "chrM: 5000 - 5003" to identify some part of a reference assembly from
> which the current experimental assembly differs.
>
> In this way of thinking, I am not really interested in an assembly of ChrM
> for patient XXYZZ's sampleXX, and so Jerven's URI to refer to that is not so
> useful.
> I guess I am surprised to see Jerven suggesting a URI in which the
> assembly is part of the ChrM rather than the other way round.
>
> *variants, defaults, non-monotonic reasoning*
>
> Part of my problem here is to do with defaults and diffs and knowledge and
> modeling ….
> In general, the smart money likes monotonic reasoning as opposed to
> non-monotonic reasoning, because of reasoning tractability issues.
> Defaults, diffs, variants, all tend to non-monotonic reasoning, or closed
> world assumptions or … since if I have not been told that a particular
> sample's assembly has a variant from the reference assembly at a particular
> position then I effectively assume that the base in my sample's assembly is
> the same as the base in the reference assembly.
> In practice this is then an
> issue when the quality of the non-variant call is questionable. (see
> https://sites.google.com/site/gvcftools/home/about-gvcf
> concerning non-variant sites)
>
> My gut feel is that these concerns, while theoretically well-founded, are
> practically irrelevant - we simply need to engineer our knowledge systems
> so that we do have 'complete' variant information, and some awareness that
> any individual call (either variant or non-variant) may be wrong.
> 'complete' may have a rather parochial system-specific meaning ...
>
> Without the defaults, and the diffs and all the rest, the storage and
> query tractability issues appear overwhelming …. and so there isn't really
> any practical choice here.
>
> *phases of analysis*
>
> Analysis of the raw experiments in sequencing machines takes place in
> phases; and each phase does in practice need to assume the results of the
> previous phase; with some awareness of the shades of grey in such
> assumptions. Each phase essentially passes only 'output' to the next stage,
> and we cannot, in practice, forever return to the raw data to justify every
> step at every stage.
>
> *practical ? proposal for representing an assembly of a patient's sample*
>
> _:sample eg:sampleFromFile <ftp://example.org/mypatientsample.vcf> .
>
> # metadata headers from VCF file, cleaned up somewhat
> <ftp://example.org/mypatientsample.vcf> vcf:reference <http://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.17/> .
> <ftp://example.org/mypatientsample.vcf> vcf:fileDate "2012-06-26"^^xsd:date .
>
> # each row of the data of the VCF file becomes something like
>
> _:sample eg:hasVariant [

// linked data principle: no blank nodes
a :Variant; // type your data

> eg:aboutGenomePosition [

why add this relation? why not link the data to the :Variant object directly?
> # we use a restricted vocabulary of chromosome names
> eg:chromosome eg:chrM ;
> eg:startPosition "5000"^^xsd:int ;
> eg:endPosition "5003"^^xsd:int ;
> eg:referenceSequence _:ref5000 ;
> eg:alternateSequence _:alt5000

fine.

> # more stuff from ID, ALT, QUAL, FILTER and INFO fields of VCF

ok, fill in.

> ]
>
> # some mapping of the per-sample field
> # e.g. in 1000 genomes data FORMAT=GT:DS:GL 1|0:1.000:-1.69,-0.01,-5.00
> # the 1|0 is a phased genotype call
> eg:GT [
> eg:phase _:p1 ;
> eg:gtCall _:alt5000 ;
> ]
> eg:GT [
> eg:phase _:p2 ;
> eg:gtCall _:ref5000 ;
> ]
> ].
> _:ref5000 eg:sequence "ACTG" .
> _:alt5000 eg:sequence "A" .

why not put the genotype call data on the sequence data?

:ref5000 a :Sequence;
  rdf:value "ACTG";
  :genotype-call :callx .
:callx a :Genotype-Call;
  :phase :p2;
  ..

m.

> Hmmmm, there are a lot of modeling questions in there. The VCF file format
> has some answers, but not very good ones, partly because the questions do
> not appear to have been asked as modeling questions.
> It seems pretty unclear to me how to include the GL (Genotype Likelihood)
> values in there. I think these are used to help make the genotype call; and
> then kept around in case you don't like the call.
> The phasing also seems problematic, since it seems that it is generally
> useful information as to which strand which allele was seen on, (for
> example for haplotype identification) but in practice we can't trace a
> strand all the way through a chromosome.
>
> Further the genotype call may be phased (ordered with respect to genotype
> calls at at least one other position), or unphased (i.e. an unordered
> pair); and the two values may be the same or different - the best way to
> model that is ??? All my ideas seem at least a little awkward.
>
> Or would it be better just to dump this stuff in an RDB, and be done with
> it.
>
> Jeremy

--
Michel Dumontier
Associate Professor of Bioinformatics, Carleton University
Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
http://dumontierlab.com
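
For readers following the thread, here is a minimal Python sketch of two points raised above: that an address like "chrM: 5000 - 5003" with an inclusive end point covers 4 bases, and that a genotype call may be phased (an ordered pair) or unphased (an unordered pair). The names Region, GenotypeCall, parse_region, parse_sample and parse_gt are hypothetical and appear nowhere in the thread or in any VCF library. The sketch assumes the usual VCF convention that "|" separates phased alleles and "/" separates unphased ones, and it does not handle missing alleles ("."). It is one possible reading of the modeling question, not a proposed answer to it.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Region:
    """An address such as "chrM: 5000 - 5003" with an inclusive end point."""
    chromosome: str
    start: int
    end: int

    def __len__(self) -> int:
        # Both end points are included, hence the +1.
        return self.end - self.start + 1


def parse_region(address: str) -> Region:
    """Parse "chrM: 5000 - 5003" into a Region (hypothetical helper)."""
    chromosome, span = address.split(":")
    start, end = (int(part) for part in span.split("-"))
    return Region(chromosome.strip(), start, end)


def parse_sample(format_keys: str, sample: str) -> Dict[str, str]:
    """Pair the FORMAT keys (e.g. "GT:DS:GL") with one per-sample value string."""
    return dict(zip(format_keys.split(":"), sample.split(":")))


@dataclass(frozen=True)
class GenotypeCall:
    """Allele indexes (0 = reference, 1 = first alternate) plus a phasing flag."""
    alleles: Tuple[int, ...]
    phased: bool


def parse_gt(gt: str) -> GenotypeCall:
    """Parse a GT value; unphased calls are sorted so that order carries no meaning."""
    phased = "|" in gt
    indexes = tuple(int(a) for a in gt.split("|" if phased else "/"))
    return GenotypeCall(indexes if phased else tuple(sorted(indexes)), phased)


region = parse_region("chrM: 5000 - 5003")
fields = parse_sample("GT:DS:GL", "1|0:1.000:-1.69,-0.01,-5.00")
call = parse_gt(fields["GT"])
print(len(region), call)
```

With this representation, parse_gt("1/0") and parse_gt("0/1") compare equal, while parse_gt("1|0") and parse_gt("0|1") do not. That is the ordered versus unordered distinction Jeremy describes. How to carry the GL values alongside the call is left open here, as it is in the thread.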
Received on Thursday, 21 March 2013 22:32:21 UTC