Re: VCF and RDF, at Clinical Pharmacogenomics TF, Wed Apr 3rd from Jeremy J Carroll on 2013-04-02 (public-semweb-lifesci@w3.org from April 2013)

From: Jeremy J Carroll <jjc@syapse.com>
Date: Tue, 2 Apr 2013 14:21:33 -0700
To: Michael Miller <Michael.Miller@systemsbiology.org>
Cc: Kingsley Idehen <kidehen@openlinksw.com>, HCLS hcls <public-semweb-lifesci@w3.org>, Chris Mungall <cjmungall@lbl.gov>
Message-Id: <5C98FB9C-0E7E-4412-A1D6-2318D512F55D@syapse.com>

Hi Michael

I am curious … about "use a base
class to handle most of the parsing then have a derived class per provider
and try to drive everything through a configuration file"

I have not yet seen enough variety of VCF files, but so far I am assuming that the data lines can largely be handled by reading the INFO, FORMAT and FILTER definitions at the top of the file and interpreting each field according to its declaration.
There still seems a little way to go before that is fully automatable, because of overly smart definitions such as:

##INFO=<ID=SAO,Number=1,Type=Integer,Description="Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both">

##INFO=<ID=EA_AC,Number=.,Type=String,Description="European American Allele Count in the order of AltAlleles,RefAllele. For INDELs, A1, A2, or An refers to the N-th alternate allele while R refers to the reference allele.">

##INFO=<ID=MAF,Number=.,Type=String,Description="Minor Allele Frequency in percent in the order of EA,AA,All">
##INFO=<ID=GTS,Number=.,Type=String,Description="Observed Genotypes. For INDELs, A1, A2, or An refers to the N-th alternate allele while R refers to the reference allele.">
##INFO=<ID=EA_GTC,Number=.,Type=String,Description="European American Genotype Counts in the order of listed GTS">

which introduce non-standard parsing rules. 
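
As a rough illustration of the header-driven approach (Python, not code from this thread; the names parse_info_header and parse_info_column are hypothetical, and only the ##INFO line syntax is taken from the examples above), one could read the declarations into a dictionary and use Type and Number to decode the INFO column of each data line:

```python
import re

# Hypothetical helpers; only the ##INFO=<key=value,...> syntax comes from VCF.
_INFO_RE = re.compile(r'^##INFO=<(.*)>\s*$')
# A value is either a double-quoted string or a run of non-comma characters.
_PAIR_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')

_CASTS = {"Integer": int, "Float": float, "String": str, "Character": str}


def parse_info_header(line):
    """Return the key=value pairs of one ##INFO line as a dict, or None."""
    m = _INFO_RE.match(line)
    if m is None:
        return None
    fields = {}
    for key, value in _PAIR_RE.findall(m.group(1)):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]  # strip the surrounding quotes
        fields[key] = value
    return fields


def parse_info_column(column, definitions):
    """Decode the INFO column of one data line using the header definitions."""
    result = {}
    if column == ".":
        return result
    for item in column.split(";"):
        key, sep, raw = item.partition("=")
        definition = definitions.get(key, {})
        if not sep or definition.get("Type") == "Flag":
            result[key] = True  # flags carry no value
            continue
        cast = _CASTS.get(definition.get("Type"), str)
        values = [None if v == "." else cast(v) for v in raw.split(",")]
        # Number=1 means a single value; anything else stays a list.
        result[key] = values[0] if definition.get("Number") == "1" else values
    return result
```

With the SAO and MAF declarations above, a made-up INFO column such as "SAO=1;MAF=0.1,2.5,0.9" would decode SAO to the integer 1 but leave MAF as a list of strings, because the header declares it as Type=String. The meaning of the individual positions is stated only in the Description text, which is where the non-standard rules come in.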

I am hoping that there are relatively few of them, and that one can do a reasonable job of recognizing key phrases: the enum in the first seems like something which could be automatically detected, and the AltAlleles,RefAllele idiom seems

To me it seems that the intent is that the header declarations give you enough to understand the content, and I imagine that there is in practice sufficiently little variation for that to be automatable ...
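
To make the idea of recognizing such phrases concrete, here is a minimal Python sketch (illustrative only, not code from this thread). The function names detect_enum and label_allele_counts are hypothetical, and the regular expression is a guess that fits the SAO description quoted above; other producers may phrase their enums differently.

```python
import re

# Matches items such as "0 - unspecified" inside a Description string.
_ENUM_ITEM_RE = re.compile(r'(\d+)\s*-\s*([^,]+)')


def detect_enum(description):
    """Guess an integer-to-label mapping from 'N - label, N - label' text."""
    # Prefer the text after the first colon, if there is one.
    tail = description.partition(":")[2] or description
    mapping = {int(code): label.strip()
               for code, label in _ENUM_ITEM_RE.findall(tail)}
    # A single match is more likely a coincidence than an enum.
    return mapping if len(mapping) >= 2 else None


def label_allele_counts(raw, ref, alts):
    """Pair counts given 'in the order of AltAlleles,RefAllele' with alleles."""
    counts = [int(v) for v in raw.split(",")]
    alleles = list(alts) + [ref]  # alternates first, reference last
    if len(counts) != len(alleles):
        raise ValueError("count list does not match the number of alleles")
    return dict(zip(alleles, counts))
```

For the SAO description, detect_enum would yield a mapping from 0..3 to "unspecified", "Germline", "Somatic" and "Both", while it returns None for the EA_AC description. label_allele_counts covers only the simple case where the values are plain counts in the stated order.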


Jeremy J Carroll
Principal Architect
Syapse, Inc.



On Apr 1, 2013, at 4:24 PM, Michael Miller <Michael.Miller@systemsbiology.org> wrote:

> hi jeremy,
> 
> unfortunately, it isn't just the metadata that can be different between
> different producers.
> 
> for a project i'm working on, similar in scope to 1000 genomes, variants
> at the same location for different subjects are collapsed into one VCF
> record and there is a list of possible alleles, one per collapsed variant.
> i'm not working directly with the VCF but there is also an annotation
> field to tell whether the variant is in a coding region, an exon or an
> intron, or outside a coding region, whether the change makes a difference
> in the amino acid, plus that all depends on which isoforms are possible.
> even tho it is reasonably documented, there still are gotchas.
> 
> the one thing i have found tho, is that one can base one's mapping on the
> source of the document.  two vcf's from the same provider will likely be
> parsable by the same code.  so i usually do what most would do, use a base
> class to handle most of the parsing then have a derived class per provider
> and try to drive everything through a configuration file.
> 
> cheers,
> michael
> 
> Michael Miller
> Software Engineer
> Institute for Systems Biology
> 
>> -----Original Message-----
>> From: Jeremy J Carroll [mailto:jjc@syapse.com]
>> Sent: Monday, April 01, 2013 2:02 PM
>> To: Chris Mungall
>> Cc: Kingsley Idehen; HCLS hcls
>> Subject: Re: VCF and RDF, at Clinical Pharmacogenomics TF, Wed Apr 3rd
>> 
>> 
>> This looks really helpful ..
>> 
>> I suspect that the key questions remain about how to effectively use the
>> data
>> 
>> FALDO seems at first glance to be fairly simplistic compared with the model
>> implicit in VCF.
>> Maybe FALDO is actually more useful, concentrating on the important stuff
>> ...
>> 
>> thanks
>> 
>> 
>> Jeremy J Carroll
>> Principal Architect
>> Syapse, Inc.
>> 
>> 
>> 
>> On Apr 1, 2013, at 12:40 PM, Chris Mungall <cjmungall@lbl.gov> wrote:
>> 
>>> 
>>> Apologies if this has been covered already; I haven't been following the
>>> whole discussion.
>>> 
>>> Genome variant data is just a subset of genome data. My understanding is
>>> that the semweb BioHackathon group looked at a variety of different kinds
>>> of genomic data and came up with FALDO[1]. This model looks pretty good
>>> to me, and importantly there is a converter from GFF3[2,3]. Of all the
>>> commonly used genome feature formats out there, GFF3 is by far the best at
>>> encouraging provision of relevant metadata using standard
>>> ontologies/terminologies.
>>> 
>>> VCF is convertible to GVF[4,5] which is a subset of GFF3 with additional
>>> recommended metadata. It's supported by Ensembl, gbGap and others, and
>>> the 1000genomes data is available in GVF[6].
>>> 
>>> As GFF3 is convertible to RDF/OWL that uses FALDO and SO, it follows that
>>> GVF is too (though the converter may need tweaking to take advantage of
>>> the additional GVF metadata).
>>> 
>>> I just wanted to make sure you were aware of all this previous work
>>> before reinventing anything.
>>> 
>>> [1] https://github.com/JervenBolleman/FALDO
>>> [2] http://www.sequenceontology.org/gff3.shtml
>>> [3] https://code.google.com/p/gff3-to-owl/
>>> [4] http://www.ncbi.nlm.nih.gov/pubmed/20796305 - A standard variation
>>> file format for human genome sequences - Reese et al.
>>> [5] http://www.sequenceontology.org/resources/gvf.html
>>> [6] ftp://ftp.ensembl.org/pub/current_variation/gvf/homo_sapiens/
>>> 
>>> On Apr 1, 2013, at 10:59 AM, Jeremy J Carroll wrote:
>>> 
>>>> Hi Kingsley,
>>>> 
>>>> I wasn't going to but since you ask:
>>>> 
>>>> http://www.slideshare.net/JeremyJCarroll/vcf-and-rdf
>>>> 
>>>> or
>>>> 
>>>> http://lists.w3.org/Archives/Public/www-archive/2013Apr/att-0002/W3C-JJC-LifeSci.pdf
>>>> 
>>>> 
>>>> Jeremy J Carroll
>>>> Principal Architect
>>>> Syapse, Inc.
>>>> 
>>>> 
>>>> 
>>>> On Apr 1, 2013, at 10:13 AM, Kingsley Idehen <kidehen@openlinksw.com> wrote:
>>>> 
>>>>> On 4/1/13 1:05 PM, Jeremy J Carroll wrote:
>>>>>> Hi
>>>>>> 
>>>>>> I am hoping to present the work I am currently doing on VCF and RDF at
>>>>>> the Clinical Pharmacogenomics TF telecon on Wednesday.
>>>>>> 
>>>>>> My presentation should cover:
>>>>>> 
>>>>>> - business background, Syapse Discovery
>>>>>> - some background on VCF as a knowledge representation format
>>>>>> - and some initial results on mapping 1000 genomes into RDF
>>>>>> 
>>>>>> I will circulate slides shortly
>>>>>> 
>>>>>> 
>>>>>> Jeremy J Carroll
>>>>>> Principal Architect
>>>>>> Syapse, Inc.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> Hopefully you'll publish to Slideshare?
>>>>> 
>>>>> --
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> Kingsley Idehen
>>>>> Founder & CEO
>>>>> OpenLink Software
>>>>> Company Web: http://www.openlinksw.com
>>>>> Personal Weblog: http://www.openlinksw.com/blog/~kidehen
>>>>> Twitter/Identi.ca handle: @kidehen
>>>>> Google+ Profile: https://plus.google.com/112399767740508618350/about
>>>>> LinkedIn Profile: http://www.linkedin.com/in/kidehen
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>
Received on Tuesday, 2 April 2013 21:22:07 UTC