RE: VCF and RDF, at Clinical Pharmacogenomics TF, Wed Apr 3rd

hi jeremy,



i think the answer depends on where your company sits in the processing of
the data as a middleware provider.  i didn't get a clear idea from your
website, probably because the services you provide vary by contracting
company.



with just the raw variant calls, i don't think there is enough to work
with.  in the best of all possible worlds, yes, it would be nice to turn it
into RDF and run a reasoner against other interesting datastores like
uniprot, entrez gene and the RDF stores they link out to, but at several
million variants i don't think that will be effective.



RDF can still be useful at a higher level, in tracking all the pieces of a
contract: stage of research, current results, internal and external
resources (like people, and data-mined papers of interest linked to the
hypothesis), and so on.



so i think where RDF becomes useful here is after some initial analysis
has been performed to, as i said, discover some variants of interest to
concentrate on; the results of that analysis can then be incorporated into
the RDF and joined with the linked data cloud.
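
just to make that concrete, here's a rough sketch (python with rdflib; the
URIs and property names are made up for illustration, not from any real
schema) of what incorporating an analysis result into RDF and linking it
out might look like:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/study/")             # hypothetical project namespace
UNIPROT = Namespace("http://purl.uniprot.org/uniprot/")

g = Graph()
variant = EX["variant/chr7-140453136-A-T"]              # a hypothetical variant of interest
g.add((variant, RDF.type, EX.VariantOfInterest))
g.add((variant, EX.foundBy, EX["analysis/assoc-run-42"]))  # which analysis surfaced it
g.add((variant, EX.pValue, Literal(3.2e-8)))
g.add((variant, EX.affectsProtein, UNIPROT["P15056"]))  # link out to uniprot (BRAF, as an example)

print(g.serialize(format="turtle"))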



i suspect that, in what your company does, there is a sweet spot to jump
in, whether your company gets the partly analyzed data from the contracting
company or does the analysis itself.



oh, and by the way, as far as quality control information goes, i'm not
sure it is worth including in the RDF.  i've mostly seen it used to either
include or exclude a variant from analysis.  after the analysis it becomes
even less useful, since the analysis will not have included any variants
that didn't pass QC.



cheers,

michael



*From:* Jeremy J Carroll [mailto:jjc@syapse.com]
*Sent:* Wednesday, April 03, 2013 3:39 PM
*To:* Michael Miller
*Cc:* Kingsley Idehen; HCLS hcls; Chris Mungall
*Subject:* Re: VCF and RDF, at Clinical Pharmacogenomics TF, Wed Apr 3rd





Thanks Michael



This is helpful; we are thinking about various ways of selecting what gets
mapped …



While it is blindingly obvious to anyone looking at a VCF file that "the
vast majority are going to turn out to be uninteresting", the difficulty I
am having is whether we (as essentially a middleware provider) can make the
call as to which are the boring bits, or whether that will vary wildly from
person to person.



I mapped the whole of my initial sample files but then threw away the
results.

Motives:

a) get a feel for the size of the problem

b) check that there wasn't some input halfway through that I hadn't coded
for

c) make sure that I had some understanding of all the data



Jeremy J Carroll

Principal Architect

Syapse, Inc.







On Apr 3, 2013, at 1:20 PM, Michael Miller <Michael.Miller@systemsbiology.org> wrote:



hi jeremy,



sorry i missed your talk this morning, it was early on the west coast.



'I am assuming that the data lines can largely be addressed by reading the
INFO, FORMAT and FILTER definitions at the top of the file, and handling
appropriately'



yes, certainly for the standard annotations, but providers aren't bound by
what's in the spec for INFO, and the FILTERing can vary, so between two
different providers the info tags provided could be entirely different,
which is allowed by the VCF 4.1 spec.  some of those tags may be equivalent
between the two providers and so might be mapped to the same RDF; others
not, and the mapping would be specific to one provider or the other.  i
just find that it keeps my code saner to have a subclass per provider: the
base class still takes care of the majority of the parsing and RDF
production, but the subclasses do provider-specific tasks.
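
a minimal sketch of that split (python; the class names, record layout and
triples are illustrative, not my actual code):

class VcfToRdf:
    """base class: generic VCF parsing and RDF production shared by all providers."""

    def process_record(self, record):
        triples = self.map_standard_fields(record)
        triples += self.map_provider_info(record["INFO"])  # provider-specific hook
        return triples

    def map_standard_fields(self, record):
        # shared mapping of CHROM, POS, REF, ALT, ... (abbreviated here)
        return [(record["CHROM"], "position", record["POS"])]

    def map_provider_info(self, info):
        return []  # default: ignore tags the base class doesn't know about


class ProviderAVcfToRdf(VcfToRdf):
    """per-provider subclass: handles one provider's non-standard INFO tags."""

    def map_provider_info(self, info):
        triples = []
        if "SAO" in info:  # a dbSNP-style tag this provider emits, as an example
            triples.append(("variant", "alleleOrigin", info["SAO"]))
        return triples


rec = {"CHROM": "7", "POS": 140453136, "INFO": {"SAO": "1"}}
print(ProviderAVcfToRdf().process_record(rec))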



i'm curious why you are interested in mapping all the variants in the VCF
file; the vast majority are going to turn out to be uninteresting.  of
course, the trick is how one determines this.  we primarily filter first to
protein-coding variants (yes, ENCODE has 'shown' that 90+% of the genome is
coding in some sense, but...), which cuts the variants down to a couple
hundred thousand from seven and a half million.  one can even then decide
to stick with the variants associated with the top hits from whatever uni-
or multi-variant analysis is performed.  those variants can then form a
basis for going back to the filtered variants to look for similarity.
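
in code, that first pass is nothing fancier than this (python; it assumes,
purely for illustration, that an upstream annotator has stamped each
record's INFO with an effect tag — the tag name and vocabulary here are
made up):

CODING_EFFECTS = {"missense", "synonymous", "nonsense", "frameshift"}

def coding_variants(records):
    """yield only records annotated as protein-coding; everything else is dropped."""
    for rec in records:
        if rec["INFO"].get("EFFECT") in CODING_EFFECTS:
            yield rec

that's the step that takes the seven and a half million variants down to a
couple hundred thousand.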



cheers,

michael



Michael Miller

Software Engineer

Institute for Systems Biology





*From:* Jeremy J Carroll [mailto:jjc@syapse.com]
*Sent:* Tuesday, April 02, 2013 2:22 PM
*To:* Michael Miller
*Cc:* Kingsley Idehen; HCLS hcls; Chris Mungall
*Subject:* Re: VCF and RDF, at Clinical Pharmacogenomics TF, Wed Apr 3rd





Hi Michael



I am curious … about "use a base class to handle most of the parsing then
have a derived class per provider and try to drive everything through a
configuration file"



I have not yet seen enough variety of VCF files, but so far I am assuming
that the data lines can largely be addressed by reading the INFO, FORMAT
and FILTER definitions at the top of the file, and handling appropriately.

There still seems to be a little way to go before that is fully
automatable, because of overly smart definitions such as:



##INFO=<ID=SAO,Number=1,Type=Integer,Description="Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both">

##INFO=<ID=EA_AC,Number=.,Type=String,Description="European American Allele Count in the order of AltAlleles,RefAllele. For INDELs, A1, A2, or An refers to the N-th alternate allele while R refers to the reference allele.">

##INFO=<ID=MAF,Number=.,Type=String,Description="Minor Allele Frequency in percent in the order of EA,AA,All">

##INFO=<ID=GTS,Number=.,Type=String,Description="Observed Genotypes. For INDELs, A1, A2, or An refers to the N-th alternate allele while R refers to the reference allele.">

##INFO=<ID=EA_GTC,Number=.,Type=String,Description="European American Genotype Counts in the order of listed GTS">



which introduce non-standard parsing rules.



I am hoping that there are relatively few of them, and that one can do a
reasonable job of recognizing key phrases: the enum in the first seems
something which could be automatically detected, and the
AltAlleles,RefAllele idiom seems common enough to recognize as well.
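
For what it's worth, here is a rough sketch of the kind of recognizer I
have in mind (Python; the regexes are illustrative and certainly not a
complete treatment of the VCF 4.1 header grammar):

import re

# parse the fixed fields of an ##INFO header line
INFO_RE = re.compile(r'##INFO=<ID=([^,]+),Number=([^,]+),Type=([^,]+),'
                     r'Description="([^"]*)">')
# spot the "0 - unspecified, 1 - Germline, ..." enum idiom in a Description
ENUM_RE = re.compile(r'(\d+)\s*-\s*([^,">]+)')

def parse_info_header(line):
    m = INFO_RE.match(line)
    if not m:
        return None
    id_, number, type_, desc = m.groups()
    enum = dict(ENUM_RE.findall(desc)) if type_ == "Integer" else None
    return {"id": id_, "number": number, "type": type_,
            "description": desc, "enum": enum}

line = ('##INFO=<ID=SAO,Number=1,Type=Integer,Description="Variant Allele '
        'Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both">')
print(parse_info_header(line)["enum"])
# {'0': 'unspecified', '1': 'Germline', '2': 'Somatic', '3': 'Both'}

The AltAlleles,RefAllele idiom would want a similar phrase-level pattern.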



To me it seems that the intent is that the header declarations give you
enough to understand the content, and I imagine that in practice there is
sufficiently little variation for that to be automatable ...





Jeremy J Carroll

Principal Architect

Syapse, Inc.







On Apr 1, 2013, at 4:24 PM, Michael Miller <Michael.Miller@systemsbiology.org> wrote:



hi jeremy,

unfortunately, it isn't just the metadata that can be different between
different producers.

for a project i'm working on, similar in scope to 1000 genomes, variants
at the same location for different subjects are collapsed into one VCF
record, and there is a list of possible alleles, one per collapsed variant.
i'm not working directly with the VCF, but there is also an annotation
field that tells whether the variant is in a coding region, in an exon or
an intron, or outside a coding region, and whether the change makes a
difference in the amino acid; all of which depends on which isoforms are
possible.  even tho it is reasonably documented, there still are gotchas.
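
as a concrete illustration of the collapsing (python; the field layout is
simplified, not the project's actual files), each subject's genotype
indexes point back into [REF] + the comma-separated ALT list:

def alleles_per_subject(ref, alt_field, genotypes):
    """map each subject's GT string (e.g. '0/1') back to concrete alleles."""
    alleles = [ref] + alt_field.split(",")  # index 0 = REF, 1..n = the collapsed ALTs
    out = {}
    for subject, gt in genotypes.items():
        out[subject] = [alleles[int(i)] for i in gt.replace("|", "/").split("/")]
    return out

print(alleles_per_subject("A", "T,G", {"s1": "0/1", "s2": "1/2"}))
# {'s1': ['A', 'T'], 's2': ['T', 'G']}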

the one thing i have found, tho, is that one can base one's mapping on the
source of the document.  two vcf's from the same provider will likely be
parsable by the same code.  so i usually do what most would do: use a base
class to handle most of the parsing, then have a derived class per
provider, and try to drive everything through a configuration file.

cheers,
michael

Michael Miller
Software Engineer
Institute for Systems Biology

-----Original Message-----
From: Jeremy J Carroll [mailto:jjc@syapse.com]
Sent: Monday, April 01, 2013 2:02 PM
To: Chris Mungall
Cc: Kingsley Idehen; HCLS hcls
Subject: Re: VCF and RDF, at Clinical Pharmacogenomics TF, Wed Apr 3rd


This looks really helpful ..

I suspect that the key questions remain about how to effectively use the
data.

FALDO seems at first glance to be fairly simplistic compared with the model
implicit in VCF.
Maybe FALDO is actually more useful, concentrating on the important stuff
...

thanks


Jeremy J Carroll
Principal Architect
Syapse, Inc.



On Apr 1, 2013, at 12:40 PM, Chris Mungall <cjmungall@lbl.gov> wrote:


Apologies if this has been covered already, I haven't been following the
whole discussion.

Genome variant data is just a subset of genome data. My understanding is
that the semweb BioHackathon group looked at a variety of different kinds
of genomic data and came up with FALDO[1]. This model looks pretty good
to me, and importantly there is a converter from GFF3[2,3]. Of all the
commonly used genome feature formats out there, GFF3 is by far the best at
encouraging provision of relevant metadata using standard
ontologies/terminologies.


VCF is convertible to GVF[4,5], which is a subset of GFF3 with additional
recommended metadata. It's supported by Ensembl, dbGaP and others, and
the 1000genomes data is available in GVF[6].

As GFF3 is convertible to RDF/OWL that uses FALDO and SO, it follows that
GVF is too (though the converter may need tweaking to take advantage of
the additional GVF metadata).
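
For concreteness, here is a sketch (Python with rdflib; the URIs other than
the FALDO namespace are illustrative) of the kind of FALDO triples such a
converter emits for a single-base variant location:

from rdflib import BNode, Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

FALDO = Namespace("http://biohackathon.org/resource/faldo#")
EX = Namespace("http://example.org/")  # hypothetical

g = Graph()
loc, pos = EX["snv1/location"], BNode()
g.add((loc, RDF.type, FALDO.Region))
g.add((loc, FALDO.begin, pos))
g.add((loc, FALDO.end, pos))  # a 1-bp SNV: begin and end coincide
g.add((pos, RDF.type, FALDO.ExactPosition))
g.add((pos, FALDO.position, Literal(140453136, datatype=XSD.integer)))
g.add((pos, FALDO.reference, EX["GRCh37/chr7"]))

print(g.serialize(format="turtle"))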


I just wanted to make sure you were aware of all this previous work before
reinventing anything.


[1] https://github.com/JervenBolleman/FALDO
[2] http://www.sequenceontology.org/gff3.shtml
[3] https://code.google.com/p/gff3-to-owl/
[4] http://www.ncbi.nlm.nih.gov/pubmed/20796305 - A standard variation
file format for human genome sequences - Reese et al.

[5] http://www.sequenceontology.org/resources/gvf.html
[6] ftp://ftp.ensembl.org/pub/current_variation/gvf/homo_sapiens/

On Apr 1, 2013, at 10:59 AM, Jeremy J Carroll wrote:

Hi Kingsley,

I wasn't going to but since you ask:

http://www.slideshare.net/JeremyJCarroll/vcf-and-rdf

or

http://lists.w3.org/Archives/Public/www-archive/2013Apr/att-0002/W3C-JJC-LifeSci.pdf



Jeremy J Carroll
Principal Architect
Syapse, Inc.



On Apr 1, 2013, at 10:13 AM, Kingsley Idehen <kidehen@openlinksw.com> wrote:



On 4/1/13 1:05 PM, Jeremy J Carroll wrote:

Hi

I am hoping to present the work I am currently doing on VCF and RDF at the
Clinical Pharmacogenomics TF telecon on Wednesday.


My presentation should cover:

- business background, Syapse Discovery
- some background on VCF as a knowledge representation format
- and some initial results on mapping 1000 genomes into RDF

I will circulate slides shortly


Jeremy J Carroll
Principal Architect
Syapse, Inc.





Hopefully you'll publish to Slideshare?

--

Regards,

Kingsley Idehen
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca handle: @kidehen
Google+ Profile: https://plus.google.com/112399767740508618350/about

LinkedIn Profile: http://www.linkedin.com/in/kidehen
