Re: 'Variants' and Chromosome Modelling from Jeremy J Carroll on 2013-03-27 (public-semweb-lifesci@w3.org from March 2013)

From: Jeremy J Carroll <jjc@syapse.com>
Date: Wed, 27 Mar 2013 12:54:52 -0700
To: "Freimuth, Robert R., Ph.D." <Freimuth.Robert@mayo.edu>
Cc: w3c semweb HCLS <public-semweb-lifesci@w3.org>
Message-Id: <3F251C63-292F-481C-8C99-AC74463A03B1@syapse.com>

Hi Bob

I am a message behind in my thinking; thanks for all your input.

My use case is a lot less clear than it could be, in that the contribution I am seeking to make lies somewhere in the middle of the tool-chain: not at the very bottom, but also not right next to the scientists or clinicians working with the sequencing results.
Hence, one of my current goals is to work out what people are actually doing today. For example, with the analysis-step metadata, it seems important enough that some people add it to their files, yet it has so far not proved necessary to automate such tracking; when this provenance is needed, someone casts an expert eye over the metadata.

On Mar 26, 2013, at 10:58 AM, "Freimuth, Robert R., Ph.D." <Freimuth.Robert@mayo.edu> wrote:

> In my opinion, the real potential of using semweb technologies with genetic data is in the layers of interpretation that are built from the genetic sequences. While the underlying genetic sequences can be rebuilt and refined over time, there are plenty of existing tools that can manage this process very efficiently. Our collective knowledge about those sequences, however, advances continuously. Changes in our understanding in one area might cascade into others. We need a way to dynamically update the interpretations and discover novel relationships. Genetic data (in any format) + biological knowledge (in RDF?) + reasoners could be a powerful combination.

On this part, I too have been working from the assumption that this is the key goal: linking the results from the sequencing through to higher and higher level information. At first blush, the keys to linking with data not in the VCF file, or to abducing new hypotheses in some way, seem to be found in the INFO block, or maybe the per-sample block, of each row of the VCF file. This process seems strange, though, as I come across more and more internal structure in the fields and sub-fields, which requires specialized parsing to make sense of, and what seems like somewhat limited tool support for doing so.
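As a minimal sketch of the kind of specialized parsing meant here, the following Python splits an INFO column into its keys and values, and then splits one structured value further. It assumes only the general INFO layout (semicolon-separated entries, `key=value` pairs, bare keys as flags, commas between multiple values). The `XANN` key and its pipe-delimited layout are hypothetical, invented for illustration; real annotation tools each define their own sub-field syntax.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO column into a dict of key -> value(s).

    Flags (keys without '=') map to True; comma-separated values
    become lists; everything else stays a plain string.
    """
    result = {}
    if info in ("", "."):  # '.' marks a missing INFO column
        return result
    for entry in info.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        if not sep:
            result[key] = True  # flag entry, e.g. DB
        else:
            values = value.split(",")
            result[key] = values if len(values) > 1 else values[0]
    return result


def split_subfields(value: str, sep: str = "|") -> list:
    """Split a structured value into sub-fields.

    The separator is an assumption: each annotation source picks its own.
    """
    return value.split(sep)


# XANN is a hypothetical annotation key, not part of any real tool's output.
example = "DP=14;AF=0.5,0.25;DB;XANN=missense|GENE1|c.123A>G"
parsed = parse_info(example)
annotation = split_subfields(parsed["XANN"])
```

The second step is the awkward part: nothing in the file itself says that `XANN` needs a further split on `|`, so that knowledge has to live in the parser.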

It seems that, to some extent, the tools for processing VCF currently use the VCF file as a single star-shaped table for OLAP-type analysis, adding denormalized information from a variety of public and private databases into the INFO field.

Hence I am, at some level, inclined to work on just opening up the data format, essentially by parsing anything with a regular format and mapping everything into triples. But then, of course, there are too many triples: easily trillions, even at moderate sizes.
It feels that without such an opening up, the owners of the knowledge (the scientists and the clinicians) are imprisoned by the owners of the knowledge format (the bioinformaticians); of course, to some extent that is inevitable.
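To make the "map everything into triples" idea concrete, here is a rough Python sketch that turns one tab-separated VCF data row into subject/predicate/object tuples. The `http://example.org/vcf/` namespace, the subject naming scheme, and the predicate names are all assumptions made up for illustration, not an existing vocabulary. Per-sample columns are ignored to keep it short.

```python
from typing import Iterator, Tuple

BASE = "http://example.org/vcf/"  # hypothetical namespace, for illustration only

Triple = Tuple[str, str, str]


def row_to_triples(line: str) -> Iterator[Triple]:
    """Yield (subject, predicate, object) tuples for one VCF data row."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]

    # Assumed subject scheme: one resource per chrom/pos/ref/alt combination.
    subject = f"{BASE}variant/{chrom}-{pos}-{ref}-{alt}"

    # One triple per fixed column of interest.
    for name, value in (("chrom", chrom), ("pos", pos), ("ref", ref), ("alt", alt)):
        yield (subject, BASE + name, value)

    # One triple per INFO value; flags become the literal "true".
    if info != ".":
        for entry in info.split(";"):
            if not entry:
                continue
            key, sep, value = entry.partition("=")
            if not sep:
                yield (subject, BASE + "info/" + key, "true")
            else:
                for item in value.split(","):
                    yield (subject, BASE + "info/" + key, item)


row = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2"
triples = list(row_to_triples(row))
```

Every row yields one triple per fixed column plus one per INFO value, so the total grows with rows times annotations, before any per-sample data is included. That is where the volume problem comes from.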

Jeremy J Carroll
Principal Architect
Syapse, Inc.

Received on Wednesday, 27 March 2013 19:55:23 UTC