- From: Peter Ansell <ansell.peter@gmail.com>
- Date: Fri, 22 Mar 2013 16:43:47 +1000
- To: Alan Ruttenberg <alanruttenberg@gmail.com>
- Cc: Jeremy J Carroll <jjc@syapse.com>, Jerven Bolleman <me@jerven.eu>, Graham Klyne <graham.klyne@zoo.ox.ac.uk>, w3c semweb HCLS <public-semweb-lifesci@w3.org>, Pat Hayes <phayes@ihmc.us>
On 22 March 2013 14:38, Alan Ruttenberg <alanruttenberg@gmail.com> wrote:
> On Thu, Mar 21, 2013 at 7:56 PM, Peter Ansell <ansell.peter@gmail.com>
> wrote:
>> On 22 March 2013 12:05, Alan Ruttenberg <alanruttenberg@gmail.com> wrote:
>> > On Wed, Mar 20, 2013 at 3:15 PM, Jeremy J Carroll <jjc@syapse.com>
>> > wrote:
>> >>
>> >> To me, that seems to lead us back to the earlier discussion (rathole?)
>> >> about owl:sameAs. I tend to a view that there are diminishing returns
>> >> in terms of levels of indirection here!
>> >
>> > As the number of levels of indirection increases, perhaps. But here we
>> > are talking about 1 level - separating claims from truth.
>>
>> The question that scientists spend their lives trying to establish is
>> the one that you seem to think is clearly defined in this statement,
>> i.e., "separating claims from 'truth'". In some domains, such as
>> logic/mathematics, "truth" is easy to define, and that seems to be the
>> basis that the RDF specifications use to justify their semantics.
>> However, in others, such as the life sciences (i.e., the domain of
>> public-semweb-lifesci), at least some of the best information we have
>> is approximate, idealised information that may not exactly match
>> anything at all in reality (e.g., large genome reference assemblies
>> that are statistically modelled from multiple samples but may not
>> actually match base for base with any actual DNA strand in the real
>> world). These approximations are referenced directly by scientists in
>> their publications without them having to qualify every statement as
>> referencing a "claim".
>
> When they need to say it is a claim they do so, either referring to the
> matter in that way or by language signals in their text. In the other
> cases there are different consequences for getting things wrong. If they
> assert something directly and their result depends on it then their
> result will be called into question.
>
> I am not saying that science presented as fact is infallible. Of course
> it isn't. But when we talk about things as fact we tend to back it up by
> implicit agreement to revise if the thing presented as fact is
> determined to be false. I (and Foundry) take this situation as one where
> the ontological commitment is one in which, should we make such
> statements and find them to be wrong, we will fix them. There's lots of
> cases where that isn't the commitment, Jeremy's case being one of them,
> I suspect. And there are plenty of true things (things that no one would
> object to) despite this. That the information is about a DNA sequence,
> that it is about differences between humans, that it is about an amino
> acid change at one place in the molecule, etc.

The trouble is that you are talking about the absolute simplest case, the upper ontology, and at that level it is still difficult to identify whether something is real or whether the papers that referenced it were faked/bogus (in the worst case). At lower levels each sample needs to be given a unique name; I agree totally with that. However, the difficulty is that the sample name apparently cannot migrate through the levels of indecision between there and "upper ontology name agreement" status without going through a wide range of experiments, each of which may in fact be looking at a slightly different gene sequence, while there is no way to disprove, after the fact, the hypothesis that they may all have been looking at exactly the same gene sequence, as the samples may have died or morphed, etc. (i.e., there is no way to go back in time and do the same thing again in more detail).

> Associated with each of the three kinds of logical commitment in those
> slides I wrote what inconsistency means. IMO, if you have some system
> and there's no way for you to be wrong in your usage then it's not
> worth using.
> Please tell me, given your assessment of scientific use of RDF, if
> there is any such use that can be wrong or inconsistent? If you can't
> then I'm guessing we are going to have a continued 'failure to
> communicate'. (attempt at humor, culture specific reference:
> http://en.wikipedia.org/wiki/What_we've_got_here_is_(a)_failure_to_communicate)

:) I have a critical realist philosophy background, so it is quite similar to the background that you seem to be coming from, with one important difference, in my opinion: concepts are free to move both up and down the scale without the community having to universally agree on where they sit on that scale at a particular point in time. That is, not everyone has to accept that a gene sequence is an accurate representation of a particular sample for it to be published and given an official name. In the same way, not everyone has to accept that an official name does not in fact relate to any one instance of a particular organism's genetic material before it is demoted from "official name" status down to merely an observation that was proved to be wrong.

If RDF semantics currently requires that new names be created whenever something moves between these statuses, then it is inefficient at best and troublesome at worst. In any widely distributed system there will always be a lack of clear join points where the change is executed, a database somewhere is modified, and the official definition of the gene is revoked from the shared scientific record. Even after all of the current databases are modified, the name may still exist in other locations that cannot ever be modified because of their shared nature (e.g., attached as evidence to a publication that has been widely distributed).

>>
>> I am not sure why you say that there is only one layer of wrapping
>> needed.
>> I can think of many different situations where someone could have more
>> than one layer of alternative interpretations that they may need to
>> accommodate other scientists now and in the future. The 4 or so layers
>> that the provenance ontology has just for published documents are
>> worrying enough, and they may not be enough to map the complexities of
>> genome reference assemblies, as genomics researchers may have a
>> different "publication" workflow to book publishers.
>
> Since I am not familiar with the PROV model (I tried to read it through
> but got frustrated), please say a little more, and justify why you
> think these "layers" need be represented as levels of indirection
> rather than assertions on a first or second level such as I have
> described.

The layers of indirection in PROV are based around being specific about what your users will want to talk about, while at the same time being specific enough about what you are talking about that others get some idea of the context in which you made the statement. For example, if you are talking about a book, you may not be sure whether a user will want to know exactly which book you were looking at when you made the statement, or no one may ever care or need to know that you were looking at "edition 2, published as a hard cover by a particular printer, with the back page ripped out" when they read your statement. In informal communication (which I understand is not directly going to map to RDF), the context of a statement can be inferred as the conversation moves along. In RDF, however, the context must be fixed, as you say, so that it is completely unambiguous exactly what you were talking about when you made the statement, and so that in future the statement cannot be misinterpreted.
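The one level of indirection being debated, asserting a triple directly versus wrapping it as a claim with explicit context, can be sketched in a few lines. This is a hypothetical illustration with invented names, using plain tuples rather than any real RDF library:

```python
# Hypothetical sketch (invented names, plain tuples rather than a real
# RDF store) of direct assertion versus one level of claim-wrapping.

# Direct assertion: presented as fact, no context attached.
direct = ("gene:BRCA1", "locatedOn", "chromosome:17")

# Claim-wrapped: the same triple becomes data about a claim, carrying
# PROV-style context (who asserted it, from what evidence, and what
# its current status is).
claim = {
    "statement": direct,
    "assertedBy": "agent:ExampleLab",      # hypothetical agent name
    "derivedFrom": "assembly:ExampleRef",  # hypothetical evidence name
    "status": "current-best-theory",
}

def retract(c):
    """Demote a claim without destroying the record that it was made:
    the name survives, only its status changes."""
    demoted = dict(c)
    demoted["status"] = "retracted"
    return demoted

old = retract(claim)
```

The extra link is the `statement` key: a consumer that only wants "facts" must unwrap it, which is exactly the cost being argued over, but in exchange the claim can later be demoted without minting a new name.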
I am not sure what you mean by indirection (misdirection may be a better term ;) just kidding), but to me each of the different levels of granularity will affect the way you are able to make a statement and the way someone else is going to be able to interpret it. The really sad thing (not A Good Thing(tm)) is that talking about editions of books is orders of magnitude easier than talking about unique genomes and the unique biological/chemical environments that affect the way the body interprets its own genome. That, in my opinion, is the reason we are having this discussion. We have to be able to simplify the model so that it suits the needs of scientists without them getting lost in the irrelevant details. Physicists are the best at this, in my opinion: they endlessly search for ways to make variables in equations completely redundant, based on the magnitude of the effects of changing each variable, so that they can move on to make further assertions and then test those theories further.

>>
>> > 2) I think there's a big difference between what one publishes on the
>> > web, and what one uses in the privacy of one's home, so to speak. If
>> > one is publishing on the web, it is good citizenship to respect
>> > specifications, and to consider the impact of one's assertions on the
>> > broader data consumer community. That consideration, IMO, is
>> > justification enough for the 1 extra indirection necessary to not
>> > make statements that are too strong.
>>
>> The specifications seem to be based on premises that practicing
>> scientists may not ever accept, i.e., the idea that there is static
>> scientific "truth" that can be unambiguously and continuously
>> communicated, and not "challengeable current theories" that can be
>> alternatively stated, or gradually or suddenly revoked and replaced
>> with new best theories.
>> Scientists need to be able to interpret, contrast, and concurrently
>> utilise past information directly, without having to suddenly wrap up
>> past "truths" inside "claims" because they may be out of date with
>> something someone else has now put into the RDF-sphere. The whole idea
>> that statements could be "too strong" takes its basis from "static
>> truth", and I cannot personally accept that we need to represent
>> everything for the life sciences inside "claims" (or alternatively
>> have everyone create new URIs for everything they want to talk about)
>> just in case it changes in future, or in case someone would find the
>> statement difficult to deal with because their application relies on
>> a different structure for their queries to work.
>
> No. The specification is based on the premise that if you are going to
> share information there have to at least be some rules. The rules were
> developed by a skilled working group, the semantics were written by an
> expert, and the whole survived what can be a rather brutal W3C approval
> process. There is room in those semantics to express what you want.
> People seem to be annoyed that it takes an extra link, some extra
> thinking, to do that. Tough.

If it were just one extra link then more people might be doing it. What I am trying to get at in this thread is how to override the rules in a systematic manner. Saying that you could not possibly do anything with RDF if you don't follow the rules to the letter is, to me, not realising the power of RDF. The power to mix and match data from the future and the past, manually remap it to your own interpretation, and republish it to generate new results is what scientists do every day. It is not really encouraging to hear that once someone says anything, everyone is bound to it, based on the theory behind the data model we are using.
I am free to mix and match datasets, and to interpret some datasets differently to others, and that is what people will actually do. You may get some stubborn individuals who completely refuse to use a dataset because they found a typo in it, but in general I think the majority of scientists will be happy to mix and match. Reducing the difficulty of mixing and matching would definitely be useful, but it is not necessary in absolutely every case. It is mainly the insistence on the absoluteness of the theory that worries me when I hear people say that if you don't like something you should recreate it from scratch yourself. That is just not how knowledge is created.

It is not just extra thinking and more communication that is necessary. You may be the only person working on a particular thing, and you need to name it before you can talk about it, but your name partially overlaps with someone else's enough for others to start using that name. In general society you are rejected as an outcast if you don't follow the crowd and convert to using their names, but that is not possible in RDF, where you have to be precise about what you are doing. Do you follow, or do you stay truthful to your RDF data model semantics because you personally know you are correct (even if everyone else has yet to realise that)? In most cases you will accept that the common name is different, but similar enough to provide you with your next round of funding if you reuse it, and you never go back to using your initial name again, for practicality reasons. It seems that in RDF everyone will effectively have to become outcasts by recreating their own names if they are in any way, or may ever become, slightly different.

For me it isn't about being perfect in science; rather, it is about accepting imperfection as the norm and wading through things for pearls of knowledge that you can't easily write a rule to find for you, as the Semantic Web was hoped to do for us if we got all of our naming perfect and in synchronization.
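The mixing, matching and manual remapping described above can be sketched in a few lines. The dataset contents and the mapping below are invented purely for illustration; a real workflow would also record the mapping itself as revisable data:

```python
# Invented illustration: merge two triple sets whose names partially
# overlap, by manually remapping my local names onto the community's
# names instead of minting fresh URIs for everything.

mine = {
    ("local:seq42", "hasVariant", "local:varA"),
    ("local:seq42", "sampleOf", "taxon:9606"),
}
theirs = {
    ("community:BRCA1", "hasVariant", "community:c.68_69delAG"),
}

# My judgement call, not a universal agreement: local:seq42 is
# "similar enough" to the shared name to reuse it.
mapping = {"local:seq42": "community:BRCA1"}

def remap(triples, mapping):
    """Rewrite subjects and objects according to an explicit mapping."""
    return {(mapping.get(s, s), p, mapping.get(o, o))
            for (s, p, o) in triples}

merged = remap(mine, mapping) | theirs
```

The point of the sketch is that the remapping lives outside the datasets: if I later decide the names were not similar enough after all, I change the mapping and re-merge, rather than retracting published statements.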
From economic theory: if there were a simple way to do something, someone else would already have done it and run off with the cash!

>> If someone else has a completely different problem domain that would
>> find it difficult to deal with direct, "un-framed"/"un-claim-wrapped"
>> statements from third parties using a URI, because they clash with
>> some of their statements or assumptions, how would the claim wrapping
>> practically help them?
>
> I think you have this backward. Naive engineers (meaning those that
> haven't hung around with the people on this list) will read the spec
> and have expectations about how things work, such as that one URI
> represents one resource, independent of "context". The idea that it's
> ok to break the rules because they are inconvenient is the equivalent
> of thinking it's ok to be a vandal. It's your responsibility as an
> educated engineer to understand and use the spec you are using in the
> documented way, or to write a different one. If you want to talk about
> specific problems you have with indirection, let's talk about that.
> But it is clear that the onus is on you to figure out a way to use the
> technology as specified, rather than me to solve your (at the moment
> vague and unspecified) usage problems.

My expectation is that the "Semantic Web" is a clever misnomer for a set of webs, which only really interact at a very high level, and rarely in semantic ways. For example, it is unlikely that there will be a meaningful, specific link when a biology term is referenced from a politics article on Wikipedia. Hence, an RDF version of Wikipedia would likely represent that link purely as a syntactic link, and anyone who expects the link to carry meaning, in whatever way they wish to interpret it, will be disappointed. I don't really expect you to solve the issue, as it is a broader issue than just RDF.
You would still have the granularity issue, where context (however that is defined) matters in whatever system you set up to assign meaningful names to complex situations. It is easy to approach the issue from the very simple RDF theory and just say that you have to assign a single name to every way that anyone would ever wish to name a particular thing, and then define all of their relationships exactly, and then we would have an agent that would do every scientist's job for them. (Hopefully that view of the Semantic Web died a long time ago!)

One of the biggest issues in this discussion is how to reference genomes that are not all unique, while also creating names for the "statistically most similar" representative genomes for groups of similar organisms. If you are only working at the level of representative genomes, as upper ontologies do, then you don't need to think about, or engineer practical solutions around, the problem of naming, and of managing the increasing detail down the line for how those names are agreed on and discovered. However, if you don't know what your users will want to talk about, then you can't rely on the high-level names, as you may accidentally say something that is not universally consistent (read that as Gödel-consistent if you want to understand basically what I meant). To achieve a typically consistent solution you would either need to create every possible name at the various levels of granularity, from the specific base-pair-unique named sequence up to the finally agreed-on statistical model for the genome, or, alternatively, you could stick with only using the most specific name and use inferencing to derive the other names based on user-driven ontologies, which are not necessarily widely shared upper ontologies.
Both of those approaches have engineering difficulties: the first is heavy on disk space, and the second is heavy on computation and forces people to regenerate names for all hypothetical sequences, even if you didn't completely sequence an entire genome and were unsure whether the rest of the genome had any effect on the outcome of your conclusion or on the realistic value of your statements. If, as the RDF theorists seem to be saying, the focus must be on avoiding all collisions, past, present and future, then you would have to go with the first solution and customise your own names for everything, as the second may not be consistent with what you meant if the hierarchy later changed. The first would at least have the benefit that, if the hierarchy changed, someone would still know what you initially meant and could manually reconstruct your results based on your names rather than the shared names (if those are actually created after everyone is forced to rename everything themselves, just in case).

>>
>> Life scientists attempting to use RDF to model their heterogeneous
>> information aren't trying to make ambiguous statements or reject the
>> wisdom of the logic/maths backgrounds of the specifications' authors;
>> they are just trying to get work done, and it seems that we are being
>> told that we are bad citizens for having a complex, "un-truthy"
>> domain.
>
> If I see a biologist doing mathematics, I'm going to look at whether
> they get it right. If they do representation I'm going to expect they
> do it right. I look to see that they do the biology right too (best I
> can). The labs hire professionals to do their mass spec. Should we
> expect less for data?

I think that may be missing the point, from my view. I don't view myself as a software engineer trying to do something perfect based on a software engineering theory.
I view myself as a scientist trying to use software engineering to make practical problems easier to think about, while providing some small piece of detail about the realistic environment that we live in. Whether I am able to eventually do that in a way that everyone will agree on is highly doubtful, although that won't stop me trying to recursively approach a set of solutions.

Peter
Received on Friday, 22 March 2013 06:44:15 UTC