- From: Peter Ansell <ansell.peter@gmail.com>
- Date: Fri, 22 Mar 2013 16:43:47 +1000
- To: Alan Ruttenberg <alanruttenberg@gmail.com>
- Cc: Jeremy J Carroll <jjc@syapse.com>, Jerven Bolleman <me@jerven.eu>, Graham Klyne <graham.klyne@zoo.ox.ac.uk>, w3c semweb HCLS <public-semweb-lifesci@w3.org>, Pat Hayes <phayes@ihmc.us>
On 22 March 2013 14:38, Alan Ruttenberg <alanruttenberg@gmail.com> wrote:
> On Thu, Mar 21, 2013 at 7:56 PM, Peter Ansell <ansell.peter@gmail.com>
> wrote:
>> On 22 March 2013 12:05, Alan Ruttenberg <alanruttenberg@gmail.com> wrote:
>> > On Wed, Mar 20, 2013 at 3:15 PM, Jeremy J Carroll <jjc@syapse.com>
>> > wrote:
>> >>
>> >> To me, that seems to lead us back to the earlier discussion (rathole?)
>> >> about owl:sameAs. I tend to a view that there are diminishing returns
>> >> in terms of levels of indirection here!
>> >
>> > As the number of levels of indirection increases, perhaps. But here we
>> > are talking about 1 level - separating claims from truth.
>>
>> The question that scientists spend their lives trying to establish is
>> the one that you seem to think is clearly defined in this statement,
>> i.e., "separating claims from 'truth'". In some domains, such as
>> logic/mathematics, "truth" is easy to define, and that seems to be the
>> basis that the RDF specifications use to justify their semantics.
>> However, in others, such as the life sciences (i.e., the domain of
>> public-semweb-lifesci), at least some of the best information we have
>> is approximate, idealised information that may not exactly match
>> anything at all in reality (e.g., large genome reference assemblies
>> that are statistically modelled from multiple samples but may not
>> actually match base for base with any actual DNA strand in the real
>> world). These approximations are referenced directly by scientists in
>> their publications without them having to qualify every statement as
>> referencing a "claim".
>
> When they need to say it is a claim they do so, either referring to the
> matter in that way or by language signals in their text. In the other
> cases there are different consequences for getting things wrong. If they
> assert something directly and their result depends on it then their
> result will be called into question.
>
> I am not saying that science presented as fact is infallible. Of course
> it isn't. But when we talk about things as fact we tend to back it up by
> implicit agreement to revise if the thing presented as fact is
> determined to be false. I (and Foundry) take this situation as one where
> the ontological commitment is one in which, should we make such
> statements and find them to be wrong, we will fix them. There's lots of
> cases where that isn't the commitment, Jeremy's case being one of them,
> I suspect. And there are plenty of true things (things that no one would
> object to) despite this. That the information is about a DNA sequence,
> that it is about differences between humans, that it is about an amino
> acid change at one place in the molecule, etc.

The trouble is that you are talking about the absolute simplest case, the upper ontology, and at that level it is still difficult to identify whether something is real or whether the papers that referenced it were faked/bogus (in the worst case). At lower levels each sample needs to be given a unique name; I agree totally with that. However, the difficulty is that the sample name apparently cannot migrate through the levels of indecision between there and "upper ontology name agreement" status without going through a wide range of experiments, each of which may in fact be looking at a slightly different gene sequence, while there is no way to disprove, after the fact, the hypothesis that they may all have been looking at exactly the same gene sequence, as the samples may have died or morphed, etc. (i.e., there is no way to go back in time and do the same thing again in more detail).

> Associated with each of the three kinds of logical commitment in those
> slides I wrote what inconsistency means. IMO, if you have some system
> and there's no way for you to be wrong in your usage then it's not
> worth using.
> Please tell me, given your assessment of scientific use of RDF, if
> there is any such use that can be wrong or inconsistent? If you can't
> then I'm guessing we are going to have a continued 'failure to
> communicate'. (attempt at humor, culture specific reference:
> http://en.wikipedia.org/wiki/What_we've_got_here_is_(a)_failure_to_communicate)

:) I have a critical realist philosophy background, so it is quite similar to the background that you seem to be coming from, with one important difference, in my opinion: concepts are free to move both up and down the scale without the community having to universally agree on where they sit on that scale at a particular point in time. That is, not everyone has to accept that a gene sequence is an accurate representation of a particular sample for it to be published and given an official name. In the same way, not everyone has to accept that an official name does not in fact relate to any one instance of a particular organism's genetic material before it is demoted from "official name" status down to merely an observation that was proved to be wrong.

If RDF semantics currently requires that new names be created whenever something moves between these statuses, then it is inefficient at best and troublesome at worst. In any widely distributed system there will always be a lack of clear join points where the change is executed, a database somewhere is modified, and the official definition of the gene is revoked from the shared scientific record. Even after all of the current databases are modified, the name may still exist in other locations that cannot ever be modified because of their shared nature (e.g., attached as evidence to a publication that has been widely distributed).

>>
>> I am not sure why you say that there is only one layer of wrapping
>> needed.
>> I can think of many different situations where someone could have more
>> than one layer of alternative interpretations that they may need to
>> accommodate other scientists now and in the future. The 4 or so layers
>> that the provenance ontology has just for published documents are
>> worrying enough, and they may not be enough to map the complexities of
>> genome reference assemblies, as genomics researchers may have a
>> different "publication" workflow to book publishers.
>
> Since I am not familiar with the PROV model (I tried to read it through
> but got frustrated), please say a little more, and justify why you
> think these "layers" need be represented as levels of indirection
> rather than assertions on a first or second level such as I have
> described.

The layers of indirection in PROV are based around being specific about what your users will want to talk about, while at the same time being specific enough about what you are talking about that others get some idea of the context in which you made the statement. For example, if you are talking about a book, you may not be sure whether a user will want to know exactly which book you were looking at when you made the statement, or no one may ever care or need to know that you were looking at "edition 2, published as a hard cover by a particular printer, with the back page ripped out" when they read your statement. In informal communication (which I understand is not directly going to map to RDF), the context of a statement can be inferred as the conversation moves along. In RDF, however, the context must be fixed, as you say, so that it is completely unambiguous exactly what you were talking about when you made the statement, and so that in future the statement cannot be misinterpreted.
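The one level of indirection being debated, asserting a triple directly versus wrapping it as a claim with explicit context, can be sketched in a few lines. This is a hypothetical illustration with invented names, using plain tuples rather than any real RDF library:

```python
# Hypothetical sketch (invented names, plain tuples rather than a real
# RDF store) of direct assertion versus one level of claim-wrapping.

# Direct assertion: presented as fact, no context attached.
direct = ("gene:BRCA1", "locatedOn", "chromosome:17")

# Claim-wrapped: the same triple becomes data about a claim, carrying
# PROV-style context (who asserted it, from what evidence, and what
# its current status is).
claim = {
    "statement": direct,
    "assertedBy": "agent:ExampleLab",      # hypothetical agent name
    "derivedFrom": "assembly:ExampleRef",  # hypothetical evidence name
    "status": "current-best-theory",
}

def retract(c):
    """Demote a claim without destroying the record that it was made:
    the name survives, only its status changes."""
    demoted = dict(c)
    demoted["status"] = "retracted"
    return demoted

old = retract(claim)
```

The extra link is the `statement` key: a consumer that only wants "facts" must unwrap it, which is exactly the cost being argued over, but in exchange the claim can later be demoted without minting a new name.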
I am not sure what you mean by indirection (misdirection may be a better term ;) just kidding), but to me each of the different levels of granularity will affect the way you are able to make a statement and the way someone else is going to be able to interpret it. The really sad thing (not A Good Thing(tm)) is that talking about editions of books is orders of magnitude easier than talking about unique genomes and the unique biological/chemical environments that affect the way the body interprets its own genome. That, in my opinion, is the reason we are having this discussion. We have to be able to simplify the model so that it suits the needs of scientists without them getting lost in the irrelevant details. Physicists are the best at this, in my opinion: they endlessly search for ways to make variables in equations completely redundant, based on the magnitude of the effects of changing each variable, so that they can move on to make further assertions and then test those theories further.

>>
>> > 2) I think there's a big difference between what one publishes on the
>> > web, and what one uses in the privacy of one's home, so to speak. If
>> > one is publishing on the web, it is good citizenship to respect
>> > specifications, and to consider the impact of one's assertions on the
>> > broader data consumer community. That consideration, IMO, is
>> > justification enough for the 1 extra indirection necessary to not
>> > make statements that are too strong.
>>
>> The specifications seem to be based on premises that practicing
>> scientists may not ever accept, i.e., the idea that there is static
>> scientific "truth" that can be unambiguously and continuously
>> communicated, and not "challengeable current theories" that can be
>> alternatively stated, or gradually or suddenly revoked and replaced
>> with new best theories.
>> Scientists need to be able to interpret, contrast, and concurrently
>> utilise past information directly, without having to suddenly wrap up
>> past "truths" inside "claims" because they may be out of date with
>> something someone else has now put into the RDF-sphere. The whole idea
>> that statements could be "too strong" takes its basis from "static
>> truth", and I cannot personally accept that we need to represent
>> everything for the life sciences inside "claims" (or alternatively
>> have everyone create new URIs for everything they want to talk about)
>> just in case it changes in future, or in case someone would find the
>> statement difficult to deal with because their application relies on
>> a different structure for their queries to work.
>
> No. The specification is based on the premise that if you are going to
> share information there have to at least be some rules. The rules were
> developed by a skilled working group, the semantics were written by an
> expert, and the whole survived what can be a rather brutal W3C approval
> process. There is room in those semantics to express what you want.
> People seem to be annoyed that it takes an extra link, some extra
> thinking, to do that. Tough.

If it were just one extra link then more people might be doing it. What I am trying to get at in this thread is how to override the rules in a systematic manner. Saying that you could not possibly do anything with RDF if you don't follow the rules to the letter is, to me, not realising the power of RDF. The power to mix and match data from the future and the past, manually remap it to your own interpretation, and republish it to generate new results is what scientists do every day. It is not really encouraging to hear that once someone says anything, everyone is bound to it, based on the theory behind the data model we are using.
I am free to mix and match datasets, and to interpret some datasets differently to others, and that is what people will actually do. You may get some stubborn individuals who completely refuse to use a dataset because they found a typo in it, but in general I think the majority of scientists will be happy to mix and match. Reducing the difficulty of mixing and matching would definitely be useful, but it is not necessary in absolutely every case. It is mainly the insistence on the absoluteness of the theory that worries me when I hear people say that if you don't like something you should recreate it from scratch yourself. That is just not how knowledge is created.

It is not just extra thinking and more communication that is necessary. You may be the only person working on a particular thing, and you need to name it before you can talk about it, but your name partially overlaps with someone else's enough for others to start using that name. In general society you are rejected as an outcast if you don't follow the crowd and convert to using their names, but that is not possible in RDF, where you have to be precise about what you are doing. Do you follow, or do you stay truthful to your RDF data model semantics because you personally know you are correct (even if everyone else has yet to realise that)? In most cases you will accept that the common name is different, but similar enough to provide you with your next round of funding if you reuse it, and you never go back to using your initial name again, for practicality reasons. It seems that in RDF everyone will effectively have to become outcasts by recreating their own names if they are in any way, or may ever become, slightly different.

For me it isn't about being perfect in science; rather, it is about accepting imperfection as the norm and wading through things for pearls of knowledge that you can't easily write a rule to find for you, as the Semantic Web was hoped to do for us if we got all of our naming perfect and in synchronization.
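The mixing, matching and manual remapping described above can be sketched in a few lines. The dataset contents and the mapping below are invented purely for illustration; a real workflow would also record the mapping itself as revisable data:

```python
# Invented illustration: merge two triple sets whose names partially
# overlap, by manually remapping my local names onto the community's
# names instead of minting fresh URIs for everything.

mine = {
    ("local:seq42", "hasVariant", "local:varA"),
    ("local:seq42", "sampleOf", "taxon:9606"),
}
theirs = {
    ("community:BRCA1", "hasVariant", "community:c.68_69delAG"),
}

# My judgement call, not a universal agreement: local:seq42 is
# "similar enough" to the shared name to reuse it.
mapping = {"local:seq42": "community:BRCA1"}

def remap(triples, mapping):
    """Rewrite subjects and objects according to an explicit mapping."""
    return {(mapping.get(s, s), p, mapping.get(o, o))
            for (s, p, o) in triples}

merged = remap(mine, mapping) | theirs
```

The point of the sketch is that the remapping lives outside the datasets: if I later decide the names were not similar enough after all, I change the mapping and re-merge, rather than retracting published statements.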
From economic theory: if there were a simple way to do something, someone else would already have done it and run off with the cash!

>> If someone else has a completely different problem domain that would
>> find it difficult to deal with direct, "un-framed"/"un-claim-wrapped"
>> statements from third parties using a URI, because they clash with
>> some of their statements or assumptions, how would the claim wrapping
>> practically help them?
>
> I think you have this backward. Naive engineers (meaning those that
> haven't hung around with the people on this list) will read the spec
> and have expectations about how things work, such as that one URI
> represents one resource, independent of "context". The idea that it's
> ok to break the rules because they are inconvenient is the equivalent
> of thinking it's ok to be a vandal. It's your responsibility as an
> educated engineer to understand and use the spec you are using in the
> documented way, or to write a different one. If you want to talk about
> specific problems you have with indirection, let's talk about that.
> But it is clear that the onus is on you to figure out a way to use the
> technology as specified, rather than me to solve your (at the moment
> vague and unspecified) usage problems.

My expectation is that the "Semantic Web" is a clever misnomer for a set of webs, which only really interact at a very high level, and rarely in semantic ways. For example, it is unlikely that there will be a meaningful, specific link when a biology term is referenced from a politics article on Wikipedia. Hence, an RDF version of Wikipedia would likely represent that link purely as a syntactic link, and anyone who expects the link to carry meaning, in whatever way they wish to interpret it, will be disappointed. I don't really expect you to solve the issue, as it is a broader issue than just RDF.
You would still have the granularity issue, where context (however that is defined) matters in whatever system you set up to assign meaningful names to complex situations. It is easy to approach the issue from the very simple RDF theory and just say that you have to assign a single name to every way that anyone would ever wish to name a particular thing, and then define all of their relationships exactly, and then we would have an agent that would do every scientist's job for them. (Hopefully that view of the Semantic Web died a long time ago!)

One of the biggest issues in this discussion is how to reference genomes that are not all unique, while also creating names for the "statistically most similar" representative genomes for groups of similar organisms. If you are only working at the level of representative genomes, as upper ontologies do, then you don't need to think about, or engineer practical solutions around, the problem of naming, and of managing the increasing detail down the line for how those names are agreed on and discovered. However, if you don't know what your users will want to talk about, then you can't rely on the high-level names, as you may accidentally say something that is not universally consistent (read that as Gödel-consistent if you want to understand basically what I meant). To achieve a typically consistent solution you would either need to create every possible name at the various levels of granularity, from the specific base-pair-unique named sequence up to the finally agreed-on statistical model for the genome, or, alternatively, you could stick with only using the most specific name and use inferencing to derive the other names based on user-driven ontologies, which are not necessarily widely shared upper ontologies.
Both of those approaches have engineering difficulties: the first is heavy on disk space, and the second is heavy on computation and forces people to regenerate names for all hypothetical sequences, even if you didn't completely sequence an entire genome and were unsure whether the rest of the genome had any effect on the outcome of your conclusion or on the realistic value of your statements. If, as the RDF theorists seem to be saying, the focus must be on avoiding all collisions, past, present and future, then you would have to go with the first solution and customise your own names for everything, as the second may not be consistent with what you meant if the hierarchy later changed. The first would at least have the benefit that, if the hierarchy changed, someone would still know what you initially meant and could manually reconstruct your results based on your names rather than the shared names (if those are actually created after everyone is forced to rename everything themselves, just in case).

>>
>> Life scientists attempting to use RDF to model their heterogeneous
>> information aren't trying to make ambiguous statements or reject the
>> wisdom of the logic/maths backgrounds of the specifications' authors;
>> they are just trying to get work done, and it seems that we are being
>> told that we are bad citizens for having a complex, "un-truthy"
>> domain.
>
> If I see a biologist doing mathematics, I'm going to look at whether
> they get it right. If they do representation I'm going to expect they
> do it right. I look to see that they do the biology right too (best I
> can). The labs hire professionals to do their mass spec. Should we
> expect less for data?

I think that may be missing the point, from my view. I don't view myself as a software engineer trying to do something perfect based on a software engineering theory.
I view myself as a scientist trying to use software engineering to make practical problems easier to think about, while providing some small piece of detail about the realistic environment that we live in. Whether I am able to eventually do that in a way that everyone will agree on is highly doubtful, although that won't stop me trying to recursively approach a set of solutions.

Peter
Received on Friday, 22 March 2013 06:44:15 UTC