Re: Datasets and contextual/temporal semantics from Pat Hayes on 2011-10-13 (public-rdf-wg@w3.org from October 2011)

From: Pat Hayes <phayes@ihmc.us>
Date: Thu, 13 Oct 2011 18:49:04 -0500
To: Dan Brickley <danbri@danbri.org>
Cc: Richard Cyganiak <richard@cyganiak.de>, RDF Working Group WG <public-rdf-wg@w3.org>
Message-Id: <F75AC3F8-72C2-41A5-8BA7-5C635B389A61@ihmc.us>
On Oct 13, 2011, at 9:21 AM, Dan Brickley wrote:

> On 13 October 2011 14:29, Pat Hayes <phayes@ihmc.us> wrote:
>> On Oct 13, 2011, at 6:10 AM, Richard Cyganiak wrote:
> 
>> Indeed, and that was DELIBERATE. A contextual logic (in the sense you are using it) simply does not work as a Web logic. For some discussion of this point, see  http://www.ihmc.us/users/phayes/IKL/GUIDE/GUIDE.html#LogicForInt . In fact, a contextual logic does not work for ontologies in general. If the truth of an assertion depends on the context in which it is asserted, and if this context is not available when it is read, then it is USELESS. Or maybe worse than useless.
> 
> Are you suggesting it is really practical and feasible for every
> assertion to be so explicit as to never need a 'best-before' date?

I don't see why not. I agree that the world isn't ever going to be this tidy, but it's not HARD to do this. RDB management has been doing it pretty well for decades now, and most of the main ideas were probably developed by Phoenician traders around the 10th century or so, about when the Arabs developed double-entry bookkeeping. But let me concede the point rather than argue it. Yes, real data will be 'contextual' and will need machinery to keep it up-to-date, etc., and some of it will be 'old' and yet still lying around the Web causing confusion. Sad, but true.

> Particularly in such a nuance-free language like RDF, I find this hard
> to believe. We can go the slippery slope towards only ever describing
> events,

? Who said anything about only describing events? Look at DBpedia. Most of it is effectively timeless.

> since their descriptions don't go stale, but in an open world
> (where relevant facts may always be missing),

That is a completely separate issue. Of course information may be missing. But take a simple example. It is one thing to have dated information with some dates missing. Then you know you have missed something when you look for the date but don't find it, for example. But it is quite another to say, we don't need to have any dates in our data because it is all contextual, and implicitly dated by the time it was first written. That is the 'context' stance, and it has several disastrous consequences. As Richard points out, it is now impossible to merge information, since it might for all you know be from different dates and so apparent contradictions might not be contradictions at all. (Mashups are now impossible. ALL of them.) Second, you have no way to know if information is missing or not: it's ALL missing, in effect. So it is much harder to catch errors, mistypings, mistakes, etc.
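
To make the merging point concrete, here is a minimal sketch in Python rather than RDF. It assumes triples are plain tuples, that "age" is a functional property, and that "ageOn" is a made-up dated variant of it; none of these names come from any real vocabulary. Merged without dates, the two records look contradictory and there is no way to tell whether they really are. With the date written into the data, the same check finds nothing wrong.

```python
from datetime import date

# Hypothetical undated data: (subject, property, value).
g2010 = {("alice", "age", 29)}
g2011 = {("alice", "age", 30)}

def functional_conflicts(triples, prop):
    """Map each subject to its values when it has more than one value for prop."""
    values = {}
    for s, p, o in triples:
        if p == prop:
            values.setdefault(s, set()).add(o)
    return {s: vs for s, vs in values.items() if len(vs) > 1}

# Merging is just set union; the undated merge shows an apparent contradiction.
undated_conflicts = functional_conflicts(g2010 | g2011, "age")

# Hypothetical dated data: (subject, property, value, as_of).
d2010 = {("alice", "ageOn", 29, date(2010, 10, 13))}
d2011 = {("alice", "ageOn", 30, date(2011, 10, 13))}

def dated_conflicts(records, prop):
    """Same check, but a conflict needs two values for the same subject AND date."""
    values = {}
    for s, p, o, as_of in records:
        if p == prop:
            values.setdefault((s, as_of), set()).add(o)
    return {k: vs for k, vs in values.items() if len(vs) > 1}

dated = dated_conflicts(d2010 | d2011, "ageOn")
```

The dated check can still catch a genuine error (two different ages for the same person on the same date), which is exactly what the undated form cannot distinguish from ordinary aging.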

> the utility of having a
> big pile of event descriptions is often questionable.
> 
>>> Many of our problems stem from that.
>>> 
>>> I'll give examples.
>>> 
>>>   :G2010 {:alice :age 29.}
>>>   :G2011 {:alice :age 30.}
>>> 
>>> Individually, each of those graphs are true (at a certain point in time). If taken together, an inconsistency is inferred (assuming :age is a functional property):
>>> 
>>>   :alice :age 29, 30.
>>> 
>>> By merging the two graphs, we have discarded the contextual information.
>> 
>> In RDF, that "contextual information" was never there in the first place. This is BAD RDF.
> 
> You may as well call the Web "bad"; but it's not going away. And nor
> is simple factual data published in Web pages --- a big use case for
> our stuff.

Factual data that is time-dependent but not given with time information is bad. I will stand by this and argue it forever. I'm sure people do bad things, no doubt. But our stance should be to tell them what not to do and why it's bad, not to be so flexible that it is impossible to even distinguish bad data from good data.

> 
> Practical example: (repeating something just aired during the F2F/telecon)
> 
> * in early FOAF stuff we tried to urge people towards
> decontextualised data that won't go stale. So for example here, to
> describe date of birth / events, rather than 'age'.
> * FOAF now has age? Why --- because Peter Mika asked for it, because
> he was involved with sites (e.g. MySpace) who are publishing the 'age'
> of users in HTML.
> * Should we be mailing MySpace and telling them to publish date/year
> of birth instead of age?

No, you (your code) should be doing a little arithmetic and then storing good data. God knows, this isn't hard, right?
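
The "little arithmetic" might look something like the following sketch. It assumes the consumer knows the date on which the page stating the age was retrieved; the function name, the record layout, and the example values are all hypothetical. An age alone only pins the birth year down to two candidates, so the sketch stores conservative bounds together with the observation date instead of the bare age.

```python
from datetime import date

def birth_year_range(age, observed_on):
    """Conservative (earliest, latest) birth year for someone aged `age` on `observed_on`.

    Someone whose birthday has already passed that year was born in
    observed_on.year - age; otherwise they were born one year earlier.
    """
    latest = observed_on.year - age
    return (latest - 1, latest)

def decontextualize(subject, age, observed_on):
    """Turn a scraped 'age' into a record that does not go stale."""
    earliest, latest = birth_year_range(age, observed_on)
    return {
        "subject": subject,
        "birthYearEarliest": earliest,
        "birthYearLatest": latest,
        "observedOn": observed_on.isoformat(),
    }

# Hypothetical example: an age of 39 read off a page on 2011-10-13.
record = decontextualize("http://example.org/someone", 39, date(2011, 10, 13))
```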

> Maybe it'd be good for The Youth to be forced
> to do more mental arithmetic? But standards != advocacy; we can't fix
> the world from a committee.

You could have 
> * with the rise of RDFa (and microformats, microdata etc) many
> factual assertions will come from such (database-driven) sites.
> 
> So "bad RDF" is perhaps not the most helpful perspective here.

I have no problem with factual data, only with *contextual* data.

> 
> Is there any value in going from sites publishing stuff like
>  <p>Dan is 39</p>
> to
> <p typeof="Person"><a href="http://danbri.org/" rel="homepage"
> property="firstName">Dan</a> is <span rel="age">30</span></p> ?
> 
> ... I think so. But it puts work onto the consumer of the data: we
> need to remember where we got it. And maybe a whole pile of other info
> too. Anyone doing data aggregation is familiar with such requirements,
> even if they are hard to express in logical languages. This doesn't
> make either bad; but we have work to do bridging between the logical
> and data-hacking perspectives.
> 
> And maybe this also puts some work onto the RDF community: that we
> should make some experiments (yes, research + hacking, not standards)
> around annotating properties, to indicate that our property 'age' is
> more """volatile""" than our property 'dateOfBirth'. And perhaps even
> specifically that 'age' goes stale relatively quickly (in whatever
> level of detail suits application demands). For some
> as-yet-undocumented notion of """volatile""".

Neat idea. I have seen OWL ontologies - in OWL-Full of course - which do exactly this for annotating military intelligence data, by the way. Note, the very *possibility* of doing this in the actual language depends on having temporal information made explicit in the axioms rather than hidden inside the logical semantics. Knowing volatility is only useful if you can compare when the record was made with the current time. So you need to record that time-of-writing somewhere. And when you do that, you have decontextualized the data. 
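
As a rough illustration of why volatility needs the time-of-writing, here is a sketch in Python. The property names, the one-year lifetime for 'age', and the VOLATILITY table are all assumptions made up for the example, not part of any published ontology. Note that the staleness test cannot even be written without a `recorded_on` argument.

```python
from datetime import date, timedelta

# Hypothetical volatility annotations: how long a value of each property
# can be trusted after it was recorded. None means "does not go stale".
VOLATILITY = {
    "age": timedelta(days=365),
    "dateOfBirth": None,
}

def is_stale(prop, recorded_on, today):
    """True if a value of `prop` recorded on `recorded_on` is past its lifetime.

    Properties with no annotation are treated as non-volatile here,
    which is itself just an assumption of this sketch.
    """
    lifetime = VOLATILITY.get(prop)
    if lifetime is None:
        return False
    return today - recorded_on > lifetime

today = date(2011, 10, 13)
old_age = is_stale("age", date(2010, 1, 1), today)
recent_age = is_stale("age", date(2011, 6, 1), today)
birth = is_stale("dateOfBirth", date(1990, 1, 1), today)
```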

But anyway, here in the WG, what do we do? Do we 'adapt' RDF so that we abandon even the simple notion of consistency that might enable someone to even detect that data is out of date or has errors in it? If so, IMO we have simply given up on the very idea of the semantic web. 

> 
>>> This shows that the graph merge operation is *not truth-preserving* – not *valid* in the formal sense – *if* the merged graphs have different contexts.
>> 
>> No, it shows that they don't have contexts. Graph merging is truth preserving, precisely because RDF is *not* a contextual logic.
> 
> RDF is not a contextual logic; it is and should remain a simple minded
> language that can be used to make fairly basic assertions about a/the
> world. RDF's cartoon universe has no notion of time nor change.

No, wrong. This is a common misapprehension. There is nothing in the RDF model theory that stops RDF talking about time and change. You can write the most sophisticated OWL2/RDF temporal ontology you like, you can have relativistic branching time, whatever you like. It's been done, the ontologies are already published. There is nothing 'cartoony' about this, and it has time and change out the wazoo. If you want to go to a richer logic, there are even more possibilities. (See http://www.ihmc.us/users/phayes/docs/timeCatalog.pdf, written 15 years ago now so probably way out of date.)  What you can't do however is to have the *making of assertions* be time-relative. The logic DESCRIBES the time but is not EMBEDDED in time. Similarly for RDF.

So to return to your point here. If you are saying that RDF is limited to very simple data, data that cannot have any temporal information in it, *because* RDF has no notion of time and change (in contrast to SPARQL, which does?), then I sharply disagree. RDF can describe time and change. In fact, you get a lot of  extra leverage from having all this time information in the actual data where it can be reasoned about.

What RDF (and OWL and OWL2 and FOL and Common Logic and IKL and higher-order logic and modal logic and ...) cannot do is keep track of consequences of **assertions which are made, contextually, at times** and whose value may change with time. Logics are not temporally embedded in the way that programming languages are. If you like, they do not have any way to deal with "real time", the time that is passing while inferences are happening. But, I would claim, for most data recording purposes, we don't want formalisms which do this. The last thing you want in your accounts or tax records or government data or geographical concordance is indexicals like 'now' or 'here' whose meanings depend on when or where they were written. OK, I know there are things like age records. But they are *dated*, aren't they? And that date record is exactly what de-contextualizes them and lets people at other times figure out what those ages really mean when the data is actually used. My driver's license records that I was 59 *on the date it was issued*, which is also on the license.

> However the people using RDF have to build systems that bridge this
> simplified perspective back into our real lives, software
> applications, ever-changing datasets etc., where time and change are
> constantly messing with us.
> 
> This is (as I think Richard articulated quite nicely) at the heart of
> our problem. RDF's worldview is super-super-simplified.

No, really, it isn't. This stubborn misconception is causing our discussions to be constantly taken off track. The issue is not one of simplicity vs. complexity. It is how to deal with *inferences* made from data which can change. It is what AI called the 'truth maintenance' problem. If I have some data D and some conclusions C are drawn from it and stored somewhere, and then D is changed to D', my warrant for inferring C has gone away. But there is nothing in C to say that this has happened; and in the Web setting, there is no way for D or D's owner to transmit a warning or some update information to C. So the whole pattern of validity of inference has become infected by this volatility of the data.
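
The D, D', C situation can be sketched as follows. This is only an illustration under invented assumptions: the "bornBefore" rule, the tuple representation, and the idea of storing a hash of D next to C are all hypothetical, and this is not how any particular truth-maintenance system works. It shows the one thing a consumer can do on its own: remember what C was derived from, and re-check later by fetching D again, since nothing will be pushed to it.

```python
import hashlib

def fingerprint(triples):
    """Order-independent digest of a set of triples."""
    canonical = "\n".join(sorted(repr(t) for t in triples))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def conclude(data):
    """Draw conclusions C from data D and record what they were derived from."""
    conclusions = {
        (s, "bornBefore", 1990)
        for s, p, o in data
        if p == "birthYear" and o < 1990
    }
    return {"conclusions": conclusions, "derived_from": fingerprint(data)}

def still_warranted(stored, current_data):
    """C keeps its warrant only if D is unchanged since C was derived."""
    return stored["derived_from"] == fingerprint(current_data)

D = {("alice", "birthYear", 1981)}
stored = conclude(D)

# Later the source is corrected to D'. Nothing in the stored conclusions
# changes by itself; the consumer only notices by re-fetching and comparing.
D_prime = {("alice", "birthYear", 1991)}
```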

> To live with
> this simplicity, we need some tricks, techniques and so on. What we
> have to figure out, is which of those tricks and techniques are
> (something like) data-hacking folklore and which can be specified
> using the other instruments of W3C committee-dom, namely testcases,
> computer languages, semantics specs and so on.
> 
> It will do us no good at all to just stand here and say "don't use
> properties like 'age' ...".

Well, I think we could recommend this, and say why, to make the point pretty clearly. Even if it is only given as good-practice advice.

> What we can say is "if you use properties
> like 'age', ... consider managing and sharing your data with the
> following conventions.".

Right, I wholly agree. My point to Richard was meant to convey that we should not castrate the entire logic and destroy the basic notions of consistency and entailment just to accommodate properties like "age" and make them simple to "use". (I use scare quotes here because if graphs cannot be merged, then RDF cannot really be used for anything.)

> 
> This theme btw underpins some of my concerns with Sandro's advocacy
> for a simple "I got these triples from this IRI" version of
> WebArch-for-SemWeb. In too many real-world scenarios, we'll want to
> keep a whole packet of information telling us where a bunch of data
> came from. And it might have come from the same basic IRI several
> times under varying circumstances. (Specs like
> http://www.w3.org/TR/HTTP-in-RDF10/ are a good start at keeping that
> "how I transacted with the Web, and what I got back" data diary.)

Sounds like something the provenance WG should be talking about. 

> 
> All this doesn't mean that data can only ever be considered in
> contexts. Just that we need to get better, much better, at providing
> all kinds of hints to help application developers and consuming apps
> flatten things down from contextualised and quoted representations,
> into simple flat truthy assertions.

Stop it with the 'simple'. In fact, these will be more complicated than the contextual ones seem to be (which is exactly why people are more comfortable with e.g. age than date of birth).

> We will make different flattenings
> under different circumstances, depending on risk scenarios, data
> availability and other worldly constraints. This is natural and
> healthy, and leaves RDF as simple propositional content while
> admitting that there is (e.g. via SPARQL) a rich set of data
> management practices around it that absolutely do need to deal
> pragmatically with time, change and provenance.

I utterly agree. I hope I never said anything that suggested otherwise. Although I think that locating these management practices only in SPARQL is a strategic mistake. We can also give good advice about how to write RDF better. 

Pat

> 
> cheers,
> 
> Dan
> 
> 

------------------------------------------------------------
IHMC                                     (850)434 8903 or (650)494 3973   
40 South Alcaniz St.           (850)202 4416   office
Pensacola                            (850)202 4440   fax
FL 32502                              (850)291 0667   mobile
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Thursday, 13 October 2011 23:49:40 UTC