- From: Sandro Hawke <sandro@w3.org>
- Date: Tue, 20 Sep 2011 21:54:37 -0700
- To: public-rdf-prov@w3.org
- Cc: public-rdf-wg <public-rdf-wg@w3.org>, Provenance WG <public-prov-wg@w3.org>
[Please reply to public-rdf-prov@w3.org, not either WG list. If
you're interested in seeing replies, please subscribe to that list or
read its archives [3].]
During the joint RDF/Provenance task force call last week [0], I agreed
to draft a single, concrete use case for this work. At the time, I had
forgotten about the Graphs Use Cases page [1], and no one mentioned
it. So I spent some time thinking about it, and talking to Eric
Prud'hommeaux. I haven't yet gone through [1] to determine how each of
those relates to this analysis, and I'm headed into a meeting that will
probably stop me returning to this for a while. So I'm going to just
send this now.
It seems to me the driving use case here is the desire to glean usable
information from imperfect sources. Reduced to a single word, the use
case is Trust. In TimBL's now-ancient Layer Cake vision for the
Semantic Web, the top layer is "Web of Trust" or just "Trust" [2]. How
can people act based on information they find, when that information
might not be right? How can the system itself help us know what to
trust? Is it possible to make parts of a system more trustworthy than
the elements on which they rely? (I think Google has convinced
everyone that it can be done; can it be done in an open/interoperable way?)
Here's my minimal concrete use case:
Alice wants to find a good, local seafood restaurant. She has many
ways to find restaurant reviews in RDF -- some embedded in people's
blogs, some exported from sites which help people author reviews,
some exported from sites which extract and aggregate reviews from
other sites -- and she'd like to know which sources she can trust.
Actually, she'd like the computer to do that for her, and just
analyze the trustworthy data. Is there a way the Web can convey
metadata about those reviews that lets software assess the relative
reliability of the different sources?
That's the short version. For the rest of this message, I'm going to:
1. Explore reasons the data might not be trustworthy. Trust isn't
just about lies; it's about all the reasons data might be
imperfect.
2. Explore other application domains, showing how the same issues
arise. This isn't just about seafood restaurants, of course,
or even just about consumers making choices. It's also about
medical research, political processes, corporate IT, etc.
3. Offer a few thoughts about solutions. It's what you'd probably
expect; we need a way in RDF to convey the kind of information
needed to determine the trustworthiness of other RDF
sources. We need to be able to talk about particular
statements, about particular curated collections of statements,
and about the people and organizations behind those statements
and databases.
== (1) Some Reasons Data Is Imperfect ==
There are many reasons why information found in RDF might not be
trustworthy. In many cases it is still useful and may be the best
information available. For simplicity, the reasons are here applied
first to the classic example problem of selecting a seafood
restaurant. The reasons have much wider applicability, however, and
more application domains are explored in section 2.
DECEPTION: Alice is trying to find the best local seafood
restaurant using reviews posted by various earlier patrons. One
restaurant, Mal's Mollusks, attempts to trick her by posting many
positive reviews using fake identities.
ERROR: Errol tries to post a glowing review of his favorite
restaurant, Mel's Mellon Soups, but accidentally files it under
Mal's. Alice might be led down the wrong path (to eating at
Mal's) by Errol's mistake.
SIMPLIFICATION: Simon makes a point of trying a new restaurant
every day, but doesn't like to keep detailed records. After a
while, he comes to the opinion that all the seafood restaurants in
town are really quite good. One day, while visiting a restaurant
review site, he quickly rates them all as such, without bothering
to notice that he's never even tried Mal's. (He wouldn't consider
this a mistake; for his purposes, this was good enough data.)
TIME LAG: Mal is actually Mal Jr, having taken over the restaurant
from his father, Mal Sr. Mal Sr ran a great restaurant (the
finest squid dumplings in Texas), but it's gone steeply downhill
since Mal Jr took over. Some of the reviews from the old days
still rightly glow about Mal Sr's restaurant.
SUBJECTIVITY: Some people actually like Mal Jr's cooking. There's no
accounting for taste, but perhaps the other things these people
like, if Alice knew about them, could give her some clue to
disregard their high opinion of Mal's.
This list of five reasons is not meant to be exhaustive; it's just all
I could think of today.
== (2) Some Other Problem Domains ==
Trust reasoning comes up in many other problem domains, of course.
Here are two more example domains to show how the need for trust
reasoning applies beyond selecting reviews of potential partners in
commercial transactions.
Science
When one researcher (Alice) is considering building on the work
reported by another researcher (Robbie), similar trust issues
arise. Here, the consequences can be quite serious.
DECEPTION: Did Robbie falsify results, in order to publish?
ERROR: Did Robbie (or one of his assistants) make an honest but
undetected mistake?
SIMPLIFICATION: This may be the hardest to avoid: what
simplifying assumptions did Robbie make? They may be common in
the field, but perhaps Alice is in a different sub-field, or a
different part of the world, or a different time, when the
assumptions are different.
TIME LAG: Perhaps Robbie publishes environmental sample data from
his city on a monthly basis. For studying a larger picture,
Alice may need to know exactly when the samples were taken and
how recent the "current" ones are.
SUBJECTIVITY: Robbie's work with human subjects was approved by
his university's research ethics board, but perhaps their standards
are different from those Alice wants to endorse by building on
them. Or: Robbie's assistants had to use judgment to classify
some results; another set of assistants might have classified
them differently.
An Employee Directory
A large company, formed largely by acquiring smaller companies,
maintains an on-line directory of office locations, phone
numbers, email addresses, job titles, etc, for its millions of
employees across 12 continents, on nine planets :-). Alice is
trying to use it to find Bob's address, so she can mail him the
hat he left at a meeting at her site.
DECEPTION: Mallory is engaged in corporate espionage and has
altered the directory for this week so Bob's mail actually goes
to Mallory's office; he's waiting for some key blueprints to be
delivered, then he'll change the address back, probably before
Bob notices. He'll be surprised by the hat.
ERROR: Charlie, a coder in Bob's division, made an error in his
routine to export that division's phone book upstream; the error
causes truncation of the last character of the building name,
turning Bob's "Building 21" into "Building 2".
SIMPLIFICATION: Bob actually has two different offices in
different buildings, and works from home most of the time. He
had to pick one for the phone book. He'll end up not getting the
hat for an extra week because of this.
TIME LAG: Bob switched offices 6 months ago. It took him 2
months to get around to updating the phone book, and the upstream
data flow only makes it all the way through every six months,
so Alice still sees his old address.
SUBJECTIVITY: Bob's building has several different names and
nicknames it has acquired over the years. Bob, and a few others
in his group, still call it the "AI Building", so that's what he
put in the phone book. The new kid in the mail room doesn't know
that term, so the package gets returned or delayed.
There are other areas, of course, that call for trust reasoning, such
as:
- political decision making (voting, donating)
- information used to match employers with employees (hiring, job
search)
- information used in expanding one's social network (connecting with
new colleagues, friends, dating)
... and I'm sure many, many more. If you need information, you need to
know if you can trust it.
== (3) Solutions ==
So what do these examples have in common, and how might we address them
with some standard technology?
In every case, the data consumer (Alice) obtains some information (the
data) and would benefit from having some additional information (the
metadata) which would help her to determine whether or how she can
safely rely on the data.
The metadata might come from the data provider, disclaiming or
clarifying it. It might also come from an intermediary or aggregator,
saying how and where they got it. Or it could come from many
different kinds of third parties, like ratings agencies, the public,
or the information consumer's social network.
I see an interesting division in the kinds of metadata:
1. is the data she retrieved trustworthy?
2. is the person/organization who authored that data trustworthy?
3. is the data source (database URL) she retrieved it from trustworthy?
4. is the person/organization who runs the data source trustworthy?
These can be quite different. The people can be trustworthy but
run a data source full of admittedly low quality data. Or a database
of data that's mostly correct can have some bad triples in it.
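To make that four-way division concrete, here is a rough sketch in TriG.
Every term in the ex: namespace, and every example URL, is invented purely
for illustration; nothing here is a proposal for actual vocabulary:

```trig
@prefix ex: <http://example.org/trust#> .
@prefix dc: <http://purl.org/dc/terms/> .

# The review data Alice retrieved, held in a named graph so it
# can be the subject of metadata triples.
<http://reviews.example/mals-reviews> {
    <http://reviews.example/review/17>
        ex:rates <http://restaurants.example/mals-mollusks> ;
        ex:score 5 ;
        dc:creator <http://people.example/errol> .
}

# Metadata at each of the four levels:
<http://reviews.example/mals-reviews>
    ex:dataTrust "low" .                                  # 1: the retrieved data
<http://people.example/errol>
    ex:authorTrust "medium" .                             # 2: the data's author
<http://reviews.example/>
    ex:sourceTrust "high" ;                               # 3: the data source
    ex:runBy <http://orgs.example/reviewcorp> .
<http://orgs.example/reviewcorp>
    ex:orgTrust "high" .                                  # 4: the source's operator
```

As the sketch shows, the four judgments are independent: here a
well-run source with a trustworthy operator still serves one graph of
low-quality data from a middling author.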
Note that it's possible for metadata to have its own metadata. For
instance, statement S1 may be declared untrustworthy by person P1 who
is declared untrustworthy by person P2 who is declared trustworthy in
a statement available at source U1, etc, etc. Ideally there's a chain
of trustworthiness assertions rooted at a known trustworthy source,
but I suspect that will rarely be the case. More likely, I expect to
see a lot of triples that amount to "+1" and "-1" from one source
applied to another. Hopefully there will be more explanation
included, and it will be clear whether it's applied to the
data/content (a g-snap), the database in general over time (a
g-box), the data author, or the database maintainer (an agent).
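The chain of assertions described above (S1, P1, P2, U1) might look
something like this in TriG, again with an invented ex: vocabulary,
and with U1 modeled as a named graph so the source of the final
assertion is explicit:

```trig
@prefix ex: <http://example.org/trust#> .

# S1: a statement given its own named graph so it can be talked about.
ex:S1 {
    <http://restaurants.example/mals-mollusks> ex:rating "excellent" .
}

# P1 declares S1 untrustworthy (a "-1"), and P2 declares P1
# untrustworthy in turn.
ex:P1 ex:distrusts ex:S1 .
ex:P2 ex:distrusts ex:P1 .

# A statement available at source U1 vouches for P2 (a "+1").
<http://example.org/U1> {
    ex:P2 a ex:TrustworthyAgent .
}
```

If Alice already trusts whatever is served at U1, software could walk
this chain and conclude that S1 should probably be discounted; without
such a root, the +1/-1 assertions are just more data to be weighed.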
Well, that's all I have time for right now. Hopefully this will help
clarify what some of us are hoping for here. To be clear, I should
say I'm not expecting either WG to *solve* these problems, just to give
us some building blocks that enable system builders to make some
progress on solving them.
One more observation: digging into any of these use cases, it's clear
to me I can solve that particular one without any standards work beyond
settling on the vocabulary for that use case. That is, I can build the
provenance vocabulary into the application vocabulary. I think the
goal here, however, is to factor that work out, because it's common to
so many application areas.
-- Sandro
[0] http://www.w3.org/2011/rdf-wg/meeting/2011-09-15
[1] http://www.w3.org/2011/rdf-wg/wiki/TF-Graphs-UC
[2] http://www.w3.org/2000/Talks/0906-xmlweb-tbl/slide9-0.html
or http://www.w3.org/2002/Talks/04-sweb/slide12-0.html
or http://www.w3.org/2007/03/layerCake.png
[3] http://lists.w3.org/Archives/Public/public-rdf-prov/
Received on Wednesday, 21 September 2011 04:54:50 UTC