RE: Unifying RDF Provenance Use Case: Trust from Myers, Jim on 2011-09-22 (public-prov-wg@w3.org from September 2011)

From: Myers, Jim <MYERSJ4@rpi.edu>
Date: Thu, 22 Sep 2011 15:24:22 +0000
To: "public-rdf-prov@w3.org" <public-rdf-prov@w3.org>
CC: public-rdf-wg <public-rdf-wg@w3.org>, Provenance WG <public-prov-wg@w3.org>
Message-ID: <3131E7DF4CD2D94287870F5A931EFC230295BDDA@EX14MB2.win.rpi.edu>
Sandro,

I think the general use case sounds reasonable. One quick comment: the Time Lag issue for Mal's restaurant seems to me to be more of a thing/entity issue. "Mal's restaurant" is ill-defined and really represents two restaurants, the one run by Mal Jr and the one run by Mal Sr. While things like the address are stable across both, quality is not, and hence we need separate entities to distinguish the two. Just to make it clearer that it isn't only a time issue: suppose you try to find 'people like you' to decide whether you trust their reviews. You need to know which 'Mal's' they are reviewing, but the old reviews are still useful for comparing the tastes of those who went to 'Mal Sr's'. None of this invalidates the point that there is an issue; it's just about how to characterize it.
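
To make the entity point concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: the RestaurantEra class, the era_for helper, the address, and the handover date are all hypothetical and do not come from the use case. Each management period is its own entity sharing a stable address, and a review is attached to whichever entity was operating on the review date.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class RestaurantEra:
    """One management period of a restaurant, treated as its own entity."""
    label: str
    address: str            # stable across eras
    start: date
    end: Optional[date]     # None means the era is still current


# Hypothetical data: the address and handover date are made up for illustration.
HANDOVER = date(2009, 1, 1)
ERAS = [
    RestaurantEra("Mal Sr's", "1 Harbor St", date(1980, 1, 1), HANDOVER),
    RestaurantEra("Mal Jr's", "1 Harbor St", HANDOVER, None),
]


def era_for(review_date: date, eras: List[RestaurantEra]) -> Optional[RestaurantEra]:
    """Return the era (entity) that a review written on review_date refers to."""
    for era in eras:
        if era.start <= review_date and (era.end is None or review_date < era.end):
            return era
    return None


old_review = era_for(date(2005, 6, 1), ERAS)
new_review = era_for(date(2011, 9, 1), ERAS)
print(old_review.label, "/", new_review.label)
```

With this split, old reviews stay attached to 'Mal Sr's', so they remain usable for comparing reviewers' tastes without being counted toward the restaurant as it is run today.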

 Jim

> -----Original Message-----
> From: public-prov-wg-request@w3.org [mailto:public-prov-wg-
> request@w3.org] On Behalf Of Sandro Hawke
> Sent: Wednesday, September 21, 2011 12:55 AM
> To: public-rdf-prov@w3.org
> Cc: public-rdf-wg; Provenance WG
> Subject: Unifying RDF Provenance Use Case: Trust
> 
> [Please reply to public-rdf-prov@w3.org, not to either WG list.  If you're
> interested in seeing replies, please subscribe to that list or read its archives
> [3].]
> 
> During the joint RDF/Provenance task force call last week [0], I agreed to
> draft a single, concrete use case for this work.  At the time, I had forgotten
> about the Graphs Use Cases page [1], and no one mentioned it.  So I spent
> some time thinking about it, and talking to Eric Prud'hommeaux.  I haven't
> yet gone through [1] to determine how each of those relates to this analysis,
> and I'm headed into a meeting that will probably stop me returning to this
> for a while.  So I'm going to just send this now.
> 
> It seems to me the driving use case here is the desire to glean usable
> information from imperfect sources.  Reduced to a single word, the use case
> is Trust.  In TimBL's now-ancient Layer Cake vision for the Semantic Web,
> the top layer is "Web of Trust" or just "Trust" [2].  How can people act based
> on information they find, when that information might not be right? How
> can the system itself help us know what to trust?  Is it possible to make
> parts of a system more trustworthy than the elements on which they rely?
> (I think Google has convinced everyone that can be done; can it be done in an
> open/interoperable way?)
> 
> Here's my minimal concrete use case:
> 
>    Alice wants to find a good, local seafood restaurant.  She has many
>    ways to find restaurant reviews in RDF -- some embedded in people's
>    blogs, some exported from sites which help people author reviews,
>    some exported from sites which extract and aggregate reviews from
>    other sites -- and she'd like to know which sources she can trust.
>    Actually, she'd like the computer to do that for her, and just
>    analyze the trustworthy data.  Is there a way the Web can convey
>    metadata about those reviews that lets software assess the relative
>    reliability of the different sources?
> 
> That's the short version.  For the rest of this message, I'm going to:
> 
>    1.  Explore reasons the data might not be trustworthy.  Trust isn't
>        just about lies; it's about all the reasons data might be
>        imperfect.
> 
>    2.  Explore other application domains, showing how the same issues
>        arise.  This isn't just about seafood restaurants, of course,
>        or even just about consumers making choices. It's also about
>        medical research, political processes, corporate IT, etc.
> 
>    3.  A few thoughts about solutions.  It's what you'd probably
>        expect; we need a way in RDF to convey the kind of information
>        needed to determine the trustworthiness of other RDF
>        sources. We need to be able to talk about particular
>        statements, about particular curated collections of statements,
>        and about the people and organizations behind those statements
>        and databases.
> 
> == (1) Some Reasons Data Is Imperfect ==
> 
> There are many reasons why information found in RDF might not be
> trustworthy.  In many cases it is still useful and may be the best information
> available.  For simplicity, the reasons are here applied first to the classic
> example problem of selecting a seafood restaurant.  The reasons have much
> wider applicability, however, and more application domains are explored in
> section 2.
> 
>     DECEPTION: Alice is trying to find the best local seafood
>     restaurant using reviews posted by various earlier patrons.  One
>     restaurant, Mal's Mollusks, attempts to trick her by posting many
>     positive reviews using fake identities.
> 
>     ERROR: Errol tries to post a glowing review of his favorite
>     restaurant, Mel's Mellon Soups, but accidentally files it under
>     Mal's.  Alice might be led down the wrong path (to eating at
>     Mal's) by Errol's mistake.
> 
>     SIMPLIFICATION: Simon makes a point of trying a new restaurant
>     every day, but doesn't like to keep detailed records.  After a
>     while, he comes to the opinion that all the seafood restaurants in
>     town are really quite good.  One day, while visiting a restaurant
>     review site, he quickly rates them all as such, without bothering
>     to notice that he's never even tried Mal's.  (He wouldn't consider
>     this a mistake; for his purposes, this was good enough data.)
> 
>     TIME LAG: Mal is actually Mal Jr, having taken over the restaurant
>     from his father, Mal Sr.  Mal Sr ran a great restaurant (the
>     finest squid dumplings in Texas), but it's gone steeply downhill
>     since Mal Jr took over.  Some of the reviews from the old days
>     still rightly glow about Mal Sr's restaurant.
> 
>     SUBJECTIVITY: Some people actually like Mal Jr's cooking.  There's no
>     accounting for taste, but perhaps the other things these people
>     like, if Alice knew about them, could give her some clue to
>     disregard their high opinion of Mal's.
> 
> This list of five reasons is not meant to be exhaustive; it's just all I could
> think of today.
> 
> == (2) Some Other Problem Domains ==
> 
> Trust reasoning comes up in many other problem domains, of course.
> Here are two more example domains to show how the need for trust
> reasoning applies beyond selecting reviews of potential partners in
> commercial transactions.
> 
> 
> Science
> 
>      When one researcher (Alice) is considering building on the work
>      reported by another researcher (Robbie), similar trust issues
>      arise.  Here, the consequences can be quite serious.
> 
>      DECEPTION: Did Robbie falsify results, in order to publish?
> 
>      ERROR: Did Robbie (or one of his assistants) make an honest but
>      undetected mistake?
> 
>      SIMPLIFICATION: This may be the hardest to avoid: what
>      simplifying assumptions did Robbie make?  They may be common in
>      the field, but perhaps Alice is in a different sub-field, or a
>      different part of the world, or a different time, when the
>      assumptions are different.
> 
>      TIME LAG: Perhaps Robbie publishes environmental sample data from
>      his city on a monthly basis.  For studying a larger picture,
>      Alice may need to know exactly when the samples were taken and
>      how recent the "current" ones are.
> 
>      SUBJECTIVITY: Robbie's work with human subjects was approved by
>      his university's research ethics board, but perhaps their standards
>      are different from those Alice wants to endorse by building on
>      them.  Or: Robbie's assistants had to use judgment to classify
>      some results; another set of assistants might have classified
>      them differently.
> 
> An Employee Directory
> 
>      A large company, formed largely by acquiring smaller companies,
>      maintains an on-line directory of office locations, phone
>      numbers, email addresses, job titles, etc, for its millions of
>      employees across 12 continents, on nine planets :-).  Alice is
>      trying to use it to find Bob's address, so she can mail him the
>      hat he left at a meeting at her site.
> 
>      DECEPTION: Mallory is engaged in corporate espionage and has
>      altered the directory for this week so Bob's mail actually goes
>      to Mallory's office; Mallory is waiting for some key blueprints
>      to be delivered, then he'll change the address back, probably
>      before Bob notices.  He'll be surprised by the hat.
> 
>      ERROR: Charlie, a coder in Bob's division, made an error in his
>      routine to export that division's phone book upstream; the error
>      causes truncation of the last character of the building name,
>      turning Bob's "Building 21" into "Building 2".
> 
>      SIMPLIFICATION: Bob actually has two different offices in
>      different buildings, and works from home most of the time.  He
>      had to pick one for the phone book.  He'll end up not getting the
>      hat for an extra week because of this.
> 
>      TIME LAG: Bob switched offices 6 months ago.  It took him 2
>      months to get around to updating the phone book, and the upstream
>      data flow only makes it all the way through every six months,
>      so Alice still sees his old address.
> 
>      SUBJECTIVITY: Bob's building has several different names and
>      nicknames it has acquired over the years.  Bob and a few others
>      in his group still call it the "AI Building", so that's what he
>      put in the phone book.  The new kid in the mail room doesn't know
>      that term, so the package gets returned or delayed.
> 
> There are other areas, of course, that call for trust reasoning, such
> as:
> 
> - political decision making (voting, donating)
> 
> - information used to match employers with employees (hiring, job
>   search)
> 
> - information used in expanding one's social network (connecting with
>   new colleagues, friends, dating)
> 
> ... and I'm sure many, many more.  If you need information, you need to
> know if you can trust it.
> 
> == (3)  Solutions ==
> 
> So what do these examples have in common, and how might we address
> them with some standard technology?
> 
> In every case, the data consumer (Alice) obtains some information (the
> data) and would benefit from having some additional information (the
> metadata) which would help her to determine whether or how she can
> safely rely on the data.
> 
> The metadata might come from the data provider, disclaiming or clarifying
> it.  It might also come from an intermediary or aggregator, saying how and
> where they got it.  Or it could come from many different kinds of third
> parties, like ratings agencies, the public, or the information consumer's social
> network.
> 
> I see an interesting division in the kinds of metadata:
> 
>    1.  is the data she retrieved trustworthy?
>    2.  is the person/organization who authored that data trustworthy?
>    3.  is the data source (database URL) she retrieved it from trustworthy?
>    4.  is the person/organization who runs the data source trustworthy?
> 
> These can be quite different.  The people can be trustworthy but run a data
> source full of admittedly low quality data.  Or a database of data that's
> mostly correct can have some bad triples in it.
> 
> Note that it's possible for metadata to have its own metadata.  For instance,
> statement S1 may be declared untrustworthy by person P1 who is declared
> untrustworthy by person P2 who is declared trustworthy in a statement
> available at source U1, etc, etc.  Ideally there's a chain of trustworthiness
> assertions rooted at a known trustworthy source, but I suspect that will
> rarely be the case.  More likely, I expect to see a lot of triples that amount to
> "+1" and "-1" from one source applied to another.  Hopefully there will be
> more explanation included, and it will be clear whether it's applied to
> data/content (a g-snap), the database in general, over time (a g-box), or
> the data author, or the database maintainer (an agent).
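
The chain described above (S1, P1, P2, U1) can be sketched as a small propagation over "+1"/"-1" assertions. This is only an illustration under assumptions, not a proposal from either WG: the tuple layout, the trust_scores function, and the rule that a source's vote counts only in proportion to its own positive score are all hypothetical choices.

```python
from collections import defaultdict

# Each assertion is (source, target, vote), where vote is +1 or -1.
# This mirrors the chain in the text: U1 vouches for P2, P2 distrusts P1,
# and P1 distrusts statement S1.
ASSERTIONS = [
    ("U1", "P2", +1),
    ("P2", "P1", -1),
    ("P1", "S1", -1),
]


def trust_scores(assertions, roots, rounds=10):
    """Propagate trust outward from roots that are trusted a priori.

    Assumed rule (hypothetical): a vote is weighted by the voter's current
    score if that score is positive, and ignored otherwise. Scores are
    clamped to the range [-1.0, 1.0].
    """
    roots = set(roots)
    scores = {root: 1.0 for root in roots}
    for _ in range(rounds):
        totals = defaultdict(float)
        for source, target, vote in assertions:
            weight = max(0.0, scores.get(source, 0.0))
            totals[target] += vote * weight
        updated = {root: 1.0 for root in roots}
        for target, total in totals.items():
            if target not in roots:
                updated[target] = max(-1.0, min(1.0, total))
        if updated == scores:   # nothing changed, so we have converged
            break
        scores = updated
    return scores


scores = trust_scores(ASSERTIONS, roots={"U1"})
for name in sorted(scores):
    print(name, scores[name])
```

Under this assumed rule, P2 is trusted via U1 and P1 ends up distrusted, so P1's "-1" against S1 carries no weight and S1 stays at a neutral score. A different weighting rule would give different results, which is part of why the accompanying explanation matters.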
> 
> Well, that's all I have time for right now.  Hopefully this will help clarify
> what some of us are hoping for here.  To be clear, I should say I'm not
> expecting either WG to *solve* these problems, just to give us some building
> blocks that enable system builders to make some progress on solving them.
> 
> One more observation: digging into any of these use cases, it's clear to me I
> can solve that particular one without any standards work beyond settling on
> the vocabulary for that use case.  That is, I can build the
> provenance vocabulary into the application vocabulary.   I think the
> goal here, however, is to factor that work out, because it's common to so
> many application areas.
> 
>      -- Sandro
> 
> [0] http://www.w3.org/2011/rdf-wg/meeting/2011-09-15
> [1] http://www.w3.org/2011/rdf-wg/wiki/TF-Graphs-UC
> [2] http://www.w3.org/2000/Talks/0906-xmlweb-tbl/slide9-0.html
>  or http://www.w3.org/2002/Talks/04-sweb/slide12-0.html
>  or http://www.w3.org/2007/03/layerCake.png
> [3] http://lists.w3.org/Archives/Public/public-rdf-prov/
Received on Thursday, 22 September 2011 15:24:56 UTC