
Re: Unifying RDF Provenance Use Case: Trust

From: David Wood <david@3roundstones.com>
Date: Wed, 21 Sep 2011 11:55:36 -0400
Message-Id: <1CA10280-2A2F-4806-BEE9-8CF4A8DEC271@3roundstones.com>
To: public-rdf-prov@w3.org
Hi Sandro,

Can you please explain how and whether this use case differs from:


On Sep 21, 2011, at 24:54, Sandro Hawke wrote:

> [Please reply to public-rdf-prov@w3.org, not to either WG list.  If
> you're interested in seeing replies, please subscribe to that list or
> read its archives [3].]
> During the joint RDF/Provenance task force call last week [0], I agreed
> to draft a single, concrete use case for this work.  At the time, I had
> forgotten about the Graphs Use Cases page [1], and no one mentioned
> it.  So I spent some time thinking about it, and talking to Eric
> Prud'hommeaux.  I haven't yet gone through [1] to determine how each of
> those relates to this analysis, and I'm headed into a meeting that will
> probably stop me returning to this for a while.  So I'm going to just
> send this now.
> It seems to me the driving use case here is the desire to glean usable
> information from imperfect sources.  Reduced to a single word, the use
> case is Trust.  In TimBL's now-ancient Layer Cake vision for the
> Semantic Web, the top layer is "Web of Trust" or just "Trust" [2].  How
> can people act based on information they find, when that information
> might not be right? How can the system itself help us know what to
> trust?  Is it possible to make parts of a system more trustworthy than
> the elements on which they rely?  (I think Google has convinced
> everyone that can be done; can it be done in an open/interoperable way?)
> Here's my minimal concrete use case:
>   Alice wants to find a good, local seafood restaurant.  She has many
>   ways to find restaurant reviews in RDF -- some embedded in people's
>   blogs, some exported from sites which help people author reviews,
>   some exported from sites which extract and aggregate reviews from
>   other sites -- and she'd like to know which sources she can trust.
>   Actually, she'd like the computer to do that for her, and just
>   analyze the trustworthy data.  Is there a way the Web can convey
>   metadata about those reviews that lets software assess the relative
>   reliability of the different sources? 
> That's the short version.  For the rest of this message, I'm going to:
>   1.  Explore reasons the data might not be trustworthy.  Trust isn't
>       just about lies; it's about all the reasons data might be
>       imperfect.
>   2.  Explore other application domains, showing how the same issues
>       arise.  This isn't just about seafood restaurants, of course,
>       or even just about consumers making choices. It's also about
>       medical research, political processes, corporate IT, etc.
>   3.  A few thoughts about solutions.  It's what you'd probably
>       expect; we need a way in RDF to convey the kind of information
>       needed to determine the trustworthiness of other RDF
>       sources. We need to be able to talk about particular
>       statements, about particular curated collections of statements,
>       and about the people and organizations behind those statements
>       and databases.
> == (1) Some Reasons Data Is Imperfect ==
> There are many reasons why information found in RDF might not be
> trustworthy.  In many cases it is still useful and may be the best
> information available.  For simplicity, the reasons are here applied
> first to the classic example problem of selecting a seafood
> restaurant.  The reasons have much wider applicability, however, and
> more application domains are explored in section 2.
>    DECEPTION: Alice is trying to find the best local seafood
>    restaurant using reviews posted by various earlier patrons.  One
>    restaurant, Mal's Mollusks, attempts to trick her by posting many
>    positive reviews using fake identities.
>    ERROR: Errol tries to post a glowing review of his favorite
>    restaurant, Mel's Mellon Soups, but accidentally files it under
>    Mal's.  Alice might be led down the wrong path (to eating at
>    Mal's) by Errol's mistake.
>    SIMPLIFICATION: Simon makes a point of trying a new restaurant
>    every day, but doesn't like to keep detailed records.  After a
>    while, he comes to the opinion that all the seafood restaurants in
>    town are really quite good.  One day, while visiting a restaurant
>    review site, he quickly rates them all as such, without bothering
>    to notice that he's never even tried Mal's.  (He wouldn't consider
>    this a mistake; for his purposes, this was good enough data.)
>    TIME LAG: Mal is actually Mal Jr, having taken over the restaurant
>    from his father, Mal Sr.  Mal Sr ran a great restaurant (the
>    finest squid dumplings in Texas), but it's gone steeply downhill
>    since Mal Jr took over.  Some of the reviews from the old days
>    still rightly glow about Mal Sr's restaurant.
>    SUBJECTIVITY: Some people actually like Mal Jr's cooking.  There's no
>    accounting for taste, but perhaps the other things these people
>    like, if Alice knew about them, could give her some clue to
>    disregard their high opinion of Mal's.
> This list of five reasons is not meant to be exhaustive; it's just all
> I could think of today.
> == (2) Some Other Problem Domains ==
> Trust reasoning comes up in many other problem domains, of course.
> Here are two more example domains to show how the need for trust
> reasoning applies beyond selecting reviews of potential partners in
> commercial transactions.
> Science
>     When one researcher (Alice) is considering building on the work
>     reported by another researcher (Robbie), similar trust issues
>     arise.  Here, the consequences can be quite serious.
>     DECEPTION: Did Robbie falsify results, in order to publish?
>     ERROR: Did Robbie (or one of his assistants) make an honest but
>     undetected mistake?
>     SIMPLIFICATION: This may be the hardest to avoid: what
>     simplifying assumptions did Robbie make?  They may be common in
>     the field, but perhaps Alice is in a different sub-field, or a
>     different part of the world, or a different time, when the
>     assumptions are different.
>     TIME LAG: Perhaps Robbie publishes environmental sample data from
>     his city on a monthly basis.  For studying a larger picture,
>     Alice may need to know exactly when the samples were taken and
>     how recent the "current" ones are.
>     SUBJECTIVITY: Robbie's work with human subjects was approved by
>     his university's research ethics board, but perhaps their standards
>     are different from those Alice wants to endorse by building on
>     them.  Or: Robbie's assistants had to use judgment to classify
>     some results; another set of assistants might have classified
>     them differently.
> An Employee Directory 
>     A large company, formed largely by acquiring smaller companies,
>     maintains an on-line directory of office locations, phone
>     numbers, email addresses, job titles, etc, for its millions of
>     employees across 12 continents, on nine planets :-).  Alice is
>     trying to use it to find Bob's address, so she can mail him the
>     hat he left at a meeting at her site.
>     DECEPTION: Mallory is engaged in corporate espionage and has
>     altered the directory for this week so Bob's mail actually goes
>     to Mallory's office; he's waiting for some key blueprints to be
>     delivered, then he'll change the address back, probably before
>     Bob notices.  He'll be surprised by the hat.
>     ERROR: Charlie, a coder in Bob's division, made an error in his
>     routine to export that division's phone book upstream; the error
>     causes truncation of the last character of the building name,
>     turning Bob's "Building 21" into "Building 2".
>     SIMPLIFICATION: Bob actually has two different offices in
>     different buildings, and works from home most of the time.  He
>     had to pick one for the phone book.  He'll end up not getting the
>     hat for an extra week because of this.
>     TIME LAG: Bob switched offices 6 months ago.  It took him 2
>     months to get around to updating the phone book, and the upstream
>     data flow only makes it all the way through every six months,
>     so Alice still sees his old address.
>     SUBJECTIVITY: Bob's building has several different names and
>     nicknames it has acquired over the years.  Bob, and a few others
>     in his group still call it the "AI Building", so that's what he
>     put in the phone book.  The new kid in the mail room doesn't know
>     that term, so the package gets returned or delayed.
> There are other areas, of course, that call for trust reasoning, such
> as:
> - political decision making (voting, donating)
> - information used to match employers with employees (hiring, job
>  search)
> - information used in expanding one's social network (connecting with
>  new colleagues, friends, dating)
> ... and I'm sure many, many more.  If you need information, you need to
> know if you can trust it.
> == (3)  Solutions == 
> So what do these examples have in common, and how might we address them
> with some standard technology?
> In every case, the data consumer (Alice) obtains some information (the
> data) and would benefit from having some additional information (the
> metadata) which would help her to determine whether or how she can
> safely rely on the data. 
> The metadata might come from the data provider, disclaiming or
> clarifying it.  It might also come from an intermediary or aggregator,
> saying how and where they got it.  Or it could come from many
> different kinds of third parties, like ratings agencies, the public,
> or the information consumer's social network.
> I see an interesting division in the kinds of metadata:
>   1.  is the data she retrieved trustworthy?
>   2.  is the person/organization who authored that data trustworthy?
>   3.  is the data source (database URL) she retrieved it from trustworthy?
>   4.  is the person/organization who runs the data source trustworthy?
> These can be quite different.  The people can be trustworthy but
> run a data source full of admittedly low quality data.  Or a database
> of data that's mostly correct can have some bad triples in it.
> Note that it's possible for metadata to have its own metadata.  For
> instance, statement S1 may be declared untrustworthy by person P1 who
> is declared untrustworthy by person P2 who is declared trustworthy in
> a statement available at source U1, etc, etc.  Ideally there's a chain
> of trustworthiness assertions rooted at a known trustworthy source,
> but I suspect that will rarely be the case.  More likely, I expect to
> see a lot of triples that amount to "+1" and "-1" from one source
> applied to another.  Hopefully there will be more explanation
> included, and it will be clear whether it's applied to data/content (a
> g-snap), the database in general, over time (a g-box), or the
> data author or the database maintainer (an agent).
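The S1/P1/P2/U1 chain above can be made concrete with a small sketch.
This is only an illustration under stated assumptions, not a proposed
vocabulary or anything either WG has defined: the names Assertion,
trust and roots are hypothetical, votes are reduced to +1/-1, and a
vote counts only if its source is itself trusted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """Hypothetical record: `source` votes +1 or -1 on `target`."""
    source: str  # who makes the claim (an agent, or a data source such as a URL)
    target: str  # what the claim is about (a statement, an agent, a data source)
    vote: int    # +1 = declared trustworthy, -1 = declared untrustworthy


def trust(target, assertions, roots, _seen=frozenset()):
    """Return +1 (trusted), -1 (distrusted) or 0 (unknown) for `target`.

    `roots` is the set of names accepted as trustworthy without proof.
    Votes from sources that are not themselves trusted are ignored.
    """
    if target in roots:
        return 1
    if target in _seen:  # already on the current path: avoid cycles
        return 0
    seen = _seen | {target}
    score = 0
    for a in assertions:
        if a.target == target and trust(a.source, assertions, roots, seen) > 0:
            score += a.vote
    return (score > 0) - (score < 0)  # sign of the summed votes


# The chain from the paragraph above, with U1 as the known root.
chain = [
    Assertion("P1", "S1", -1),  # P1 declares S1 untrustworthy
    Assertion("P2", "P1", -1),  # P2 declares P1 untrustworthy
    Assertion("U1", "P2", +1),  # a statement at U1 declares P2 trustworthy
]

for name in ("P2", "P1", "S1"):
    print(name, trust(name, chain, roots={"U1"}))
```

Under these assumptions P1's "-1" on S1 is discarded because P1 is
itself distrusted, so S1 ends up unknown rather than untrustworthy.
With no root at all, every name comes out unknown, which matches the
suspicion that a fully rooted chain will rarely be available.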
> Well, that's all I have time for right now.  Hopefully this will help
> clarify what some of us are hoping for here.  To be clear, I should
> say I'm not expecting either WG to *solve* these problems, just to give
> us some building blocks that enable system builders to make some
> progress on solving them.
> One more observation: digging into any of these use cases, it's clear
> to me I can solve that particular one without any standards work beyond
> settling on the vocabulary for that use case.  That is, I can build the
> provenance vocabulary into the application vocabulary.   I think the
> goal here, however, is to factor that work out, because it's common to
> so many application areas.
>     -- Sandro
> [0] http://www.w3.org/2011/rdf-wg/meeting/2011-09-15
> [1] http://www.w3.org/2011/rdf-wg/wiki/TF-Graphs-UC
> [2] http://www.w3.org/2000/Talks/0906-xmlweb-tbl/slide9-0.html
> or http://www.w3.org/2002/Talks/04-sweb/slide12-0.html
> or http://www.w3.org/2007/03/layerCake.png
> [3] http://lists.w3.org/Archives/Public/public-rdf-prov/
Received on Wednesday, 21 September 2011 15:56:17 UTC
