Re: Unifying RDF Provenance Use Case: Trust

Hi Sandro,

Thanks for sharing this.

Assuming the data is retrieved from the web (that is, all data is received as a representation of some resource that has a URL), then I believe that all these issues can be solved using three ingredients:

1. a graph store for named graphs,
2. vocabulary for expressing authorship, trust/reputation and source/mirror relationships,
3. an incentive for parties on the web to publish trust/reputation information.

I think that 1) already is standardized, 2) is to a large degree on the charter of the Provenance WG, and 3) is, well, the tough one.
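
As a rough sketch of how ingredients 1 and 2 could fit together (plain Python rather than a real RDF store; the ex: vocabulary terms and the example.org URLs are made up for illustration and are not taken from any existing vocabulary):

```python
from typing import Dict, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical vocabulary terms, invented for this sketch only.
EX_STARS = "http://example.org/vocab#stars"
EX_TRUST_RATING = "http://example.org/vocab#trustRating"

# Ingredient 1: a store of named graphs, keyed by the URL each
# representation was retrieved from.
store: Dict[str, Set[Triple]] = {
    "http://example.org/foodblog/reviews": {
        ("http://example.org/restaurant/mels", EX_STARS, "5"),
    },
    "http://example.org/fake-identities/reviews": {
        ("http://example.org/restaurant/mals", EX_STARS, "5"),
    },
}

# Ingredient 2: ordinary statements whose subjects are web resources
# (here, the URLs the graphs came from).
metadata: Set[Triple] = {
    ("http://example.org/foodblog/reviews", EX_TRUST_RATING, "1"),
    ("http://example.org/fake-identities/reviews", EX_TRUST_RATING, "-1"),
}


def trusted_triples(
    store: Dict[str, Set[Triple]],
    metadata: Set[Triple],
    threshold: int = 0,
) -> Set[Triple]:
    """Merge only the graphs whose source URL is rated above the threshold.

    Graphs with no rating at all are left out.
    """
    ratings = {s: int(o) for (s, p, o) in metadata if p == EX_TRUST_RATING}
    merged: Set[Triple] = set()
    for url, graph in store.items():
        rating: Optional[int] = ratings.get(url)
        if rating is not None and rating > threshold:
            merged |= graph
    return merged


usable = trusted_triples(store, metadata)
```

Here `usable` ends up holding only the review retrieved from the positively rated URL. Where the ratings themselves would come from is ingredient 3, which the sketch does not address.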

It's not quite clear to me what role you see for the RDF WG in this?

In particular, I was hoping for some rationale for your position that RDF datasets as defined in SPARQL are insufficient to cover use cases for working with multiple graphs in RDF. As far as I can tell, the use cases you describe don't even require working with multiple graphs; they just require the ability to make statements about web resources. What requirements arise from these use cases that are not met by RDF datasets?

Thanks,
Richard


On 21 Sep 2011, at 05:54, Sandro Hawke wrote:

> [Please reply to public-rdf-prov@w3.org, not to either WG list.  If
> you're interested in seeing replies, please subscribe to that list or
> read its archives [3].]
> 
> During the joint RDF/Provenance task force call last week [0], I agreed
> to draft a single, concrete use case for this work.  At the time, I had
> forgotten about the Graphs Use Cases page [1], and no one mentioned
> it.  So I spent some time thinking about it, and talking to Eric
> Prud'hommeaux.  I haven't yet gone through [1] to determine how each of
> those relates to this analysis, and I'm headed into a meeting that will
> probably stop me returning to this for a while.  So I'm going to just
> send this now.
> 
> It seems to me the driving use case here is the desire to glean usable
> information from imperfect sources.  Reduced to a single word, the use
> case is Trust.  In TimBL's now-ancient Layer Cake vision for the
> Semantic Web, the top layer is "Web of Trust" or just "Trust" [2].  How
> can people act based on information they find, when that information
> might not be right? How can the system itself help us know what to
> trust?  Is it possible to make parts of a system more trustworthy than
> the elements on which they rely?  (I think Google has convinced
> everyone that can be done; can it be done in an open/interoperable way?)
> 
> Here's my minimal concrete use case:
> 
>   Alice wants to find a good, local seafood restaurant.  She has many
>   ways to find restaurant reviews in RDF -- some embedded in people's
>   blogs, some exported from sites which help people author reviews,
>   some exported from sites which extract and aggregate reviews from
>   other sites -- and she'd like to know which sources she can trust.
>   Actually, she'd like the computer to do that for her, and just
>   analyze the trustworthy data.  Is there a way the Web can convey
>   metadata about those reviews that lets software assess the relative
>   reliability of the different sources? 
> 
> That's the short version.  For the rest of this message, I'm going to:
> 
>   1.  Explore reasons the data might not be trustworthy.  Trust isn't
>       just about lies; it's about all the reasons data might be
>       imperfect.
> 
>   2.  Explore other application domains, showing how the same issues
>       arise.  This isn't just about seafood restaurants, of course,
>       or even just about consumers making choices. It's also about
>       medical research, political processes, corporate IT, etc.
> 
>   3.  A few thoughts about solutions.  It's what you'd probably
>       expect; we need a way in RDF to convey the kind of information
>       needed to determine the trustworthiness of other RDF
>       sources. We need to be able to talk about particular
>       statements, about particular curated collections of statements,
>       and about the people and organizations behind those statements
>       and databases.
> 
> == (1) Some Reasons Data Is Imperfect ==
> 
> There are many reasons why information found in RDF might not be
> trustworthy.  In many cases it is still useful and may be the best
> information available.  For simplicity, the reasons are here applied
> first to the classic example problem of selecting a seafood
> restaurant.  The reasons have much wider applicability, however, and
> more application domains are explored in section 2.
> 
>    DECEPTION: Alice is trying to find the best local seafood
>    restaurant using reviews posted by various earlier patrons.  One
>    restaurant, Mal's Mollusks, attempts to trick her by posting many
>    positive reviews using fake identities.
> 
>    ERROR: Errol tries to post a glowing review of his favorite
>    restaurant, Mel's Mellon Soups, but accidentally files it under
>    Mal's.  Alice might be led down the wrong path (to eating at
>    Mal's) by Errol's mistake.
> 
>    SIMPLIFICATION: Simon makes a point of trying a new restaurant
>    every day, but doesn't like to keep detailed records.  After a
>    while, he comes to the opinion that all the Seafood restaurants in
>    town are really quite good.  One day, while visiting a restaurant
>    review site, he quickly rates them all as such, without bothering
>    to notice that he's never even tried Mal's.  (He wouldn't consider
>    this a mistake; for his purposes, this was good enough data.)
> 
>    TIME LAG: Mal is actually Mal Jr, having taken over the restaurant
>    from his father, Mal Sr.  Mal Sr ran a great restaurant (the
>    finest squid dumplings in Texas), but it's gone steeply downhill
>    since Mal Jr took over.  Some of the reviews from the old days
>    still rightly glow about Mal Sr's restaurant.
> 
>    SUBJECTIVITY: Some people actually like Mal Jr's cooking.  There's no
>    accounting for taste, but perhaps the other things these people
>    like, if Alice knew about them, could give her some clue to
>    disregard their high opinion of Mal's.
> 
> This list of five reasons is not meant to be exhaustive; it's just all
> I could think of today.
> 
> == (2) Some Other Problem Domains ==
> 
> Trust reasoning comes up in many other problem domains, of course.
> Here are two more example domains to show how the need for trust
> reasoning applies beyond selecting reviews of potential partners in
> commercial transactions.
> 
> 
> Science
> 
>     When one researcher (Alice) is considering building on the work
>     reported by another researcher (Robbie), similar trust issues
>     arise.  Here, the consequences can be quite serious.
> 
>     DECEPTION: Did Robbie falsify results, in order to publish?
> 
>     ERROR: Did Robbie (or one of his assistants) make an honest but
>     undetected mistake?
> 
>     SIMPLIFICATION: This may be the hardest to avoid: what
>     simplifying assumptions did Robbie make?  They may be common in
>     the field, but perhaps Alice is in a different sub-field, or a
>     different part of the world, or a different time, when the
>     assumptions are different.
> 
>     TIME LAG: Perhaps Robbie publishes environmental sample data from
>     his city on a monthly basis.  For studying a larger picture,
>     Alice may need to know exactly when the samples were taken and
>     how recent the "current" ones are.
> 
>     SUBJECTIVITY: Robbie's work with human subjects was approved by
>     his university's research ethics board, but perhaps their standards
>     are different from those Alice wants to endorse by building on
>     them.  Or: Robbie's assistants had to use judgment to classify
>     some results; another set of assistants might have classified
>     them differently.
> 
> An Employee Directory 
> 
>     A large company, formed largely by acquiring smaller companies,
>     maintains an on-line directory of office locations, phone
>     numbers, email addresses, job titles, etc, for its millions of
>     employees across 12 continents, on nine planets :-).  Alice is
>     trying to use it to find Bob's address, so she can mail him the
>     hat he left at a meeting at her site.
> 
>     DECEPTION: Mallory is engaged in corporate espionage and has
>     altered the directory for this week so Bob's mail actually goes
>     to Mallory's office; he's waiting for some key blueprints to be
>     delivered, then he'll change the address back, probably before
>     Bob notices.  Mallory will be surprised by the hat.
> 
>     ERROR: Charlie, a coder in Bob's division, made an error in his
>     routine to export that division's phone book upstream; the error
>     causes truncation of the last character of the building name,
>     turning Bob's "Building 21" into "Building 2".
> 
>     SIMPLIFICATION: Bob actually has two different offices in
>     different buildings, and works from home most of the time.  He
>     had to pick one for the phone book.  He'll end up not getting the
>     hat for an extra week because of this.
> 
>     TIME LAG: Bob switched offices 6 months ago.  It took him 2
>     months to get around to updating the phone book, and the upstream
>     data flow only makes it all the way through every six months,
>     so Alice still sees his old address.
> 
>     SUBJECTIVITY: Bob's building has several different names and
>     nicknames it has acquired over the years.  Bob and a few others
>     in his group still call it the "AI Building", so that's what he
>     put in the phone book.  The new kid in the mail room doesn't know
>     that term, so the package gets returned or delayed.
> 
> There are other areas, of course, that call for trust reasoning, such
> as:
> 
> - political decision making (voting, donating)
> 
> - information used to match employers with employees (hiring, job
>  search)
> 
> - information used in expanding one's social network (connecting with
>  new colleagues, friends, dating)
> 
> ... and I'm sure many, many more.  If you need information, you need to
> know if you can trust it.
> 
> == (3)  Solutions == 
> 
> So what do these examples have in common, and how might we address them
> with some standard technology?
> 
> In every case, the data consumer (Alice) obtains some information (the
> data) and would benefit from having some additional information (the
> metadata) which would help her to determine whether or how she can
> safely rely on the data. 
> 
> The metadata might come from the data provider, disclaiming or
> clarifying it.  It might also come from an intermediary or aggregator,
> saying how and where they got it.  Or it could come from many
> different kinds of third parties, like ratings agencies, the public,
> or the information consumer's social network.
> 
> I see an interesting division in the kinds of metadata:
> 
>   1.  is the data she retrieved trustworthy?
>   2.  is the person/organization who authored that data trustworthy?
>   3.  is the data source (database URL) she retrieved it from trustworthy?
>   4.  is the person/organization who runs the data source trustworthy?
> 
> These can be quite different.  The people can be trustworthy but
> run a data source full of admittedly low quality data.  Or a database
> of data that's mostly correct can have some bad triples in it.
> 
> Note that it's possible for metadata to have its own metadata.  For
> instance, statement S1 may be declared untrustworthy by person P1 who
> is declared untrustworthy by person P2 who is declared trustworthy in
> a statement available at source U1, etc, etc.  Ideally there's a chain
> of trustworthiness assertions rooted at a known trustworthy source,
> but I suspect that will rarely be the case.  More likely, I expect to
> see a lot of triples that amount to "+1" and "-1" from one source
> applied to another.  Hopefully there will be more explanation
> included, and it will be clear whether it's applied to data/content (a
> g-snap), the database in general, over time (a g-box), or the
> data author, or the database maintainer (an agent).
> 
> Well, that's all I have time for right now.  Hopefully this will help
> clarify what some of us are hoping for here.  To be clear, I should
> say I'm not expecting either WG to *solve* these problems, just to give
> us some building blocks that enable system builders to make some
> progress on solving them.
> 
> One more observation: digging into any of these use cases, it's clear
> to me I can solve that particular one without any standards work beyond
> settling on the vocabulary for that use case.  That is, I can build the
> provenance vocabulary into the application vocabulary.   I think the
> goal here, however, is to factor that work out, because it's common to
> so many application areas.
> 
>     -- Sandro
> 
> [0] http://www.w3.org/2011/rdf-wg/meeting/2011-09-15
> [1] http://www.w3.org/2011/rdf-wg/wiki/TF-Graphs-UC
> [2] http://www.w3.org/2000/Talks/0906-xmlweb-tbl/slide9-0.html
> or http://www.w3.org/2002/Talks/04-sweb/slide12-0.html
> or http://www.w3.org/2007/03/layerCake.png
> [3] http://lists.w3.org/Archives/Public/public-rdf-prov/
> 
> 

Received on Thursday, 22 September 2011 17:48:12 UTC