Re: Unifying RDF Provenance Use Case: Trust from Sandro Hawke on 2011-09-22 (public-rdf-prov@w3.org from September 2011)

From: Sandro Hawke <sandro@w3.org>
Date: Thu, 22 Sep 2011 12:02:06 -0700
To: Richard Cyganiak <richard@cyganiak.de>,public-rdf-prov@w3.org
Message-ID: <d782e552-b576-41d6-87ce-0f6340cbdef7@email.android.com>
Richard Cyganiak <richard@cyganiak.de> wrote:

>Hi Sandro,
>
>Thanks for sharing this.
>
>Assuming the data is retrieved from the web (that is, all data is
>received as a representation of some resource that has a URL), then I
>believe that all these issues can be solved using three ingredients:
>
>1. a graph store for named graphs,
>2. vocabulary for expressing authorship, trust/reputation and
>source/mirror relationships,
>3. an incentive for parties on the web to publish trust/reputation
>information.

I didn't call this out in my examples, but how do you handle the cases where data changes?  How can I say that Errol got the name wrong, in a way which won't make me wrong if he corrects himself?

   -- Sandro (walking slowly down a jetway at SFO :-)

>I think that 1) already is standardized, 2) is to a large degree on the
>charter of the Provenance WG, and 3) is, well, the tough one.
>
>It's not quite clear to me what role you see for the RDF WG in this?
>
>In particular, I was hoping for some rationale for your position that
>RDF datasets as defined in SPARQL are insufficient to cover use cases
>for working with multiple graphs in RDF. As far as I can tell, the use
>cases you describe don't even require working with multiple graphs;
>they just require the ability to make statements about web resources.
>What requirements arise from these use cases that are not met by RDF
>datasets?
>
>Thanks,
>Richard
>
>
>On 21 Sep 2011, at 05:54, Sandro Hawke wrote:
>
>> [Please reply to public-rdf-prov@w3.org, not either WG lists.  If
>> you're interested in seeing replies, please subscribe to that list or
>> read its archives [3].]
>> 
>> During the joint RDF/Provenance task force call last week [0], I
>> agreed to draft a single, concrete use case for this work.  At the
>> time, I had forgotten about the Graphs Use Cases page [1], and no one
>> mentioned it.  So I spent some time thinking about it, and talking to
>> Eric Prud'hommeaux.  I haven't yet gone through [1] to determine how
>> each of those relates to this analysis, and I'm headed into a meeting
>> that will probably stop me returning to this for a while.  So I'm
>> going to just send this now.
>> 
>> It seems to me the driving use case here is the desire to glean
>> usable information from imperfect sources.  Reduced to a single word,
>> the use case is Trust.  In TimBL's now-ancient Layer Cake vision for
>> the Semantic Web, the top layer is "Web of Trust" or just "Trust" [2].
>> How can people act based on information they find, when that
>> information might not be right?  How can the system itself help us
>> know what to trust?  Is it possible to make parts of a system more
>> trustworthy than the elements on which they rely?  (I think Google has
>> convinced everyone that can be done; can it be done in an
>> open/interoperable way?)
>> 
>> Here's my minimal concrete use case:
>> 
>>   Alice wants to find a good, local seafood restaurant.  She has many
>>   ways to find restaurant reviews in RDF -- some embedded in people's
>>   blogs, some exported from sites which help people author reviews,
>>   some exported from sites which extract and aggregate reviews from
>>   other sites -- and she'd like to know which sources she can trust.
>>   Actually, she'd like the computer to do that for her, and just
>>   analyze the trustworthy data.  Is there a way the Web can convey
>>   metadata about those reviews that lets software assess the relative
>>   reliability of the different sources? 
>> 
>> That's the short version.  For the rest of this message, I'm going
>> to:
>> 
>>   1.  Explore reasons the data might not be trustworthy.  Trust isn't
>>       just about lies; it's about all the reasons data might be
>>       imperfect.
>> 
>>   2.  Explore other application domains, showing how the same issues
>>       arise.  This isn't just about seafood restaurants, of course,
>>       or even just about consumers making choices. It's also about
>>       medical research, political processes, corporate IT, etc.
>> 
>>   3.  Offer a few thoughts about solutions.  It's what you'd probably
>>       expect; we need a way in RDF to convey the kind of information
>>       needed to determine the trustworthiness of other RDF
>>       sources. We need to be able to talk about particular
>>       statements, about particular curated collections of statements,
>>       and about the people and organizations behind those statements
>>       and databases.
>> 
>> == (1) Some Reasons Data Is Imperfect ==
>> 
>> There are many reasons why information found in RDF might not be
>> trustworthy.  In many cases it is still useful and may be the best
>> information available.  For simplicity, the reasons are here applied
>> first to the classic example problem of selecting a seafood
>> restaurant.  The reasons have much wider applicability, however, and
>> more application domains are explored in section 2.
>> 
>>    DECEPTION: Alice is trying to find the best local seafood
>>    restaurant using reviews posted by various earlier patrons.  One
>>    restaurant, Mal's Mollusks, attempts to trick her by posting many
>>    positive reviews using fake identities.
>> 
>>    ERROR: Errol tries to post a glowing review of his favorite
>>    restaurant, Mel's Mellon Soups, but accidentally files it under
>>    Mal's.  Alice might be led down the wrong path (to eating at
>>    Mal's) by Errol's mistake.
>> 
>>    SIMPLIFICATION: Simon makes a point of trying a new restaurant
>>    every day, but doesn't like to keep detailed records.  After a
>>    while, he comes to the opinion that all the seafood restaurants in
>>    town are really quite good.  One day, while visiting a restaurant
>>    review site, he quickly rates them all as such, without bothering
>>    to notice that he's never even tried Mal's.  (He wouldn't consider
>>    this a mistake; for his purposes, this was good enough data.)
>> 
>>    TIME LAG: Mal is actually Mal Jr, having taken over the restaurant
>>    from his father, Mal Sr.  Mal Sr ran a great restaurant (the
>>    finest squid dumplings in Texas), but it's gone steeply downhill
>>    since Mal Jr took over.  Some of the reviews from the old days
>>    still rightly glow about Mal Sr's restaurant.
>> 
>>    SUBJECTIVITY: Some people actually like Mal Jr's cooking.
>>    There's no accounting for taste, but perhaps the other things
>>    these people like, if Alice knew about them, could give her some
>>    clue to disregard their high opinion of Mal's.
>> 
>> This list of five reasons is not meant to be exhaustive; it's just
>> all I could think of today.
>> 
>> == (2) Some Other Problem Domains ==
>> 
>> Trust reasoning comes up in many other problem domains, of course.
>> Here are two more example domains to show how the need for trust
>> reasoning applies beyond selecting reviews of potential partners in
>> commercial transactions.
>> 
>> 
>> Science
>> 
>>     When one researcher (Alice) is considering building on the work
>>     reported by another researcher (Robbie), similar trust issues
>>     arise.  Here, the consequences can be quite serious.
>> 
>>     DECEPTION: Did Robbie falsify results, in order to publish?
>> 
>>     ERROR: Did Robbie (or one of his assistants) make an honest but
>>     undetected mistake?
>> 
>>     SIMPLIFICATION: This may be the hardest to avoid: what
>>     simplifying assumptions did Robbie make?  They may be common in
>>     the field, but perhaps Alice is in a different sub-field, or a
>>     different part of the world, or a different time, when the
>>     assumptions are different.
>> 
>>     TIME LAG: Perhaps Robbie publishes environmental sample data from
>>     his city on a monthly basis.  For studying a larger picture,
>>     Alice may need to know exactly when the samples were taken and
>>     how recent the "current" ones are.
>> 
>>     SUBJECTIVITY: Robbie's work with human subjects was approved by
>>     his university's research ethics board, but perhaps their
>>     standards are different from those Alice wants to endorse by
>>     building on them.  Or: Robbie's assistants had to use judgment to
>>     classify some results; another set of assistants might have
>>     classified them differently.
>> 
>> An Employee Directory 
>> 
>>     A large company, formed largely by acquiring smaller companies,
>>     maintains an on-line directory of office locations, phone
>>     numbers, email addresses, job titles, etc, for its millions of
>>     employees across 12 continents, on nine planets :-).  Alice is
>>     trying to use it to find Bob's address, so she can mail him the
>>     hat he left at a meeting at her site.
>> 
>>     DECEPTION: Mallory is engaged in corporate espionage and has
>>     altered the directory for this week so Bob's mail actually goes
>>     to Mallory's office; he's waiting for some key blueprints to be
>>     delivered, then he'll change the address back, probably before
>>     Bob notices.  Mallory will be surprised by the hat.
>> 
>>     ERROR: Charlie, a coder in Bob's division, made an error in his
>>     routine to export that division's phone book upstream; the error
>>     causes truncation of the last character of the building name,
>>     turning Bob's "Building 21" into "Building 2".
>> 
>>     SIMPLIFICATION: Bob actually has two different offices in
>>     different buildings, and works from home most of the time.  He
>>     had to pick one for the phone book.  He'll end up not getting the
>>     hat for an extra week because of this.
>> 
>>     TIME LAG: Bob switched offices 6 months ago.  It took him 2
>>     months to get around to updating the phone book, and the upstream
>>     data flow only makes it all the way through every six months,
>>     so Alice still sees his old address.
>> 
>>     SUBJECTIVITY: Bob's building has several different names and
>>     nicknames it has acquired over the years.  Bob and a few others
>>     in his group still call it the "AI Building", so that's what he
>>     put in the phone book.  The new kid in the mail room doesn't know
>>     that term, so the package gets returned or delayed.
>> 
>> There are other areas, of course, that call for trust reasoning, such
>> as:
>> 
>> - political decision making (voting, donating)
>> 
>> - information used to match employers with employees (hiring, job
>>  search)
>> 
>> - information used in expanding one's social network (connecting with
>>  new colleagues, friends, dating)
>> 
>> ... and I'm sure many, many more.  If you need information, you need
>> to know if you can trust it.
>> 
>> == (3)  Solutions == 
>> 
>> So what do these examples have in common, and how might we address
>> them with some standard technology?
>> 
>> In every case, the data consumer (Alice) obtains some information
>> (the data) and would benefit from having some additional information
>> (the metadata) which would help her to determine whether or how she
>> can safely rely on the data.
>> 
>> The metadata might come from the data provider, disclaiming or
>> clarifying it.  It might also come from an intermediary or
>> aggregator, saying how and where they got it.  Or it could come from
>> many different kinds of third parties, like ratings agencies, the
>> public, or the information consumer's social network.
>> 
>> I see an interesting division in the kinds of metadata:
>> 
>>   1.  is the data she retrieved trustworthy?
>>   2.  is the person/organization who authored that data trustworthy?
>>   3.  is the data source (database URL) she retrieved it from
>>       trustworthy?
>>   4.  is the person/organization who runs the data source
>>       trustworthy?
>> 
>> These can be quite different.  The people can be trustworthy but
>> run a data source full of admittedly low quality data.  Or a database
>> of data that's mostly correct can have some bad triples in it.
>> 
>> Note that it's possible for metadata to have its own metadata.  For
>> instance, statement S1 may be declared untrustworthy by person P1 who
>> is declared untrustworthy by person P2 who is declared trustworthy in
>> a statement available at source U1, etc, etc.  Ideally there's a
>> chain of trustworthiness assertions rooted at a known trustworthy
>> source, but I suspect that will rarely be the case.  More likely, I
>> expect to see a lot of triples that amount to "+1" and "-1" from one
>> source applied to another.  Hopefully there will be more explanation
>> included, and it will be clear whether it's applied to data/content
>> (a g-snap), the database in general, over time (a g-box), or the
>> data author, or the database maintainer (an agent).
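
Illustrative sketch, not part of the original thread: one way the
"+1"/"-1" chain described above might be evaluated.  Everything here is
an assumption made for illustration: the Vote record, the weight() and
score() helpers, the depth limit, and the choice to treat U1 as a
trusted root that votes directly.  No WG has specified any of this.

```python
from dataclasses import dataclass

# Hypothetical labels for what a vote is about, borrowing the message's terms.
SNAPSHOT, SOURCE, AGENT = "g-snap", "g-box", "agent"

@dataclass(frozen=True)
class Vote:
    voter: str   # who makes the assertion
    target: str  # what is being rated (statement, source, or agent)
    kind: str    # SNAPSHOT, SOURCE, or AGENT
    value: int   # +1 or -1

def weight(agent, votes, roots, depth):
    """How much to believe `agent`, from 0.0 to 1.0."""
    if agent in roots:
        return 1.0
    if depth == 0:
        return 0.0  # chain too long (or cyclic): give no weight
    total = sum(v.value * weight(v.voter, votes, roots, depth - 1)
                for v in votes if v.target == agent and v.kind == AGENT)
    # Clamp so that a distrusted voter's "-1" is ignored, not flipped to "+1".
    return max(0.0, min(1.0, total))

def score(target, votes, roots, max_depth=3):
    """Sum the votes on `target`, each scaled by belief in its voter."""
    return sum(v.value * weight(v.voter, votes, roots, max_depth)
               for v in votes if v.target == target)

# The chain from the paragraph above: U1 vouches for P2, P2 distrusts P1,
# and P1 distrusts statement S1.
votes = [
    Vote("U1", "P2", AGENT, +1),
    Vote("P2", "P1", AGENT, -1),
    Vote("P1", "S1", SNAPSHOT, -1),
]
print(score("S1", votes, roots={"U1"}))
```

With U1 as the only root, P2 is fully believed, P1 ends up with no
weight, and so P1's "-1" on S1 does not count against it.  The kind
field is only recorded on statement votes here; a real consumer would
presumably treat votes on a g-snap, a g-box, and an agent differently.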
>> 
>> Well, that's all I have time for right now.  Hopefully this will help
>> clarify what some of us are hoping for here.  To be clear, I should
>> say I'm not expecting either WG to *solve* these problems, just to give
>> us some building blocks that enable system builders to make some
>> progress on solving them.
>> 
>> One more observation: digging into any of these use cases, it's
>> clear to me I can solve that particular one without any standards
>> work beyond settling on the vocabulary for that use case.  That is, I
>> can build the provenance vocabulary into the application vocabulary.
>> I think the goal here, however, is to factor that work out, because
>> it's common to so many application areas.
>> 
>>     -- Sandro
>> 
>> [0] http://www.w3.org/2011/rdf-wg/meeting/2011-09-15
>> [1] http://www.w3.org/2011/rdf-wg/wiki/TF-Graphs-UC
>> [2] http://www.w3.org/2000/Talks/0906-xmlweb-tbl/slide9-0.html
>> or http://www.w3.org/2002/Talks/04-sweb/slide12-0.html
>> or http://www.w3.org/2007/03/layerCake.png
>> [3] http://lists.w3.org/Archives/Public/public-rdf-prov/
>> 
>> 

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
Received on Thursday, 22 September 2011 19:02:10 UTC