- From: David Wood <david@3roundstones.com>
- Date: Wed, 21 Sep 2011 11:55:36 -0400
- To: public-rdf-prov@w3.org
Hi Sandro,

Can you please explain how and whether this use case differs from:
http://www.w3.org/2011/rdf-wg/wiki/TF-Graphs-UC#FOAF_Use_Case

Thanks.

Regards,
Dave

On Sep 21, 2011, at 24:54, Sandro Hawke wrote:

> [Please reply to public-rdf-prov@w3.org, not to either WG list. If
> you're interested in seeing replies, please subscribe to that list or
> read its archives [3].]
>
> During the joint RDF/Provenance task force call last week [0], I agreed
> to draft a single, concrete use case for this work. At the time, I had
> forgotten about the Graphs Use Cases page [1], and no one mentioned
> it. So I spent some time thinking about it, and talking to Eric
> Prud'hommeaux. I haven't yet gone through [1] to determine how each of
> those relates to this analysis, and I'm headed into a meeting that will
> probably stop me returning to this for a while. So I'm going to just
> send this now.
>
> It seems to me the driving use case here is the desire to glean usable
> information from imperfect sources. Reduced to a single word, the use
> case is Trust. In TimBL's now-ancient Layer Cake vision for the
> Semantic Web, the top layer is "Web of Trust" or just "Trust" [2]. How
> can people act based on information they find, when that information
> might not be right? How can the system itself help us know what to
> trust? Is it possible to make parts of a system more trustworthy than
> the elements on which they rely? (I think Google has convinced
> everyone that can be done; can it be done in an open/interoperable way?)
>
> Here's my minimal concrete use case:
>
>    Alice wants to find a good, local seafood restaurant. She has many
>    ways to find restaurant reviews in RDF -- some embedded in people's
>    blogs, some exported from sites which help people author reviews,
>    some exported from sites which extract and aggregate reviews from
>    other sites -- and she'd like to know which sources she can trust.
>    Actually, she'd like the computer to do that for her, and just
>    analyze the trustworthy data. Is there a way the Web can convey
>    metadata about those reviews that lets software assess the relative
>    reliability of the different sources?
>
> That's the short version. For the rest of this message, I'm going to:
>
>    1. Explore reasons the data might not be trustworthy. Trust isn't
>       just about lies; it's about all the reasons data might be
>       imperfect.
>
>    2. Explore other application domains, showing how the same issues
>       arise. This isn't just about seafood restaurants, of course,
>       or even just about consumers making choices. It's also about
>       medical research, political processes, corporate IT, etc.
>
>    3. Offer a few thoughts about solutions. It's what you'd probably
>       expect; we need a way in RDF to convey the kind of information
>       needed to determine the trustworthiness of other RDF
>       sources. We need to be able to talk about particular
>       statements, about particular curated collections of statements,
>       and about the people and organizations behind those statements
>       and databases.
>
> == (1) Some Reasons Data Is Imperfect ==
>
> There are many reasons why information found in RDF might not be
> trustworthy. In many cases it is still useful and may be the best
> information available. For simplicity, the reasons are here applied
> first to the classic example problem of selecting a seafood
> restaurant. The reasons have much wider applicability, however, and
> more application domains are explored in section 2.
>
>    DECEPTION: Alice is trying to find the best local seafood
>    restaurant using reviews posted by various earlier patrons. One
>    restaurant, Mal's Mollusks, attempts to trick her by posting many
>    positive reviews using fake identities.
>
>    ERROR: Errol tries to post a glowing review of his favorite
>    restaurant, Mel's Mellon Soups, but accidentally files it under
>    Mal's. Alice might be led down the wrong path (to eating at
>    Mal's) by Errol's mistake.
>
>    SIMPLIFICATION: Simon makes a point of trying a new restaurant
>    every day, but doesn't like to keep detailed records. After a
>    while, he comes to the opinion that all the seafood restaurants in
>    town are really quite good. One day, while visiting a restaurant
>    review site, he quickly rates them all as such, without bothering
>    to notice that he's never even tried Mal's. (He wouldn't consider
>    this a mistake; for his purposes, this was good enough data.)
>
>    TIME LAG: Mal is actually Mal Jr, having taken over the restaurant
>    from his father, Mal Sr. Mal Sr ran a great restaurant (the
>    finest squid dumplings in Texas), but it's gone steeply downhill
>    since Mal Jr took over. Some of the reviews from the old days
>    still rightly glow about Mal Sr's restaurant.
>
>    SUBJECTIVITY: Some people actually like Mal Jr's cooking. There's no
>    accounting for taste, but perhaps the other things these people
>    like, if Alice knew about them, could give her some clue to
>    disregard their high opinion of Mal's.
>
> This list of five reasons is not meant to be exhaustive; it's just all
> I could think of today.
>
> == (2) Some Other Problem Domains ==
>
> Trust reasoning comes up in many other problem domains, of course.
> Here are two more example domains to show how the need for trust
> reasoning applies beyond selecting reviews of potential partners in
> commercial transactions.
>
>
>    Science
>
>       When one researcher (Alice) is considering building on the work
>       reported by another researcher (Robbie), similar trust issues
>       arise. Here, the consequences can be quite serious.
>
>       DECEPTION: Did Robbie falsify results, in order to publish?
>
>       ERROR: Did Robbie (or one of his assistants) make an honest but
>       undetected mistake?
>
>       SIMPLIFICATION: This may be the hardest to avoid: what
>       simplifying assumptions did Robbie make?
>       They may be common in the field, but perhaps Alice is in a
>       different sub-field, or a different part of the world, or a
>       different time, when the assumptions are different.
>
>       TIME LAG: Perhaps Robbie publishes environmental sample data from
>       his city on a monthly basis. For studying a larger picture,
>       Alice may need to know exactly when the samples were taken and
>       how recent the "current" ones are.
>
>       SUBJECTIVITY: Robbie's work with human subjects was approved by
>       his university's research ethics board, but perhaps their
>       standards are different from those Alice wants to endorse by
>       building on them. Or: Robbie's assistants had to use judgment to
>       classify some results; another set of assistants might have
>       classified them differently.
>
>    An Employee Directory
>
>       A large company, formed largely by acquiring smaller companies,
>       maintains an on-line directory of office locations, phone
>       numbers, email addresses, job titles, etc, for its millions of
>       employees across 12 continents, on nine planets :-). Alice is
>       trying to use it to find Bob's address, so she can mail him the
>       hat he left at a meeting at her site.
>
>       DECEPTION: Mallory is engaged in corporate espionage and has
>       altered the directory for this week so Bob's mail actually goes
>       to Mallory's office; he's waiting for some key blueprints to be
>       delivered, then he'll change the address back, probably before
>       Bob notices. He'll be surprised by the hat.
>
>       ERROR: Charlie, a coder in Bob's division, made an error in his
>       routine to export that division's phone book upstream; the error
>       causes truncation of the last character of the building name,
>       turning Bob's "Building 21" into "Building 2".
>
>       SIMPLIFICATION: Bob actually has two different offices in
>       different buildings, and works from home most of the time. He
>       had to pick one for the phone book. He'll end up not getting the
>       hat for an extra week because of this.
>
>       TIME LAG: Bob switched offices 6 months ago.
>       It took him 2 months to get around to updating the phone book,
>       and the upstream data flow only makes it all the way through
>       every six months, so Alice still sees his old address.
>
>       SUBJECTIVITY: Bob's building has several different names and
>       nicknames it has acquired over the years. Bob and a few others
>       in his group still call it the "AI Building", so that's what he
>       put in the phone book. The new kid in the mail room doesn't know
>       that term, so the package gets returned or delayed.
>
> There are other areas, of course, that call for trust reasoning, such
> as:
>
>    - political decision making (voting, donating)
>
>    - information used to match employers with employees (hiring, job
>      search)
>
>    - information used in expanding one's social network (connecting with
>      new colleagues, friends, dating)
>
> ... and I'm sure many, many more. If you need information, you need to
> know if you can trust it.
>
> == (3) Solutions ==
>
> So what do these examples have in common, and how might we address them
> with some standard technology?
>
> In every case, the data consumer (Alice) obtains some information (the
> data) and would benefit from having some additional information (the
> metadata) which would help her to determine whether or how she can
> safely rely on the data.
>
> The metadata might come from the data provider, disclaiming or
> clarifying it. It might also come from an intermediary or aggregator,
> saying how and where they got it. Or it could come from many
> different kinds of third parties, like ratings agencies, the public,
> or the information consumer's social network.
>
> I see an interesting division in the kinds of metadata:
>
>    1. is the data she retrieved trustworthy?
>    2. is the person/organization who authored that data trustworthy?
>    3. is the data source (database URL) she retrieved it from trustworthy?
>    4. is the person/organization who runs the data source trustworthy?
>
> These can be quite different.
> The people can be trustworthy but run a data source full of admittedly
> low quality data. Or a database of data that's mostly correct can have
> some bad triples in it.
>
> Note that it's possible for metadata to have its own metadata. For
> instance, statement S1 may be declared untrustworthy by person P1 who
> is declared untrustworthy by person P2 who is declared trustworthy in
> a statement available at source U1, etc, etc. Ideally there's a chain
> of trustworthiness assertions rooted at a known trustworthy source,
> but I suspect that will rarely be the case. More likely, I expect to
> see a lot of triples that amount to "+1" and "-1" from one source
> applied to another. Hopefully there will be more explanation
> included, and it will be clear whether it's applied to data/content (a
> g-snap), the database in general, over time (a g-box), or the
> data author, or the database maintainer (an agent).
>
> Well, that's all I have time for right now. Hopefully this will help
> clarify what some of us are hoping for here. To be clear, I should
> say I'm not expecting either WG to *solve* these problems, just to give
> us some building blocks that enable system builders to make some
> progress on solving them.
>
> One more observation: digging into any of these use cases, it's clear
> to me I can solve that particular one without any standards work beyond
> settling on the vocabulary for that use case. That is, I can build the
> provenance vocabulary into the application vocabulary. I think the
> goal here, however, is to factor that work out, because it's common to
> so many application areas.
>
> -- Sandro
>
> [0] http://www.w3.org/2011/rdf-wg/meeting/2011-09-15
> [1] http://www.w3.org/2011/rdf-wg/wiki/TF-Graphs-UC
> [2] http://www.w3.org/2000/Talks/0906-xmlweb-tbl/slide9-0.html
>     or http://www.w3.org/2002/Talks/04-sweb/slide12-0.html
>     or http://www.w3.org/2007/03/layerCake.png
> [3] http://lists.w3.org/Archives/Public/public-rdf-prov/
Received on Wednesday, 21 September 2011 15:56:17 UTC