- From: Sandro Hawke <sandro@w3.org>
- Date: Tue, 20 Sep 2011 21:54:37 -0700
- To: public-rdf-prov@w3.org
- Cc: public-rdf-wg <public-rdf-wg@w3.org>, Provenance WG <public-prov-wg@w3.org>
[Please reply to public-rdf-prov@w3.org, not to either WG list. If you're interested in seeing replies, please subscribe to that list or read its archives [3].]

During the joint RDF/Provenance task force call last week [0], I agreed to draft a single, concrete use case for this work. At the time, I had forgotten about the Graphs Use Cases page [1], and no one mentioned it. So I spent some time thinking about it, and talking to Eric Prud'hommeaux. I haven't yet gone through [1] to determine how each of those relates to this analysis, and I'm headed into a meeting that will probably stop me returning to this for a while. So I'm going to just send this now.

It seems to me the driving use case here is the desire to glean usable information from imperfect sources. Reduced to a single word, the use case is Trust. In TimBL's now-ancient Layer Cake vision for the Semantic Web, the top layer is "Web of Trust" or just "Trust" [2]. How can people act based on information they find, when that information might not be right? How can the system itself help us know what to trust? Is it possible to make parts of a system more trustworthy than the elements on which they rely? (I think Google has convinced everyone that can be done; can it be done in an open/interoperable way?)

Here's my minimal concrete use case:

Alice wants to find a good, local seafood restaurant. She has many ways to find restaurant reviews in RDF -- some embedded in people's blogs, some exported from sites which help people author reviews, some exported from sites which extract and aggregate reviews from other sites -- and she'd like to know which sources she can trust. Actually, she'd like the computer to do that for her, and just analyze the trustworthy data. Is there a way the Web can convey metadata about those reviews that lets software assess the relative reliability of the different sources?

That's the short version. For the rest of this message, I'm going to:

1. Explore reasons the data might not be trustworthy. Trust isn't just about lies; it's about all the reasons data might be imperfect.

2. Explore other application domains, showing how the same issues arise. This isn't just about seafood restaurants, of course, or even just about consumers making choices. It's also about medical research, political processes, corporate IT, etc.

3. Offer a few thoughts about solutions. It's what you'd probably expect; we need a way in RDF to convey the kind of information needed to determine the trustworthiness of other RDF sources. We need to be able to talk about particular statements, about particular curated collections of statements, and about the people and organizations behind those statements and databases.

== (1) Some Reasons Data Is Imperfect ==

There are many reasons why information found in RDF might not be trustworthy. In many cases it is still useful and may be the best information available. For simplicity, the reasons are here applied first to the classic example problem of selecting a seafood restaurant. The reasons have much wider applicability, however, and more application domains are explored in section 2.

DECEPTION: Alice is trying to find the best local seafood restaurant using reviews posted by various earlier patrons. One restaurant, Mal's Mollusks, attempts to trick her by posting many positive reviews using fake identities.

ERROR: Errol tries to post a glowing review of his favorite restaurant, Mel's Mellon Soups, but accidentally files it under Mal's. Alice might be led down the wrong path (to eating at Mal's) by Errol's mistake.

SIMPLIFICATION: Simon makes a point of trying a new restaurant every day, but doesn't like to keep detailed records. After a while, he comes to the opinion that all the seafood restaurants in town are really quite good. One day, while visiting a restaurant review site, he quickly rates them all as such, without bothering to notice that he's never even tried Mal's.
(He wouldn't consider this a mistake; for his purposes, this was good enough data.)

TIME LAG: Mal is actually Mal Jr, having taken over the restaurant from his father, Mal Sr. Mal Sr ran a great restaurant (the finest squid dumplings in Texas), but it's gone steeply downhill since Mal Jr took over. Some of the reviews from the old days still rightly glow about Mal Sr's restaurant.

SUBJECTIVITY: Some people actually like Mal Jr's cooking. There's no accounting for taste, but perhaps the other things these people like, if Alice knew about them, could give her some clue to disregard their high opinion of Mal's.

This list of five reasons is not meant to be exhaustive; it's just all I could think of today.

== (2) Some Other Problem Domains ==

Trust reasoning comes up in many other problem domains, of course. Here are two more example domains to show how the need for trust reasoning applies beyond selecting reviews of potential partners in commercial transactions.

Science

When one researcher (Alice) is considering building on the work reported by another researcher (Robbie), similar trust issues arise. Here, the consequences can be quite serious.

DECEPTION: Did Robbie falsify results, in order to publish?

ERROR: Did Robbie (or one of his assistants) make an honest but undetected mistake?

SIMPLIFICATION: This may be the hardest to avoid: what simplifying assumptions did Robbie make? They may be common in the field, but perhaps Alice is in a different sub-field, or a different part of the world, or a different time, when the assumptions are different.

TIME LAG: Perhaps Robbie publishes environmental sample data from his city on a monthly basis. For studying a larger picture, Alice may need to know exactly when the samples were taken and how recent the "current" ones are.

SUBJECTIVITY: Robbie's work with human subjects was approved by his university's research ethics board; perhaps its standards are different from those Alice wants to endorse by building on his work.
Or: Robbie's assistants had to use judgment to classify some results; another set of assistants might have classified them differently.

An Employee Directory

A large company, formed largely by acquiring smaller companies, maintains an on-line directory of office locations, phone numbers, email addresses, job titles, etc., for its millions of employees across 12 continents, on nine planets :-). Alice is trying to use it to find Bob's address, so she can mail him the hat he left at a meeting at her site.

DECEPTION: Mallory is engaged in corporate espionage and has altered the directory for this week so Bob's mail actually goes to Mallory's office; he's waiting for some key blueprints to be delivered, then he'll change the address back, probably before Bob notices. Mallory will be surprised by the hat.

ERROR: Charlie, a coder in Bob's division, made an error in his routine to export that division's phone book upstream; the error causes truncation of the last character of the building name, turning Bob's "Building 21" into "Building 2".

SIMPLIFICATION: Bob actually has two different offices in different buildings, and works from home most of the time. He had to pick one for the phone book. He'll end up not getting the hat for an extra week because of this.

TIME LAG: Bob switched offices 6 months ago. It took him 2 months to get around to updating the phone book, and the upstream data flow only makes it all the way through every six months, so Alice still sees his old address.

SUBJECTIVITY: Bob's building has several different names and nicknames it has acquired over the years. Bob and a few others in his group still call it the "AI Building", so that's what he put in the phone book. The new kid in the mail room doesn't know that term, so the package gets returned or delayed.

There are other areas, of course, that call for trust reasoning, such as:

- political decision making (voting, donating)
- information used to match employers with employees (hiring, job search)
- information used in expanding one's social network (connecting with new colleagues, friends, dating)

... and I'm sure many, many more. If you need information, you need to know if you can trust it.

== (3) Solutions ==

So what do these examples have in common, and how might we address them with some standard technology?

In every case, the data consumer (Alice) obtains some information (the data) and would benefit from having some additional information (the metadata) which would help her to determine whether or how she can safely rely on the data.

The metadata might come from the data provider, disclaiming or clarifying it. It might also come from an intermediary or aggregator, saying how and where they got it. Or it could come from many different kinds of third parties, like ratings agencies, the public, or the information consumer's social network.

I see an interesting division in the kinds of metadata:

1. is the data she retrieved trustworthy?
2. is the person/organization who authored that data trustworthy?
3. is the data source (database URL) she retrieved it from trustworthy?
4. is the person/organization who runs the data source trustworthy?

These can be quite different. The people can be trustworthy but run a data source full of admittedly low quality data. Or a database of data that's mostly correct can have some bad triples in it.

Note that it's possible for metadata to have its own metadata. For instance, statement S1 may be declared untrustworthy by person P1 who is declared untrustworthy by person P2 who is declared trustworthy in a statement available at source U1, etc., etc. Ideally there's a chain of trustworthiness assertions rooted at a known trustworthy source, but I suspect that will rarely be the case.
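
To make the S1/P1/P2/U1 chain above concrete, here is a minimal Python sketch; it is an illustration only, not a proposal for either WG. The names `Assertion` and `verdict`, the set of trusted roots, and the rule that a vote counts only when its source is itself trusted are all assumptions made for this example.

```python
from typing import FrozenSet, List, NamedTuple, Optional, Set


class Assertion(NamedTuple):
    """One piece of trust metadata (hypothetical shape, for illustration)."""
    source: str  # who makes the claim (an agent or a data source)
    target: str  # what the claim is about (a statement, agent, or source)
    vote: int    # +1 for "trustworthy", -1 for "untrustworthy"


def verdict(
    target: str,
    assertions: List[Assertion],
    roots: Set[str],
    _seen: FrozenSet[str] = frozenset(),
) -> Optional[bool]:
    """Return True (trusted), False (distrusted), or None (no usable evidence).

    Assumed rule: only votes cast by trusted sources are counted, and
    anything listed in `roots` is trusted without further evidence.
    """
    if target in roots:
        return True
    if target in _seen:
        # Circular chain of assertions: treat it as no evidence.
        return None
    seen = _seen | {target}
    score = 0
    for a in assertions:
        if a.target == target and verdict(a.source, assertions, roots, seen) is True:
            score += a.vote
    if score > 0:
        return True
    if score < 0:
        return False
    return None


# The chain from the text: U1 vouches for P2, P2 distrusts P1, P1 distrusts S1.
assertions = [
    Assertion("U1", "P2", +1),
    Assertion("P2", "P1", -1),
    Assertion("P1", "S1", -1),
]
roots = {"U1"}  # assumed: U1 is a source Alice already trusts

for name in ("P2", "P1", "S1"):
    print(name, verdict(name, assertions, roots))
```

Under these assumptions P2 comes out trusted, P1 distrusted, and S1 ends up with no verdict at all: P1's "-1" on S1 is discarded because P1 has been voted down by someone Alice has reason to believe. Other rules (weighting, majority voting, decay over time) would be just as plausible.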
More likely, I expect to see a lot of triples that amount to "+1" and "-1" from one source applied to another. Hopefully there will be more explanation included, and it will be clear whether it's applied to data/content (a g-snap), the database in general, over time (a g-box), or the data author or the database maintainer (an agent).

Well, that's all I have time for right now. Hopefully this will help clarify what some of us are hoping for here. To be clear, I should say I'm not expecting either WG to *solve* these problems, just to give us some building blocks that enable system builders to make some progress on solving them.

One more observation: digging into any of these use cases, it's clear to me I can solve that particular one without any standards work beyond settling on the vocabulary for that use case. That is, I can build the provenance vocabulary into the application vocabulary. I think the goal here, however, is to factor that work out, because it's common to so many application areas.

   -- Sandro

[0] http://www.w3.org/2011/rdf-wg/meeting/2011-09-15
[1] http://www.w3.org/2011/rdf-wg/wiki/TF-Graphs-UC
[2] http://www.w3.org/2000/Talks/0906-xmlweb-tbl/slide9-0.html or
    http://www.w3.org/2002/Talks/04-sweb/slide12-0.html or
    http://www.w3.org/2007/03/layerCake.png
[3] http://lists.w3.org/Archives/Public/public-rdf-prov/
Received on Wednesday, 21 September 2011 04:54:46 UTC