- From: Sandro Hawke <sandro@w3.org>
- Date: Thu, 22 Sep 2011 12:02:06 -0700
- To: Richard Cyganiak <richard@cyganiak.de>, public-rdf-prov@w3.org
Richard Cyganiak <richard@cyganiak.de> wrote:

>Hi Sandro,
>
>Thanks for sharing this.
>
>Assuming the data is retrieved from the web (that is, all data is
>received as a representation of some resource that has a URL), then I
>believe that all these issues can be solved using three ingredients:
>
>1. a graph store for named graphs,
>2. vocabulary for expressing authorship, trust/reputation and
>source/mirror relationships,
>3. an incentive for parties on the web to publish trust/reputation
>information.

I didn't call this out in my examples, but how do you handle the cases
where data changes? How can I say that Errol got the name wrong, in a
way which won't make me wrong if he corrects himself?

    -- Sandro (walking slowly down a jetway at SFO :-)

>I think that 1) is already standardized, 2) is to a large degree on the
>charter of the Provenance WG, and 3) is, well, the tough one.
>
>It's not quite clear to me what role you see for the RDF WG in this?
>
>In particular, I was hoping for some rationale for your position that
>RDF datasets as defined in SPARQL are insufficient to cover use cases
>for working with multiple graphs in RDF. As far as I can tell, the use
>cases you describe don't even require working with multiple graphs;
>they just require the ability to make statements about web resources.
>What requirements arise from these use cases that are not met by RDF
>datasets?
>
>Thanks,
>Richard
>
>
>On 21 Sep 2011, at 05:54, Sandro Hawke wrote:
>
>> [Please reply to public-rdf-prov@w3.org, not to either WG list. If
>> you're interested in seeing replies, please subscribe to that list or
>> read its archives [3].]
>>
>> During the joint RDF/Provenance task force call last week [0], I agreed
>> to draft a single, concrete use case for this work. At the time, I had
>> forgotten about the Graphs Use Cases page [1], and no one mentioned
>> it. So I spent some time thinking about it, and talking to Eric
>> Prud'hommeaux.
>> I haven't yet gone through [1] to determine how each of
>> those relates to this analysis, and I'm headed into a meeting that will
>> probably stop me returning to this for a while. So I'm going to just
>> send this now.
>>
>> It seems to me the driving use case here is the desire to glean usable
>> information from imperfect sources. Reduced to a single word, the use
>> case is Trust. In TimBL's now-ancient Layer Cake vision for the
>> Semantic Web, the top layer is "Web of Trust" or just "Trust" [2]. How
>> can people act based on information they find, when that information
>> might not be right? How can the system itself help us know what to
>> trust? Is it possible to make parts of a system more trustworthy than
>> the elements on which they rely? (I think Google has convinced
>> everyone that can be done; can it be done in an open/interoperable way?)
>>
>> Here's my minimal concrete use case:
>>
>>   Alice wants to find a good, local seafood restaurant. She has many
>>   ways to find restaurant reviews in RDF -- some embedded in people's
>>   blogs, some exported from sites which help people author reviews,
>>   some exported from sites which extract and aggregate reviews from
>>   other sites -- and she'd like to know which sources she can trust.
>>   Actually, she'd like the computer to do that for her, and just
>>   analyze the trustworthy data. Is there a way the Web can convey
>>   metadata about those reviews that lets software assess the relative
>>   reliability of the different sources?
>>
>> That's the short version. For the rest of this message, I'm going to:
>>
>>   1. Explore reasons the data might not be trustworthy. Trust isn't
>>      just about lies; it's about all the reasons data might be
>>      imperfect.
>>
>>   2. Explore other application domains, showing how the same issues
>>      arise. This isn't just about seafood restaurants, of course,
>>      or even just about consumers making choices.
>>      It's also about
>>      medical research, political processes, corporate IT, etc.
>>
>>   3. A few thoughts about solutions. It's what you'd probably
>>      expect; we need a way in RDF to convey the kind of information
>>      needed to determine the trustworthiness of other RDF
>>      sources. We need to be able to talk about particular
>>      statements, about particular curated collections of statements,
>>      and about the people and organizations behind those statements
>>      and databases.
>>
>> == (1) Some Reasons Data Is Imperfect ==
>>
>> There are many reasons why information found in RDF might not be
>> trustworthy. In many cases it is still useful and may be the best
>> information available. For simplicity, the reasons are here applied
>> first to the classic example problem of selecting a seafood
>> restaurant. The reasons have much wider applicability, however, and
>> more application domains are explored in section 2.
>>
>>   DECEPTION: Alice is trying to find the best local seafood
>>   restaurant using reviews posted by various earlier patrons. One
>>   restaurant, Mal's Mollusks, attempts to trick her by posting many
>>   positive reviews using fake identities.
>>
>>   ERROR: Errol tries to post a glowing review of his favorite
>>   restaurant, Mel's Mellon Soups, but accidentally files it under
>>   Mal's. Alice might be led down the wrong path (to eating at
>>   Mal's) by Errol's mistake.
>>
>>   SIMPLIFICATION: Simon makes a point of trying a new restaurant
>>   every day, but doesn't like to keep detailed records. After a
>>   while, he comes to the opinion that all the seafood restaurants in
>>   town are really quite good. One day, while visiting a restaurant
>>   review site, he quickly rates them all as such, without bothering
>>   to notice that he's never even tried Mal's. (He wouldn't consider
>>   this a mistake; for his purposes, this was good enough data.)
>>
>>   TIME LAG: Mal is actually Mal Jr, having taken over the restaurant
>>   from his father, Mal Sr.
>>   Mal Sr ran a great restaurant (the
>>   finest squid dumplings in Texas), but it's gone steeply downhill
>>   since Mal Jr took over. Some of the reviews from the old days
>>   still rightly glow about Mal Sr's restaurant.
>>
>>   SUBJECTIVITY: Some people actually like Mal Jr's cooking. There's no
>>   accounting for taste, but perhaps the other things these people
>>   like, if Alice knew about them, could give her some clue to
>>   disregard their high opinion of Mal's.
>>
>> This list of five reasons is not meant to be exhaustive; it's just all
>> I could think of today.
>>
>> == (2) Some Other Problem Domains ==
>>
>> Trust reasoning comes up in many other problem domains, of course.
>> Here are two more example domains to show how the need for trust
>> reasoning applies beyond selecting reviews of potential partners in
>> commercial transactions.
>>
>>
>>   Science
>>
>>     When one researcher (Alice) is considering building on the work
>>     reported by another researcher (Robbie), similar trust issues
>>     arise. Here, the consequences can be quite serious.
>>
>>     DECEPTION: Did Robbie falsify results, in order to publish?
>>
>>     ERROR: Did Robbie (or one of his assistants) make an honest but
>>     undetected mistake?
>>
>>     SIMPLIFICATION: This may be the hardest to avoid: what
>>     simplifying assumptions did Robbie make? They may be common in
>>     the field, but perhaps Alice is in a different sub-field, or a
>>     different part of the world, or a different time, when the
>>     assumptions are different.
>>
>>     TIME LAG: Perhaps Robbie publishes environmental sample data from
>>     his city on a monthly basis. For studying a larger picture,
>>     Alice may need to know exactly when the samples were taken and
>>     how recent the "current" ones are.
>>
>>     SUBJECTIVITY: Robbie's work with human subjects was approved by
>>     his university's research ethics board; perhaps their standards
>>     are different from those Alice wants to endorse by building on
>>     them.
>>     Or: Robbie's assistants had to use judgment to classify
>>     some results; another set of assistants might have classified
>>     them differently.
>>
>>   An Employee Directory
>>
>>     A large company, formed largely by acquiring smaller companies,
>>     maintains an on-line directory of office locations, phone
>>     numbers, email addresses, job titles, etc, for its millions of
>>     employees across 12 continents, on nine planets :-). Alice is
>>     trying to use it to find Bob's address, so she can mail him the
>>     hat he left at a meeting at her site.
>>
>>     DECEPTION: Mallory is engaged in corporate espionage and has
>>     altered the directory for this week so Bob's mail actually goes
>>     to Mallory's office; he's waiting for some key blueprints to be
>>     delivered, then he'll change the address back, probably before
>>     Bob notices. He'll be surprised by the hat.
>>
>>     ERROR: Charlie, a coder in Bob's division, made an error in his
>>     routine to export that division's phone book upstream; the error
>>     causes truncation of the last character of the building name,
>>     turning Bob's "Building 21" into "Building 2".
>>
>>     SIMPLIFICATION: Bob actually has two different offices in
>>     different buildings, and works from home most of the time. He
>>     had to pick one for the phone book. He'll end up not getting the
>>     hat for an extra week because of this.
>>
>>     TIME LAG: Bob switched offices 6 months ago. It took him 2
>>     months to get around to updating the phone book, and the upstream
>>     data flow only makes it all the way through every six months,
>>     so Alice still sees his old address.
>>
>>     SUBJECTIVITY: Bob's building has several different names and
>>     nicknames it has acquired over the years. Bob and a few others
>>     in his group still call it the "AI Building", so that's what he
>>     put in the phone book. The new kid in the mail room doesn't know
>>     that term, so the package gets returned or delayed.
>>
>> There are other areas, of course, that call for trust reasoning, such
>> as:
>>
>>   - political decision making (voting, donating)
>>
>>   - information used to match employers with employees (hiring, job
>>     search)
>>
>>   - information used in expanding one's social network (connecting with
>>     new colleagues, friends, dating)
>>
>> ... and I'm sure many, many more. If you need information, you need to
>> know if you can trust it.
>>
>> == (3) Solutions ==
>>
>> So what do these examples have in common, and how might we address them
>> with some standard technology?
>>
>> In every case, the data consumer (Alice) obtains some information (the
>> data) and would benefit from having some additional information (the
>> metadata) which would help her to determine whether or how she can
>> safely rely on the data.
>>
>> The metadata might come from the data provider, disclaiming or
>> clarifying it. It might also come from an intermediary or aggregator,
>> saying how and where they got it. Or it could come from many
>> different kinds of third parties, like ratings agencies, the public,
>> or the information consumer's social network.
>>
>> I see an interesting division in the kinds of metadata:
>>
>>   1. is the data she retrieved trustworthy?
>>   2. is the person/organization who authored that data trustworthy?
>>   3. is the data source (database URL) she retrieved it from trustworthy?
>>   4. is the person/organization who runs the data source trustworthy?
>>
>> These can be quite different. The people can be trustworthy but
>> run a data source full of admittedly low quality data. Or a database
>> of data that's mostly correct can have some bad triples in it.
>>
>> Note that it's possible for metadata to have its own metadata. For
>> instance, statement S1 may be declared untrustworthy by person P1 who
>> is declared untrustworthy by person P2 who is declared trustworthy in
>> a statement available at source U1, etc, etc.
>> Ideally there's a chain
>> of trustworthiness assertions rooted at a known trustworthy source,
>> but I suspect that will rarely be the case. More likely, I expect to
>> see a lot of triples that amount to "+1" and "-1" from one source
>> applied to another. Hopefully there will be more explanation
>> included, and it will be clear whether it's applied to data/content (a
>> g-snap), the database in general, over time (a g-box), or the
>> data author, or the database maintainer (an agent).
>>
>> Well, that's all I have time for right now. Hopefully this will help
>> clarify what some of us are hoping for here. To be clear, I should
>> say I'm not expecting either WG to *solve* these problems, just to give
>> us some building blocks that enable system builders to make some
>> progress on solving them.
>>
>> One more observation: digging into any of these use cases, it's clear
>> to me I can solve that particular one without any standards work beyond
>> settling on the vocabulary for that use case. That is, I can build the
>> provenance vocabulary into the application vocabulary. I think the
>> goal here, however, is to factor that work out, because it's common to
>> so many application areas.
>>
>>     -- Sandro
>>
>> [0] http://www.w3.org/2011/rdf-wg/meeting/2011-09-15
>> [1] http://www.w3.org/2011/rdf-wg/wiki/TF-Graphs-UC
>> [2] http://www.w3.org/2000/Talks/0906-xmlweb-tbl/slide9-0.html
>>     or http://www.w3.org/2002/Talks/04-sweb/slide12-0.html
>>     or http://www.w3.org/2007/03/layerCake.png
>> [3] http://lists.w3.org/Archives/Public/public-rdf-prov/
>>
>>

--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
Received on Thursday, 22 September 2011 19:02:10 UTC