- From: Sandro Hawke <sandro@w3.org>
- Date: Tue, 20 Sep 2011 21:54:37 -0700
- To: public-rdf-prov@w3.org
- Cc: public-rdf-wg <public-rdf-wg@w3.org>, Provenance WG <public-prov-wg@w3.org>
[Please reply to public-rdf-prov@w3.org, not either WG list. If
you're interested in seeing replies, please subscribe to that list or
read its archives [3].]
During the joint RDF/Provenance task force call last week [0], I agreed
to draft a single, concrete use case for this work. At the time, I had
forgotten about the Graphs Use Cases page [1], and no one mentioned
it. So I spent some time thinking about it, and talking to Eric
Prud'hommeaux. I haven't yet gone through [1] to determine how each of
those relates to this analysis, and I'm headed into a meeting that will
probably stop me returning to this for a while. So I'm going to just
send this now.
It seems to me the driving use case here is the desire to glean usable
information from imperfect sources. Reduced to a single word, the use
case is Trust. In TimBL's now-ancient Layer Cake vision for the
Semantic Web, the top layer is "Web of Trust" or just "Trust" [2]. How
can people act based on information they find, when that information
might not be right? How can the system itself help us know what to
trust? Is it possible to make parts of a system more trustworthy than
the elements on which they rely? (I think Google has convinced
everyone that it can be done; can it be done in an open/interoperable way?)
Here's my minimal concrete use case:
Alice wants to find a good, local seafood restaurant. She has many
ways to find restaurant reviews in RDF -- some embedded in people's
blogs, some exported from sites which help people author reviews,
some exported from sites which extract and aggregate reviews from
other sites -- and she'd like to know which sources she can trust.
Actually, she'd like the computer to do that for her, and just
analyze the trustworthy data. Is there a way the Web can convey
metadata about those reviews that lets software assess the relative
reliability of the different sources?
That's the short version. For the rest of this message, I'm going to:
1. Explore reasons the data might not be trustworthy. Trust isn't
just about lies; it's about all the reasons data might be
imperfect.
2. Explore other application domains, showing how the same issues
arise. This isn't just about seafood restaurants, of course,
or even just about consumers making choices. It's also about
medical research, political processes, corporate IT, etc.
3. Offer a few thoughts about solutions. It's what you'd probably
expect; we need a way in RDF to convey the kind of information
needed to determine the trustworthiness of other RDF
sources. We need to be able to talk about particular
statements, about particular curated collections of statements,
and about the people and organizations behind those statements
and databases.
== (1) Some Reasons Data Is Imperfect ==
There are many reasons why information found in RDF might not be
trustworthy. In many cases it is still useful and may be the best
information available. For simplicity, the reasons are here applied
first to the classic example problem of selecting a seafood
restaurant. The reasons have much wider applicability, however, and
more application domains are explored in section 2.
DECEPTION: Alice is trying to find the best local seafood
restaurant using reviews posted by various earlier patrons. One
restaurant, Mal's Mollusks, attempts to trick her by posting many
positive reviews using fake identities.
ERROR: Errol tries to post a glowing review of his favorite
restaurant, Mel's Mellon Soups, but accidentally files it under
Mal's. Alice might be led down the wrong path (to eating at
Mal's) by Errol's mistake.
SIMPLIFICATION: Simon makes a point of trying a new restaurant
every day, but doesn't like to keep detailed records. After a
while, he comes to the opinion that all the seafood restaurants in
town are really quite good. One day, while visiting a restaurant
review site, he quickly rates them all as such, without bothering
to notice that he's never even tried Mal's. (He wouldn't consider
this a mistake; for his purposes, this was good enough data.)
TIME LAG: Mal is actually Mal Jr, having taken over the restaurant
from his father, Mal Sr. Mal Sr ran a great restaurant (the
finest squid dumplings in Texas), but it's gone steeply downhill
since Mal Jr took over. Some of the reviews from the old days
still rightly glow about Mal Sr's restaurant.
SUBJECTIVITY: Some people actually like Mal Jr's cooking. There's no
accounting for taste, but perhaps the other things these people
like, if Alice knew about them, could give her some clue to
disregard their high opinion of Mal's.
This list of five reasons is not meant to be exhaustive; it's just all
I could think of today.
== (2) Some Other Problem Domains ==
Trust reasoning comes up in many other problem domains, of course.
Here are two more example domains to show how the need for trust
reasoning applies beyond selecting reviews of potential partners in
commercial transactions.
Science
When one researcher (Alice) is considering building on the work
reported by another researcher (Robbie), similar trust issues
arise. Here, the consequences can be quite serious.
DECEPTION: Did Robbie falsify results, in order to publish?
ERROR: Did Robbie (or one of his assistants) make an honest but
undetected mistake?
SIMPLIFICATION: This may be the hardest to avoid: what
simplifying assumptions did Robbie make? They may be common in
the field, but perhaps Alice is in a different sub-field, or a
different part of the world, or a different time, when the
assumptions are different.
TIME LAG: Perhaps Robbie publishes environmental sample data from
his city on a monthly basis. For studying a larger picture,
Alice may need to know exactly when the samples were taken and
how recent the "current" ones are.
SUBJECTIVITY: Robbie's work with human subjects was approved by
his university's research ethics board, but perhaps their standards
are different from those Alice wants to endorse by building on
them. Or: Robbie's assistants had to use judgment to classify
some results; another set of assistants might have classified
them differently.
An Employee Directory
A large company, formed largely by acquiring smaller companies,
maintains an on-line directory of office locations, phone
numbers, email addresses, job titles, etc, for its millions of
employees across 12 continents, on nine planets :-). Alice is
trying to use it to find Bob's address, so she can mail him the
hat he left at a meeting at her site.
DECEPTION: Mallory is engaged in corporate espionage and has
altered the directory for this week so Bob's mail actually goes
to Mallory's office; he's waiting for some key blueprints to be
delivered, then he'll change the address back, probably before
Bob notices. He'll be surprised by the hat.
ERROR: Charlie, a coder in Bob's division, made an error in his
routine to export that division's phone book upstream; the error
causes truncation of the last character of the building name,
turning Bob's "Building 21" into "Building 2".
SIMPLIFICATION: Bob actually has two different offices in
different buildings, and works from home most of the time. He
had to pick one for the phone book. He'll end up not getting the
hat for an extra week because of this.
TIME LAG: Bob switched offices 6 months ago. It took him 2
months to get around to updating the phone book, and the upstream
data flow only makes it all the way through every six months,
so Alice still sees his old address.
SUBJECTIVITY: Bob's building has several different names and
nicknames it has acquired over the years. Bob, and a few others
in his group, still call it the "AI Building", so that's what he
put in the phone book. The new kid in the mail room doesn't know
that term, so the package gets returned or delayed.
There are other areas, of course, that call for trust reasoning, such
as:
- political decision making (voting, donating)
- information used to match employers with employees (hiring, job
search)
- information used in expanding one's social network (connecting with
new colleagues, friends, dating)
... and I'm sure many, many more. If you need information, you need to
know if you can trust it.
== (3) Solutions ==
So what do these examples have in common, and how might we address them
with some standard technology?
In every case, the data consumer (Alice) obtains some information (the
data) and would benefit from having some additional information (the
metadata) which would help her to determine whether or how she can
safely rely on the data.
The metadata might come from the data provider, disclaiming or
clarifying it. It might also come from an intermediary or aggregator,
saying how and where they got it. Or it could come from many
different kinds of third parties, like ratings agencies, the public,
or the information consumer's social network.
I see an interesting division in the kinds of metadata:
1. is the data she retrieved trustworthy?
2. is the person/organization who authored that data trustworthy?
3. is the data source (database URL) she retrieved it from trustworthy?
4. is the person/organization who runs the data source trustworthy?
These can be quite different. The people can be trustworthy but
run a data source full of admittedly low quality data. Or a database
of data that's mostly correct can have some bad triples in it.
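To make that four-way division concrete, here is a rough sketch in TriG.
Every term in the ex: namespace, and every example URL, is invented purely
for illustration; nothing here is a proposal for actual vocabulary:

```trig
@prefix ex: <http://example.org/trust#> .
@prefix dc: <http://purl.org/dc/terms/> .

# The review data Alice retrieved, held in a named graph so it
# can be the subject of metadata triples.
<http://reviews.example/mals-reviews> {
    <http://reviews.example/review/17>
        ex:rates <http://restaurants.example/mals-mollusks> ;
        ex:score 5 ;
        dc:creator <http://people.example/errol> .
}

# Metadata at each of the four levels:
<http://reviews.example/mals-reviews>
    ex:dataTrust "low" .                                  # 1: the retrieved data
<http://people.example/errol>
    ex:authorTrust "medium" .                             # 2: the data's author
<http://reviews.example/>
    ex:sourceTrust "high" ;                               # 3: the data source
    ex:runBy <http://orgs.example/reviewcorp> .
<http://orgs.example/reviewcorp>
    ex:orgTrust "high" .                                  # 4: the source's operator
```

As the sketch shows, the four judgments are independent: here a
well-run source with a trustworthy operator still serves one graph of
low-quality data from a middling author.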
Note that it's possible for metadata to have its own metadata. For
instance, statement S1 may be declared untrustworthy by person P1 who
is declared untrustworthy by person P2 who is declared trustworthy in
a statement available at source U1, etc, etc. Ideally there's a chain
of trustworthiness assertions rooted at a known trustworthy source,
but I suspect that will rarely be the case. More likely, I expect to
see a lot of triples that amount to "+1" and "-1" from one source
applied to another. Hopefully there will be more explanation
included, and it will be clear whether it's applied to the
data/content (a g-snap), the database in general over time (a
g-box), the data author, or the database maintainer (an agent).
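The chain of assertions described above (S1, P1, P2, U1) might look
something like this in TriG, again with an invented ex: vocabulary,
and with U1 modeled as a named graph so the source of the final
assertion is explicit:

```trig
@prefix ex: <http://example.org/trust#> .

# S1: a statement given its own named graph so it can be talked about.
ex:S1 {
    <http://restaurants.example/mals-mollusks> ex:rating "excellent" .
}

# P1 declares S1 untrustworthy (a "-1"), and P2 declares P1
# untrustworthy in turn.
ex:P1 ex:distrusts ex:S1 .
ex:P2 ex:distrusts ex:P1 .

# A statement available at source U1 vouches for P2 (a "+1").
<http://example.org/U1> {
    ex:P2 a ex:TrustworthyAgent .
}
```

If Alice already trusts whatever is served at U1, software could walk
this chain and conclude that S1 should probably be discounted; without
such a root, the +1/-1 assertions are just more data to be weighed.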
Well, that's all I have time for right now. Hopefully this will help
clarify what some of us are hoping for here. To be clear, I should
say I'm not expecting either WG to *solve* these problems, just to give
us some building blocks that enable system builders to make some
progress on solving them.
One more observation: digging into any of these use cases, it's clear
to me I can solve that particular one without any standards work beyond
settling on the vocabulary for that use case. That is, I can build the
provenance vocabulary into the application vocabulary. I think the
goal here, however, is to factor that work out, because it's common to
so many application areas.
-- Sandro
[0] http://www.w3.org/2011/rdf-wg/meeting/2011-09-15
[1] http://www.w3.org/2011/rdf-wg/wiki/TF-Graphs-UC
[2] http://www.w3.org/2000/Talks/0906-xmlweb-tbl/slide9-0.html
or http://www.w3.org/2002/Talks/04-sweb/slide12-0.html
or http://www.w3.org/2007/03/layerCake.png
[3] http://lists.w3.org/Archives/Public/public-rdf-prov/
Received on Wednesday, 21 September 2011 04:54:50 UTC