Re: A question on the vocabulary for 'persons'

Chimezie Ogbuji wrote:
> 
> On Thu, 14 Sep 2006, William Bug wrote:
> 
>> Ditto, Kei!!!
>>
>> Of course, at the heart of this - in addition to the very important
>> issue Chimezie introduced re: ACL at the graph node level, if that is
>> practical - is the discussion we've been having regarding URIs - how
>> to create them, broadcast/discover them, and guarantee their uniqueness.
>>
>> The individual tracking issue Kei mentions below is one we've had to
>> deal with on the BIRN project, where different research groups are
>> passing a given subject (or samples from that subject) amongst
>> themselves to perform different sorts of investigation - vital imaging
>> with MRI or fMRI, imaging of dead tissue - the brain - at high rez
>> either with histo-based LM techniques or for some samples EM - also
>> gene expression analysis on matched microdissected tissue punches,
>> ELISA, etc.
>>
>> There is also the very difficult issue of being able to streamline
>> the IRB paperwork across campuses which to some extent depends on
>> being able to "publish" subject/sample level IDs.
> 
> Perhaps I'm misunderstanding the specific needs here, but I wonder if
> authoritative identification of individuals is really an argument for a
> ID-oriented naming convention - such as LSID.
> 
> I have a better understanding (than I did before) of what drives the
> need for LSID's from the ongoing discussions, but it's worth mentioning
> that the mechanics of InverseFunctionalProperties (which FOAF uses) can
> provide a means to identify individuals uniquely.  If you label a
> role/property as being inverse functional then you are saying it can
> only be used on *one* individual.  Explicitly:
> 
> {?P a owl:InverseFunctionalProperty. ?X ?P ?O. ?Y ?P ?O} => {?X
> owl:sameAs ?Y}.
> 
> As long as your vocabulary has such identifying roles/properties (FOAF
> uses the person's email address, but it could be any centrally managed
> system-wide identifier - I'm certain most institutions have this) you
> can very easily enable identity reasoning.

In some ways, FOAF is agnostic about identification strategies. The
vocabulary would still work fine if everyone had LSID-style URIs or
personal URNs of some kind, but it is designed to be usable in the
absence of such
useful consensus. This design was chosen due to the commercial and
political sensitivities around the deployment of unique IDs for people.
It's a crowded and not always fun space to be working in. Lots of
companies and other players want to offer "the" way to identify people
online (and off). So I tried to take FOAF a little bit "meta", to avoid
conflict with any particular person-identification initiative. Also, we
want to be useful for historical (e.g. genealogy, or Wikipedia) data,
where modern web-era IDs aren't always available, relevant or known.

All that said, it can be really quite painful merging data about a
single person when each data source adopts a different
reference-by-description strategy.

Katie Portwin gave a very interesting short talk touching on this issue
last night at an Oxford-SWIG gathering; see the related blog post at
http://allmyeye.blogspot.com/2006/09/more-merging-with-sparql.html
It got me thinking again about the way FOAF allows pluralism w.r.t.
identification strategies (see [1] for an earlier writeup).

In the FOAF world, I can describe myself as "the person whose homepage
is http://danbri.org/"; someone else might mention me as "the person
whose weblog is http://danbri.org/words/"; while a third might annotate
an image and describe it as depicting "the person whose mailbox URI,
when hashed using the SHA1 function, is the string 1234432124". This is
friendly to publishers (who can use whatever IDs they have handy), at
the expense of data consumers (who have to go to some lengths to
normalise the data).
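
As a rough illustration of that consumer-side normalisation, here is a
minimal Python sketch that groups descriptions sharing a value for an
identifying property. The property names are shorthand for FOAF terms,
the set of properties treated as inverse functional is an assumption
made for the example, and mailto:dan@example.org is a made-up address.

```python
import hashlib
from collections import defaultdict

# Assumption: shorthand names for properties we treat as inverse functional.
IFPS = {"homepage", "weblog", "mbox", "mbox_sha1sum"}


def mbox_sha1sum(mbox_uri):
    """Return the SHA1 hex digest of a mailbox URI such as 'mailto:a@b.org'."""
    return hashlib.sha1(mbox_uri.encode("utf-8")).hexdigest()


def merge_descriptions(descriptions):
    """Group descriptions that share any (property, value) pair on an IFP.

    `descriptions` is a list of dicts mapping a property name to a set of
    values. Returns a list of sets of indices into `descriptions`.
    """
    parent = list(range(len(descriptions)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (property, value) -> index of first description using it
    for i, desc in enumerate(descriptions):
        keys = set()
        for prop, values in desc.items():
            if prop not in IFPS:
                continue
            for value in values:
                keys.add((prop, value))
                if prop == "mbox":
                    # A known mailbox also tells us its hash.
                    keys.add(("mbox_sha1sum", mbox_sha1sum(value)))
        for key in keys:
            if key in first_seen:
                parent[find(i)] = find(first_seen[key])
            else:
                first_seen[key] = i

    groups = defaultdict(set)
    for i in range(len(descriptions)):
        groups[find(i)].add(i)
    return sorted(groups.values(), key=min)


descriptions = [
    {"homepage": {"http://danbri.org/"}, "mbox": {"mailto:dan@example.org"}},
    {"weblog": {"http://danbri.org/words/"}, "homepage": {"http://danbri.org/"}},
    {"mbox_sha1sum": {mbox_sha1sum("mailto:dan@example.org")}},
    {"homepage": {"http://example.org/alice"}},
]
groups = merge_descriptions(descriptions)
```

Here the first three descriptions end up in one group and the fourth
stays on its own. Note that a description carrying only the hash can be
linked to one carrying the mailbox, but not the other way round unless
somebody in the chain knows the mailbox itself.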

The heterogeneity is good in some ways, in that it captures the
real-world heterogeneity of person identification strategies, and the
fact that sometimes you just don't know the other ways in which some
individual is identified. But it can also push a lot of cost onto
aggregators / consumers of the data. So I was thinking of an algorithm /
strategy that might share the burden a bit more fairly between
publishers and consumers.

Example: I'm about to publish some metadata about some people: Alice and
Bob. They wrote a journal article together, perhaps.

I know two of Alice's mailboxes, and hence their hashes too. I know her
blog and homepage URLs. I know a Wikipedia article that has her as its
primaryTopic. For Bob, I know his mailbox, his msnChatId, his PGP
fingerprint and a URI he ascribes to himself in his FOAF file.

Which of these do I use in my RDF file describing their publication?

- the verbose answer is: use them all
- the idealistic answer is: use their unique and well-known URIs (or URNs)
- the pragmatic answer I was thinking about yesterday: use
  identification strategies which (a) minimise privacy impact and (b)
  are closest to those we believe are used by the parties themselves.

So: we look at Alice's FOAF file, and see that she mentions her weblog
URI in that file, and not much else. And we look at Bob's, and we see
he's published a URI for himself, and described his PGP fingerprint (and
signed his FOAF file using it), and mentioned his mailbox URI.

For Alice: the weblog URI is the thing to use. For Bob, the URI is a
good identifier, as are the PGP fingerprint and mailbox URI. So
depending on how we weigh verbosity against our concern to make sure our
RDF is widely understood, we could use any or all of those. The fact
that the mailbox is already public makes it less bad to use it in our
own metadata, and the PGP signature can give us some assurance that it
actually *was* Bob that published his mailbox URI.
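
The publisher-side policy could be sketched roughly as follows. This is
only an illustration: the function name, the property labels and
especially the privacy ranking are assumptions invented for the example,
and all of the example.org values are made up.

```python
# Assumption: an illustrative privacy ranking, lower meaning less sensitive.
PRIVACY_COST = {
    "uri": 0,
    "weblog": 1,
    "homepage": 1,
    "pgp_fingerprint": 2,
    "mbox_sha1sum": 3,
    "mbox": 4,
}


def privacy_rank(prop):
    """Sort key: privacy cost first, then name, so the order is stable."""
    return (PRIVACY_COST.get(prop, 99), prop)


def choose_identifiers(known, self_published, max_ids=None):
    """Pick which identifying properties to publish about a person.

    known:          dict of property name -> value that we happen to know
    self_published: set of property names the person uses in their own
                    FOAF file (as far as we can tell)
    max_ids:        optional cap, for publishers who prefer terseness
    """
    # Prefer identifiers the person already uses to describe themselves.
    candidates = [p for p in known if p in self_published]
    if not candidates:
        # Nothing self-published: fall back to the single least
        # privacy-sensitive identifier we know.
        candidates = sorted(known, key=privacy_rank)[:1]
    candidates.sort(key=privacy_rank)
    if max_ids is not None:
        candidates = candidates[:max_ids]
    return {p: known[p] for p in candidates}


alice = choose_identifiers(
    known={
        "mbox": "mailto:alice@example.org",
        "weblog": "http://alice.example.org/blog/",
        "homepage": "http://alice.example.org/",
    },
    self_published={"weblog"},
)

bob = choose_identifiers(
    known={
        "mbox": "mailto:bob@example.org",
        "pgp_fingerprint": "0000 1111 2222 3333",
        "uri": "http://bob.example.org/foaf#me",
        "msn_chat_id": "bob@example.org",
    },
    self_published={"uri", "pgp_fingerprint", "mbox"},
)
```

With these inputs Alice is identified by her weblog alone, and Bob by
his URI, PGP fingerprint and mailbox; passing max_ids=1 for Bob would
keep only the URI.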

None of this is particularly simple, but it's not rocket science either.
The hard bit is bootstrapping: how do we find Alice's FOAF file in the
first place? How do we know it really is hers? etc. And there are no easy
answers there.

I've sketched this as a policy for data publishers; but it could also
serve as a normalisation strategy for harvesters/aggregators who want to
do some pre-computation ahead of (otherwise potentially expensive)
queries.

Thinking out loud...

cheers,

Dan

-- 
http://danbri.org/



[1] http://rdfweb.org/mt/foaflog/archives/2003/07/10/12.05.33/

Received on Thursday, 14 September 2006 20:59:55 UTC