- From: Dan Brickley <danbri@danbri.org>
- Date: Thu, 14 Sep 2006 22:02:47 +0100
- To: Chimezie Ogbuji <ogbujic@bio.ri.ccf.org>
- Cc: William Bug <William.Bug@drexelmed.edu>, Marco Brandizi <brandizi@ebi.ac.uk>, w3c semweb hcls <public-semweb-lifesci@w3.org>
Chimezie Ogbuji wrote:
>
> On Thu, 14 Sep 2006, William Bug wrote:
>
>> Ditto, Kei!!!
>>
>> Of course, at the heart of this - in addition to the very important
>> issue Chimezie introduced re: ACL at the graph node level, if that is
>> practical - is the discussion we've been having regarding URIs - how
>> to create them, broadcast/discover them, and guarantee their uniqueness.
>>
>> The individual tracking issue Kei mentions below is one we've had to
>> deal with on the BIRN project, where different research groups are
>> passing a given subject (or samples from that subject) amongst
>> themselves to perform different sorts of investigation - vital imaging
>> with MRI or fMRI, imaging of dead tissue - the brain - at high rez
>> either with histo-based LM techniques or for some samples EM - also
>> gene expression analysis on matched microdissected tissue punches,
>> ELIZA, etc.
>>
>> There is also the very difficult issue of being able to stream-line
>> the IRB paperwork across campuses which to some extent depends on
>> being able to "publish" subject/sample level IDs.
>
> Perhaps I'm misunderstanding the specific needs here, but I wonder if
> authoritative identification of individuals is really an argument for an
> ID-oriented naming convention - such as LSID.
>
> I have a better understanding (than I did before) of what drives the
> need for LSID's from the ongoing discussions, but it's worth mentioning
> that the mechanics of InverseFunctionalProperties (which FOAF uses) can
> provide a means to identify individuals uniquely. If you label a
> role/property as being inverse functional then you are saying it can
> only be used on *one* individual. Explicitly:
>
> {?P a owl:InverseFunctionalProperty. ?X ?P ?O. ?Y ?P ?O} => {?X owl:sameAs ?Y}.
>
> As long as your vocabulary has such identifying roles/properties (FOAF
> uses the person's email address, but it could be any centrally managed
> system-wide identifier - I'm certain most institutions have this) you
> can very easily enable identity reasoning

In some ways, FOAF is agnostic about identification strategies. The vocabulary would still work fine if everyone had LSID-style URIs or personal URNs of some kind, but it is designed to be usable in the absence of such useful consensus. This design was chosen due to the commercial and political sensitivities around the deployment of unique IDs for people. It's a crowded and not always fun space to be working in. Lots of companies and other players want to offer "the" way to identify people online (and off). So I tried to take FOAF a little bit "meta", to avoid conflict with any particular person-identification initiative. Also we want to be useful for historical (eg. genealogy, or wikipedia) data, where modern web-era IDs aren't always available, relevant or known.

All that said, it can be really quite painful merging data about a single person when each data source adopts a different reference-by-description strategy. Katie Portwin gave a very interesting short talk touching on this issue last night at an Oxford-SWIG gathering; see related blog post

http://allmyeye.blogspot.com/2006/09/more-merging-with-sparql.html

...which got me thinking again about the way FOAF allows pluralism w.r.t. identification strategies (see [1] for an earlier writeup).

In the FOAF world, I can describe myself as "the person whose homepage is http://danbri.org/"; someone else might mention me as "the person whose weblog is http://danbri.org/words/", ...while a third might annotate an image and describe it as depicting "the person whose mailbox URI when hashed using the sha1 function is the string 1234432124".
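
The inverse-functional-property rule quoted above, applied to descriptions like these, can be sketched in a few lines of Python. This is only a minimal illustration, not part of FOAF or of any library: the function names `smush` and `mbox_sha1sum`, the node labels `_:a` to `_:e` and the `example.org` mailbox are all assumptions made for the example. It treats foaf:homepage, foaf:weblog, foaf:mbox and foaf:mbox_sha1sum as inverse functional and merges any subjects that share a value for one of them.

```python
import hashlib
from collections import defaultdict

FOAF = "http://xmlns.com/foaf/0.1/"
# FOAF properties treated as inverse functional in this sketch.
IFPS = {FOAF + n for n in ("homepage", "weblog", "mbox", "mbox_sha1sum")}


def mbox_sha1sum(mailbox_uri):
    """Hex SHA-1 of the full mailbox URI, 'mailto:' prefix included."""
    return hashlib.sha1(mailbox_uri.encode("utf-8")).hexdigest()


def smush(triples):
    """Group subjects that share a value for an inverse functional property."""
    triples = list(triples)
    # Derive the hashed form so mbox and mbox_sha1sum descriptions can meet.
    triples += [(s, FOAF + "mbox_sha1sum", mbox_sha1sum(o))
                for s, p, o in triples if p == FOAF + "mbox"]

    parent = {}  # union-find forest over subject nodes

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    seen = {}  # (property, value) -> first subject seen with that pair
    for s, p, o in triples:
        find(s)  # register every subject, even ones with no IFP
        if p in IFPS:
            other = seen.setdefault((p, o), s)
            parent[find(s)] = find(other)  # same IFP value => same individual

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(groups.values(), key=lambda g: sorted(g))


# Hypothetical sample data: three single-property descriptions plus one
# source (_:d) that happens to mention all three identifiers.
data = [
    ("_:a", FOAF + "homepage", "http://danbri.org/"),
    ("_:b", FOAF + "weblog", "http://danbri.org/words/"),
    ("_:c", FOAF + "mbox_sha1sum", mbox_sha1sum("mailto:someone@example.org")),
    ("_:d", FOAF + "homepage", "http://danbri.org/"),
    ("_:d", FOAF + "weblog", "http://danbri.org/words/"),
    ("_:d", FOAF + "mbox", "mailto:someone@example.org"),
    ("_:e", FOAF + "homepage", "http://example.org/alice/"),
]

for group in smush(data):
    print(sorted(group))
```

In this sample the three single-property descriptions only merge because a fourth source mentions all three identifiers; without such a bridging description they would stay separate.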
This pluralism is friendly to publishers (who can use whatever IDs they have handy), at the expense of data consumers (who have to go to some lengths to normalise the data). The heterogeneity is good in some ways, in that it captures the real-world heterogeneity of person identification strategies, and the fact that sometimes you just don't know the other ways in which some individual is identified. But it can also push a lot of cost onto aggregators / consumers of the data.

So I was thinking of an algorithm / strategy that might share the burden a bit more fairly between publishers and consumers.

Example: I'm about to publish some metadata about some people; Alice and Bob. They wrote a journal article together, perhaps. I know 2 of Alice's mailboxes, and hence their hashes too. I know her blog and homepage urls. I know a Wikipedia article that has her as its primaryTopic. For Bob, I know his mailbox, his msnChatId, his PGP fingerprint and a URI he ascribes to himself in his FOAF file.

Which of these do I use in my RDF file describing their publication?

 - the verbose answer is: use them all
 - the idealistic answer is: use their unique and well-known URIs (or URNs).
 - the pragmatic answer I was thinking about yesterday: use identification
   strategies which (a) minimise privacy impact (b) are closest to those we
   believe are used by the parties themselves.

So: we look at Alice's FOAF file, and see that she mentions her weblog URI in that file, and not much else. And we look at Bob's, and we see he's published a URI for himself, and described his PGP fingerprint (and signed his FOAF file using it), and mentioned his mailbox URI.

For Alice: the weblog URI is the thing to use. For Bob, ... the URI is a good identifier, as are the PGP fingerprint and mailbox URI. So depending on how we weigh verbosity against our concern to make sure our RDF is widely understood, we could use any or all of those.
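
A rough sketch of the pragmatic policy described above, assuming the publisher already knows which kinds of identifier each person uses in their own FOAF file. Everything named here is hypothetical: the `choose_identifiers` function, the `PRIVACY_COST` ranking, the `example.org` values, and in particular the fallback for people who publish nothing about themselves, which the policy as stated does not cover.

```python
# Hypothetical privacy ranking: lower means less sensitive to republish.
PRIVACY_COST = {
    "uri": 0,
    "weblog": 1,
    "homepage": 1,
    "pgp_fingerprint": 2,
    "mbox_sha1sum": 3,
    "mbox": 4,
    "msnChatID": 4,
}


def choose_identifiers(known, self_published, max_ids=None):
    """Pick which identifiers to publish for one person.

    known          -- dict of identifier kind -> value that we happen to know
    self_published -- set of kinds the person uses in their own FOAF file
    max_ids        -- optional cap, for publishers who prefer terse output
    """
    def cost(kind):
        return (PRIVACY_COST.get(kind, 99), kind)

    # (b) prefer strategies the person already uses for themselves ...
    candidates = [k for k in known if k in self_published]
    if not candidates:
        # Assumption, not from the policy: with nothing self-published,
        # fall back to the single least sensitive identifier we know.
        candidates = sorted(known, key=cost)[:1]
    # (a) ... ordered so the least privacy-sensitive ones come first.
    ranked = sorted(candidates, key=cost)
    return {k: known[k] for k in ranked[:max_ids]}


# Hypothetical data for the Alice and Bob example.
alice_known = {
    "mbox": "mailto:alice@example.org",
    "weblog": "http://example.org/alice/blog/",
    "homepage": "http://example.org/alice/",
}
alice_published = {"weblog"}

bob_known = {
    "mbox": "mailto:bob@example.org",
    "msnChatID": "bob@example.org",
    "pgp_fingerprint": "0000 1111 2222 3333 4444",
    "uri": "http://example.org/bob/foaf.rdf#me",
}
bob_published = {"uri", "pgp_fingerprint", "mbox"}

print(choose_identifiers(alice_known, alice_published))
print(choose_identifiers(bob_known, bob_published))
print(choose_identifiers(bob_known, bob_published, max_ids=1))
```

The `max_ids` cap stands in for the verbosity trade-off: leaving it unset publishes every identifier the person already uses, while a small cap keeps only the least sensitive ones.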
The fact that Bob's mailbox is already public makes it less bad to use it in our own metadata, ... and the PGP signature can give us some assurance that it actually *was* Bob that published his mailbox URI.

None of this is particularly simple, but it's not rocket science either. The hard bit is bootstrapping: how do we find Alice's FOAF file in the first place? How do we know it really is hers? etc. And there are no easy answers there.

I've sketched this as a policy for data publishers; but it could also serve as a normalisation strategy for harvesters/aggregators who want to do some pre-computation ahead of (otherwise potentially expensive) queries.

Thinking out loud...

cheers,

Dan

--
http://danbri.org/

[1] http://rdfweb.org/mt/foaflog/archives/2003/07/10/12.05.33/
Received on Thursday, 14 September 2006 20:59:55 UTC