
Re: Medline RDF -- Citations

From: William Waites <ww@styx.org>
Date: Sun, 8 May 2011 12:33:11 +0200
To: david.shotton@zoo.ox.ac.uk, public-xg-lld <public-xg-lld@w3.org>
Cc: Peter Murray-Rust <pm286@cam.ac.uk>, Ben O'Steen <bosteen@gmail.com>, mark@odaesa.com
Message-ID: <20110508103311.GD31006@styx.org>
Hello David and fellow LLD XG members,

I'm copying you now because this is straying into your citation work.
So I've been improving the Medline RDF data that we now have, and have
seen that there is some quite rich citation information in there. So
my first idea was to try to use CITO to represent it. CITO sounds like
BIBO and I like BIBO so I wanted to use them together. And immediately
I ran into a problem that, unless I have thoroughly misunderstood
something, which is quite possible, is quite deep.

I've also copied in the LLD XG, not only because it's relevant but
because I suspect that there is a lot of knowledge there that can help
model this properly.
 
Take, for example, this XML fragment,

  <CommentsCorrections RefType="Cites">
  <RefSource>Psychosom Med. 2008 Jun;70(5):539-45</RefSource>
  <PMID>18519880</PMID>
  </CommentsCorrections>

I can easily turn this into,

  pubmed:foo cito:cites pubmed:18519880

so far so good, but already I notice that I have no place to hang the
text of the RefSource.
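
A minimal sketch of that mapping, assuming Python's standard
xml.etree.ElementTree module. The helper name cites_triple is
hypothetical, and pubmed:foo stands in for the citing article as in the
triple above. Note that the RefSource text is simply dropped, which is
the gap just described.

```python
import xml.etree.ElementTree as ET

# The fragment from the message, used as sample input.
FRAGMENT = """
<CommentsCorrections RefType="Cites">
  <RefSource>Psychosom Med. 2008 Jun;70(5):539-45</RefSource>
  <PMID>18519880</PMID>
</CommentsCorrections>
"""


def cites_triple(citing, xml_text):
    """Return a cito:cites triple as text, or None if the fragment does not fit.

    Hypothetical helper: only RefType="Cites" entries that carry a PMID
    can be expressed as a plain predicate between two articles.
    """
    node = ET.fromstring(xml_text)
    pmid = node.findtext("PMID")
    if node.get("RefType") != "Cites" or pmid is None:
        return None
    # The RefSource text has no home in this one-triple model.
    return f"{citing} cito:cites pubmed:{pmid.strip()} ."


print(cites_triple("pubmed:foo", FRAGMENT))
```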

So take this next one,

  <CommentsCorrections RefType="ErratumIn">
  <RefSource>J Infect Dis 1998 Aug;178(2):601</RefSource>
  <Note>Whitely RJ [corrected to Whitley RJ]</Note>
  </CommentsCorrections>

Here we have a kind of citation, although maybe it is stretching it to
call an erratum a citation. Firstly, we have no predicate in CITO to
express this. Secondly, there is no obvious place to hang the text of
the source (I could, and probably would, use a blank node). Most
importantly, the Note, which can appear in any citation/comment, has
nowhere to go either.

I could give more examples, but what I'm getting at is that modelling
citations as predicates is problematic because we actually have an
infinite variety of citations with different shades of meaning, and we
don't want that to mean an infinite variety of subtly different 
predicates (theoretically it is possible and coherent to do this but
practically it is not). When this happens it generally means that one
wants to move the modelling to classes.

So I might write something like,

  [
    a cito:Citation;
    cito:citedBy pubmed:foo;
    cito:cites pubmed:18519880;
    dc:bibliographicCitation "Psychosom Med. 2008 Jun;70(5):539-45";
    dc:description "some notes about the citation"
  ].

Doing it this way means that you can refine the citation in a way that
has formal semantics by refining the rdf type, and you can refine it
informally by adding other descriptive statements to the citation
instance.
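
As a rough sketch of how such a node could be produced from any
CommentsCorrections entry, again assuming Python's standard
xml.etree.ElementTree. The function names and the rdf_type parameter are
hypothetical; the properties are the ones used in the example above, and
whether cito:Citation can really be used this way is exactly the open
question.

```python
import xml.etree.ElementTree as ET

# The erratum fragment from the message, used as sample input.
ERRATUM = """
<CommentsCorrections RefType="ErratumIn">
  <RefSource>J Infect Dis 1998 Aug;178(2):601</RefSource>
  <Note>Whitely RJ [corrected to Whitley RJ]</Note>
</CommentsCorrections>
"""


def literal(text):
    """Quote a string as a Turtle literal, escaping backslashes, quotes, newlines."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def citation_node(citing, xml_text, rdf_type="cito:Citation"):
    """Build a blank-node description of one citation as Turtle text.

    rdf_type is where a more specific class could be substituted to
    refine the kind of citation (an assumption, not an existing term).
    """
    node = ET.fromstring(xml_text)
    lines = [f"a {rdf_type}", f"cito:citedBy {citing}"]
    pmid = node.findtext("PMID")
    if pmid:  # the target is only known when a PMID is present
        lines.append(f"cito:cites pubmed:{pmid.strip()}")
    source = node.findtext("RefSource")
    if source:
        lines.append(f"dc:bibliographicCitation {literal(source.strip())}")
    note = node.findtext("Note")
    if note:
        lines.append(f"dc:description {literal(note.strip())}")
    return "[\n    " + ";\n    ".join(lines) + "\n] ."


print(citation_node("pubmed:foo", ERRATUM))
```

Both the RefSource text and the Note now have somewhere to live, even
when there is no PMID to point at.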

One problem is that there is an equivocation on what a citation is.
SWAN, which CITO is based upon, thinks that a citation is a uniquely
identifiable reference to a work/book/whatever. It actually says,

  "Information which fully identifies a publication. A complete
  citation usually includes author, title, name of journal (if the
  citation is to an article) or publisher (if to a book), and
  date. Often pages, volumes and other information will be included in a
  citation."

Well no, that's not really right. A citation is a *reference* to a
publication, which obviously is best done with some kind of
information to identify it, maybe even a URI but it is not that
information. The nature of the reference is part of the citation but
has nothing to do with the identifying information. The identifying
information is a URI or description and we already have that. The
citation is about the relationship between the things
described. (Sorry for being repetitive here, difficult to express
clearly).

So that's a modelling problem in SWAN, I think. CITO inherits it
because it derives terms from it. Now SWAN is not a lightweight
vocabulary. It's a heavyweight OWL beast. Which means that I'm not
even sure if I can mix CITO and BIBO without entailing
contradictions...

What to do? In the near term we can release this large dataset sans
citations but that would actually be quite a shame... Also have to
check this with the copyright people - is the fact of one paper citing
another something that PubMed would claim to own? But that one's a
distraction for most of the people on this list...

Cheers,
-w
-- 
William Waites                <mailto:ww@styx.org>
http://river.styx.org/ww/        <sip:ww@styx.org>
F4B3 39BF E775 CF42 0BAB  3DF0 BE40 A6DF B06F FD45
Received on Sunday, 8 May 2011 10:33:35 GMT