Fwd: Cataloguing Bibliographic Data with Natural Language and RDF

From: William Waites <william.waites@okfn.org>
Date: Mon, 09 Aug 2010 13:24:41 +0100
Message-ID: <4C5FF389.2090809@okfn.org>
To: public-xg-lld <public-xg-lld@w3.org>
A little toy I've been playing with. I can have it join the #lld irc
channel if it would be appropriate...

Cheers,
-w

-------- Original Message --------
Subject: 	Cataloguing Bibliographic Data with Natural Language and RDF
Date: 	Mon, 09 Aug 2010 13:20:23 +0100


In the grand tradition of W3C IRC bots, I've started some speculative
work on a robot that tries to understand natural language descriptions
of works and their authors and generates RDF. It is written in Python
and uses ORDF <http://ordf.org/>, the NLTK <http://www.nltk.org/> and
FuXi <http://code.google.com/p/fuxi>.

Before going into implementation details, here's an example of a session:

12:41 < ww> biblio forget
12:41 < biblio> ww: ok
12:41 < ww> Solzhenitsyn's name is "Aleksander Isayevitch Solzhenitsyn"
12:42 < ww> He was born on December 11th 1918
12:42 < ww> He died on August 3rd 2008
12:42 < ww> He wrote TFC in 1968
12:42 < ww> TFC's title is "The First Circle"
12:42 < ww> "YMCA"'s name is "YMCA Press"
12:42 < ww> They published TFC in 1978
12:42 < ww> biblio think
12:42 < biblio> ww: I learned 25 things in 0:00:00.218296
12:42 < ww> biblio paste
12:42 < biblio> ww: http://pastebin.ca/1913826

The natural language parsing is somewhat simplistic. The kinds of
grammatical constructions it can understand are limited (but growing).
The resolution of pronouns (e.g. he, they) only looks at the previous
named subject, and it will get confused if more than one pronoun in the
same sentence refers to different things. All of these things can be
improved.
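
To illustrate the "previous named subject" behaviour, here is a minimal
sketch in plain Python. It is not the bot's code: the function name, the
pronoun list and the rule that the first token of a sentence is its
subject are all assumptions made for the example.

```python
import re

# Hypothetical pronoun list; the real bot may recognise others.
PRONOUNS = {"he", "she", "it", "they"}


def resolve_pronouns(sentences):
    """Replace a sentence-initial pronoun with the last named subject seen."""
    last_subject = None
    resolved = []
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue
        first = words[0]
        if first.lower() in PRONOUNS and last_subject is not None:
            # Only the most recent subject is considered, nothing cleverer.
            words[0] = last_subject
        else:
            # Assumption: the first token names the subject; drop a trailing
            # possessive 's and any surrounding double quotes.
            last_subject = re.sub(r"'s$", "", first).strip('"')
        resolved.append(" ".join(words))
    return resolved


session = [
    '"YMCA"\'s name is "YMCA Press"',
    "They published TFC in 1978",
]
print(resolve_pronouns(session)[1])
```

Because only one subject is remembered, a sentence whose pronouns refer
to two different things cannot be resolved correctly by this approach,
which matches the limitation described above.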

Broadly, the process runs through the following steps:

    * (NLTK) Tokenise the sentence and classify for parts of speech
    * Create references for named entities (capitalised words, URIs and
      phrases enclosed in double quotes)
    * (NLTK) Create a lexicon, the part of a grammar that grounds it to
      individual words, and append it to the canned grammar that
      describes the structure of sentences. This is a feature grammar,
      not a context-free grammar
    * (NLTK) Parse the input sentences, creating a syntax tree with the
      root at the main verb in the sentence
    * The syntax tree is annotated with the logical structure of the
      sentence (see Analysing the meaning of sentences
      <http://nltk.googlecode.com/svn/trunk/doc/book/ch10.html>). This
      logical representation is cunningly constructed so as to also be
      runnable Python code (with eval
      <http://docs.python.org/library/functions.html#eval>). Running it
      transforms the syntax tree into an RDF representation.
    * (FuXi) The "biblio think" command causes the RDF of the current
      session to be run through a number of inference rules that encode
      higher-level meaning. For example, if "X wrote Y" then X must be a
      person, Y must be a work and X must have contributed to Y.
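
The named-entity step in the list above (capitalised words, URIs and
phrases in double quotes) can be approximated with a single regular
expression. This is only a sketch using the standard library, not what
the bot does with the NLTK; the pattern, the group names and the
restriction to http/https URIs are assumptions.

```python
import re

ENTITY_PATTERN = re.compile(
    r'"(?P<quoted>[^"]+)"'            # phrase enclosed in double quotes
    r"|(?P<uri>https?://\S+)"         # URI (only http/https in this sketch)
    r"|(?P<cap>\b[A-Z][A-Za-z]*\b)"   # capitalised word
)


def extract_entities(sentence):
    """Return (kind, text) pairs for each candidate named entity, in order."""
    entities = []
    for match in ENTITY_PATTERN.finditer(sentence):
        kind = match.lastgroup  # name of the alternative that matched
        entities.append((kind, match.group(kind)))
    return entities


print(extract_entities('TFC\'s title is "The First Circle"'))
```

A pattern this naive also picks up sentence-initial pronouns such as
"He" and month names such as "December", so a real implementation needs
the part-of-speech information from the earlier step to filter them.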

The neat bit is really the way it generates RDF: it evaluates a logical
structure that looks like

  statement(
    predicate(
      bnode(
        rdf_type(umbel("Verb")),
        label("is"),
        racine("be"),
        tense(nlp("Present"))
      ),
      named("aHLIkuXm14335") # "The First Circle"
    ),
    posessive(
      bnode(
        rdf_type(umbel("Noun")),
        label("title"),
        racine("title")
      ),
      named("aHLIkuXm14333") # "TFC"
    )
  )

and the constituent parts bubble up and return an RDF Graph that looks
like this:

 entity:aHLIkuXm14333 a nlp:NamedEntity;
     rdfs:label "TFC". 

 entity:aHLIkuXm14335 a nlp:NamedEntity;
     rdfs:label "The First Circle". 

 [ a umbel:Verb;
     rdfs:label "is";
     lvo:nearlySameAs lve:be;
     nlp:directObject entity:aHLIkuXm14335;
     nlp:subject [ a umbel:Noun;
                   rdfs:label "title";
                   lvo:nearlySameAs lve:title;
                   nlp:owner entity:aHLIkuXm14333];
     nlp:tense nlp:Present].
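
To show how a logical form can double as runnable Python, here is a
self-contained sketch that evaluates the structure above and collects
plain (subject, predicate, object) tuples instead of a real RDF graph.
The function bodies are guesses reconstructed from the example output,
not the bot's implementation. The blank node labels and the `_link`
helper are invented for illustration, and the rdfs:label triples on the
named entities are left out.

```python
import itertools

_ids = itertools.count()  # source of fresh blank node labels


def bnode(*pairs):
    """Mint a blank node and attach each (predicate, object) pair to it."""
    node = "_:b%d" % next(_ids)
    return node, [(node, p, o) for p, o in pairs]


def named(ident):
    """Reference an existing named entity; adds no triples of its own."""
    return "entity:" + ident, []


def _link(prop):
    """Build a combinator that merges two sub-results and links their nodes."""
    def combine(head, tail):
        (h, h_triples), (t, t_triples) = head, tail
        return h, h_triples + t_triples + [(h, prop, t)]
    return combine


VOCABULARY = {
    "bnode": bnode,
    "named": named,
    "umbel": lambda name: "umbel:" + name,
    "nlp": lambda name: "nlp:" + name,
    "rdf_type": lambda cls: ("rdf:type", cls),
    "label": lambda text: ("rdfs:label", text),
    "racine": lambda root: ("lvo:nearlySameAs", "lve:" + root),
    "tense": lambda value: ("nlp:tense", value),
    "predicate": _link("nlp:directObject"),  # verb -> object
    "posessive": _link("nlp:owner"),         # noun -> owner (original spelling)
    "statement": _link("nlp:subject"),       # verb -> subject
}

LOGICAL_FORM = """statement(
    predicate(
        bnode(rdf_type(umbel("Verb")), label("is"), racine("be"),
              tense(nlp("Present"))),
        named("aHLIkuXm14335")
    ),
    posessive(
        bnode(rdf_type(umbel("Noun")), label("title"), racine("title")),
        named("aHLIkuXm14333")
    )
)"""

# Evaluate with only the vocabulary visible, so each inner call returns its
# node plus triples and the outer calls merge them on the way back up.
root, triples = eval(LOGICAL_FORM, {"__builtins__": {}}, VOCABULARY)
for triple in triples:
    print(triple)
```

Each call returns the node it describes together with the triples
gathered so far, which is one way to get the "bubbling up" effect. Note
that eval on text from an open IRC channel is only safe as long as the
evaluated string is generated by the parser and never taken directly
from user input.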


This sort of structure is the basis for the reasoning step.
Provenance information, using OPMV
<http://open-biomed.sourceforge.net/opmv/ns.html>, is also kept, pointing
back to the original IRC message that was parsed, so the entire process
should be repeatable.
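
As a rough idea of what the reasoning step does with this structure,
the "X wrote Y" rule described earlier can be written as one pass over
a set of triples. This is a plain Python stand-in, not a FuXi rule. The
ex:Person, ex:Work and ex:contributedTo names are hypothetical
placeholders, and lve:write is an assumed identifier for the verb
"write", following the lve:be pattern above.

```python
def infer_authorship(triples):
    """Apply one rule: a 'write' verb node with subject X and object Y."""
    facts = set(triples)
    inferred = set()
    # Find verb nodes whose root form is "write".
    verbs = {s for (s, p, o) in facts
             if p == "lvo:nearlySameAs" and o == "lve:write"}
    for verb in verbs:
        subjects = [o for (s, p, o) in facts
                    if s == verb and p == "nlp:subject"]
        objects = [o for (s, p, o) in facts
                   if s == verb and p == "nlp:directObject"]
        for x in subjects:
            for y in objects:
                inferred.add((x, "rdf:type", "ex:Person"))  # hypothetical class
                inferred.add((y, "rdf:type", "ex:Work"))    # hypothetical class
                inferred.add((x, "ex:contributedTo", y))    # hypothetical property
    # Return only the statements that were not already known.
    return inferred - facts


parsed = [
    ("_:v", "lvo:nearlySameAs", "lve:write"),
    ("_:v", "nlp:subject", "entity:Solzhenitsyn"),
    ("_:v", "nlp:directObject", "entity:TFC"),
]
for triple in sorted(infer_authorship(parsed)):
    print(triple)
```

A real rule engine would keep applying rules until nothing new is
derived; a single pass is enough here because the one rule does not
feed back into itself.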

I suppose since IRC is not necessarily the most accessible of media --
though I can't really see why -- the same engine could be easily glued
to a web server with a simple chat-like interface. Perhaps this is
easier or more natural than web forms. Perhaps not. More research is needed.

In any case, I'm working on improving the natural language parsing and
the inference rules as time permits so hopefully the robot will become
more and more clever.

Source code for the IRC bot is available at: http://bitbucket.org/ww/sembot

You can play with a live version of the bot by connecting to
irc://irc.oftc.net/ and either joining #okfn or engaging in a private
chat with /biblio/. It understands the command "sembot help", and I'll
try not to break it too badly while anyone's playing with it.
Received on Monday, 9 August 2010 12:26:12 GMT