MINDSWAP trip report

Trip Report by Dave Beckett on visit to MINDSWAP group,
   University of Maryland, College Park, MD, USA
   2002-09-23 to 2002-09-27

[ This trip was partially funded by SWAD Europe ]

Monday 2002-09-23

  The first working day, I had mostly with Bijan Parsia (hosting me)
  going over the background and the mindswap groups' projects.  The
  group:
    http://www.mindswap.org/
  is based at the MINDlab at the University of Maryland (UMD),
  College Park which is just North of Washington DC.  The group is
  relatively new and last year Professor Jim Hendler started teaching
  the first semantic web classes to the UMD students, several of who
  now work for the group.

  Jim is best known for his work on SHOE (Simple HTML Ontology
  Extensions) - http://www.cs.umd.edu/projects/plus/SHOE/ - but has a
  background in agents, robots and knowledge representation.  He was
  seconded to DARPA for several years to run the Darpa Agent Markup
  Language (DAML) project - http://www.daml.org/ - which later fed
  into the DAML+OIL ontology language, that became the basis of the
  W3C's work.  The latter is being developed by the Web Ontology
  Working Group (WOWG) which Jim co-chairs:
    http://www.w3.org/2001/sw/WebOnt/
  and the language is now called OWL (Web Ontology Language).

  The group now works on several projects related to DAML+OIL/OWL,
  RDF and Parka, more of which later.  They have created several RDF
  and Ontology markup tools for desktop use that allow attaching of
  ontological information (properties, classes) to descriptions,
  creating instance data.  The two main ones are:
    SMORE - Semantic Markup, Ontology and RDF Editor
      http://www.mindswap.org/~aditkal/editor.shtml
    RIC - RDF Instance Creator
      http://www.mindswap.org/~mhgrove/RIC/RIC.shtml
  which are both Java+Swing applications.

  The Parka system - http://www.mindswap.org/2002/parka/ -
  is a knowledge base (in the AI sense) and has been deployed and
  used commercially for several years; but can be considered a large
  triple store, thus suitable for RDF/OWL storage.  Parka has
  recently been released as open source software by Mindswap -
  http://www.mindswap.org/2002/parka/ - but is a little raw at
  present.  They have been talking to me about this while I was
  working on the SWAD Europe storage report:
    http://www.w3.org/2001/sw/Europe/reports/rdf_scalable_storage_report/
  and it looked interesting.  Bijan arranged for one of the original technical
  developers on the project to visit on Wednesday for a runthrough of
  the system.

  The group is also working on DAML-S - http://www.daml.org/services/
  - a DAML-based Web Service Ontology.  This allows description of
  web services that can be tied to WSDL and used with SOAP and so on
  to create web services in a semantic web style.  There is a new
  version 0.7 of the DAML-S ontology due out soon.


Tuesday 2002-09-24

  I was pleased to find out that Jim Hendler had made himself
  available to me all day; which was an unexpected delight.  I gave
  him an outline of SWAD Europe work and the possible areas for
  collaboration we might have.  It had already been noted that their
  work on the storage systems and mine on the report above and with
  my Redland system - http://www.redland.opensource.ac.uk/ - would be
  useful to do collaborate on. Benchmarks and standard datasets for
  testing stores are further possibilities.

  Mindswap also plan to use Redland software as an interface layer
  above Parka for using and abstracting from it, so that they can
  potentially swap out the backend to another datastore for testing,
  benchmarking; rather than build applications with a parka-only
  binding.

  The other main relationship with SWAD Europe work was in querying;
  since Jim et al had worked on DAML query languages and this effort
  was continuing.  The SWAD Europe work on semweb QLs is extensive
  and they are keen for collaboration in developing use cases,
  documenting these things and moving them towards standardising them
  in due course.

  Jim, Bijan and I also discussed the W3C semantic web standards
  efforts and sketched out some solutions to tricky syntax problems
  that were being raised by WebOnt.  It hopefully quickened the
  resolution since I could discuss pro-s and con-s of different
  solutions directly.  The particular topic here was OWL's proposed
  use of elements outside rdf:RDF for doing ontology description.

  After seeing some of the desktop tools, I outlined the work I did
  on the MEG Registry project - http://137.222.34.57:6543/ - and the
  desktop schema creation tool created by Damian Steer that talks to
  it - http://www.ukoln.ac.uk/metadata/education/regproj/
  This is something that I plan to show at the SWAD Europe workshop
  at the Dublin Core conference in Florence in mid-October.

  18:00 CMSC 828y class - AI on the Web
    http://www.cs.umd.edu/users/hendler/CMSC828y/

  Sat in on their class where Jim was going over the OWL guide
  document to introduce them to the ontology language.  He found a
  load of errors, but it was a brand new draft document.
  See also http://www.w3.org/TR/2002/WD-owl-features-20020729/

  The class wiki: http://www.mindswap.org/cgi-bin/webai/moin.cgi


Wednesday 2002-09-25

  09:00 Meeting about Parka with Merwyn Taylor
    http://www.cs.umd.edu/users/mtaylor/
  who worked on Parka but now works at Johns Hopksins University.
  Also present were Bijan Parisa, Ron Alford and Ronald Reck.

  We went through a presentation (PPT slides) on how the system
  worked and discussed the built-in limitations, constraints and
  differences against the RDF triples model.  Parka has no specific
  literal indexing in it; it hashes the entire string content.  It
  seems actually that there is no distringuishing strings from URIs,
  that has to be imposed as a practice on top of the basic core.

  The knowledge base has special in-memory indexing of the subclass,
  subproperty, instancing relations providing quick indexing.  This
  is done via static sized arrays which work fine for typical data
  but would need a recompile to extend them.  Parka was optimised for
  quick response time - hence the fixed arrays and consideration of
  fast access to the disk, done by always operating in fixed sector
  sizes (4k).

  The KB is built on top of an internal relational store which
  provides simple and quick access.  This was developed by students
  (including Merywn) as part of a database class and has been stable
  but there may be a new version if the class took it further.
  Merwyn said that they had tested Parka on Oracle but the overhead of
  using it via a query language (SQL) proved too much.

  We had a discussion of the limits in the parka database code, which
  are somewhat tied to it's frame-based approach on the data model.
  There is a 2.4M limit in the code on the number of distinct frames
  (subject URIs in RDF-speak); note frames are not assertions.  This
  is mostly caused by using an integer for indexing - the 32 bits of
  the ingeger are split up, limiting the range substantially.  The
  code does support "namespaces", although I'm not sure what that
  means or how it helps in this regard.  The tables could also be
  "chunked", pointing to next table at the bottom of full ones.

  Other restrictions include how the library is using 4K pages,
  marked in a fixed array, tracking usage.  This is stored in a 4K
  header page meaning a limit of (under) 32K pages.  The page size
  could be increased, although performance is best when it matches
  the disk page size.  There are also othe rmemory usage per frame to
  consider, such as the structural assertions (isa, instanceof, ..>)
  which could be rather a bloat if the data itself was mostly
  ontological or schema information.  Some of the rdf/s data seen has
  this characteristic, but most is vastly more data than class and
  property relationships.

  Throughout the above discussion, I related the various aspects of
  it to what I had recently read on how TAP -
  http://tap.stanford.edu/ - solves them which addresses a similar
  problem area, but for a specific fixed schema.  It uses MySql below
  as one of the storage mechanisms but also provides several
  in-memory and partially in-memory (mmap-ed) stores.  TAP, like
  parka, provides optimised support for the structural assertions, at
  the RDF-schema level.

  After that discussion, this led onto how a better indexing library
  could solve some of these problems and reduce the complexity of
  parka.  I suggested the backends could be MySQL, directly via it's
  C interface, not via SQL or possibly going direct to
  Berkeley/Sleepycat DB which is now one of the table tables that BDB
  allows.

  Merwyn then finished with an explanation of the parka query
  planning and evaluation.

  I outlined how Redland somewhat overlapped in some of Parka's
  activity but from an RDF point of view.


  12:30 MINDSWAP weekly meeting
  Aditya Kalyanpur, Ron Alford, Amy Loomis, Ronald Reck, Matt Westhoff,
  H. Ross Baker, Mike Grove, Jennifer Golbeck, Bijan Parsia and me

  They all discussed what they'd been doing for the last week and
  then we had a demo of SMORE.  I went over SWAD Europe in outline,
  the survey stuff I'd done/was doing and Redland.


  14:00 Toru Ishida - Social Agents in Digital Cities
  --  http://www.lab7.kuis.kyoto-u.ac.jp/

  Talk on the use of social agents that understand the conventions of
  social systems to foster human/human interaction.  Uses interaction
  scenarios that are described in a language.  Toru was just
  finishing a visit to the mindlab and had been sharing the office
  with Bijan.


Thursday 2002-09-26

  Kendall Clarke arrived, visiting Bijan - he writes for
  XML.com (as does Bijan), O'Reilly and does technical and other
  writing and editing -- http://clark.dallas.tx.us/kendall/

  Aditya gave a demonstration of SMORE to Kendall and I, we commented
  on the user interface since there were several things that seemed
  unclear to us as we watched him.   It allows you to markup some
  text (HTML) written as sentences into triple form, then browse and
  search existing ontologies to find appropriate terms for the
  relationships, classes.  They can be dragged and dropped to form a
  description of the content in RDF/DAML+OIL (soon OWL) form.  The
  resulting description can then be saved.

  Met with Ron Alford and discussed the details of implementing a
  storage backend to Redland, such as they proposed to do with
  Parka.  Lots of looking at source code and explaining stuff that
  really should be in documentation somewhere ;)


  12:00  Freedman's Bureau Project
  -- http://freedmensbureau.com/

  Meeting with Harry Keeling and Jerry ? (Howard University, DC)
  -- http://www.founders.howard.edu/CEACS/Departments/CompSci/
  along with Jen, Jim, Bijan, Toru and Kendall.

  This is a (proposed) digital library project based on taking the
  preserved paper records from the 1865- records made by the
  "The Bureau of Refugees, Freedmen and Abandoned Lands,"
  which dealt with providing support for the newly freed people.  The
  result is a large amount of microfiche, which can then give lots of
  images.  The idea is to set up a system that allows multiple people
  to annotate and comment on the documents in the images; since there
  are lots of views or ways of approaching their description.  Some
  basic ontologies would start it off, but the expectation would be
  to provide ways to have all interpretations representable.

  I was sure I'd heard of a similar project (scanning, images,
  annotations, Dublin Core) but couldn't remember or google for the
  references.


  18:00 CMSC 828y class - AI on the Web
    http://www.cs.umd.edu/users/hendler/CMSC828y/

  Next class and during it I gave an overview of RDF tools and
  resources on the web that they could use.  The programming
  experience here seemed to be majority Java, some Python, 1 perl out
  of the approx 25 students.  I also went through the details of some
  of the RDF bots work - logger and especially foafbot, which uses
  redland underneath, but in the detail of how it manages trust
  relationships.   Also encouraged them to hang out on the #rdfig
  channel.

  Met for the first time, Jordan Katz, a high school student
  attending this graduate class, who has been working on displaying
  RDF via XSLT sheets to give multiple views of the document
  depending on the audience.


Friday 2002-09-27

  10:00 RDF Core Telecon

  Agenda:
  http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2002Sep/0329.html


  11:20 Meeting with Ronald Reck

  Suggested various surveying things he could do on the datasets he
  had that would be useful to find out.  Some of the parka
  restrictions might be worth expanding if common datasets had
  problems with them (large literals say).


  14:00 Redland overview

  I gave a whiteboard outline of Redland and the state of the changes
  I was making to Ron Alford, Bijan, Kendall and Jim (partially).
  Some of this is partially done (iterators change) and some is
  planned (web, query, iostreams) and required to support other
  features.  I tried to show how Parka, querying and relational
  backends would fit into the picture which Redland currently has
  some skeleton support for, but not complete.


Saturday 2002-09-28

  Travelling back via Dulles Airport - got a lot of hacking done on
  Raptor which seems to be reducing in bug count:
  http://www.redland.opensource.ac.uk/raptor/


Plus throughout the week lots of debugging various bits of code and
helping them with python and php access to Redland as well as doing a
little report writing (Monday only).

Received on Thursday, 3 October 2002 05:46:58 UTC