Bio Oriel XML: standards for the representation and exchange of biological information

Alain Viari, Antoine Brun, Anne Morgat
INRIA Rhône Alpes - 655 Av. de l'Europe - Montbonnot 38334 Saint Ismier Cedex - France

Context

The BOX (Bio Oriel Xml-Schema) project is being developed as part of the ORIEL (Online Research Information Environment for the Life Sciences) project (http://www.oriel.org), which is funded by the European Commission (IST-2001-32688) and started in January 2002 for a period of three years.

One of the project's objectives is to define strict standards for the representation and exchange of factual biological information, for use by the other ORIEL partners. In this context, the BOX project addresses the following points:

The representation and the exchange of factual data cannot be treated separately. A document (exchange medium) containing biological data is always built upon an implicit or explicit model (representation medium) of these data. The exchange model is usually explicit (e.g. the EMBL or GenBank flat file formats), but the representation model is not. This often gives rise to ambiguities or difficulties when retrieving the information. In BOX we want to make the representation model as explicit as possible and its link to the exchange model as unambiguous as possible.
Using well-accepted standards is an obvious requirement (unless these standards fail to meet the objective). In this context we use UML for the representation models and W3C XML-Schema for the exchange models.
The BOX specifications (both for representation and for exchange) should be strict. This means that a document that fails to conform to any one of the models should be rejected by a simple generic parser. The validity of a document is not only a question of syntax (e.g. typing "geene" instead of "gene") but also a question of the model's semantics (e.g. asserting that one gene belongs to more than one organism). In the BOX exchange model we therefore set up explicit constraints (e.g. identity and cardinality constraints) that a valid document must satisfy.
The BOX specifications (both for representation and for exchange) should be pragmatic. This means that already existing data should be easy to represent using BOX. Defining models so sophisticated that no existing data can fit into them would be of little practical interest.
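
The strictness requirement above can be illustrated with a minimal Java sketch that uses the standard javax.xml.validation API as the "simple generic parser". The schema below is hypothetical and is not taken from BOXml: the element names genes, gene and organism, the id attribute and the class StrictValidationSketch are assumptions made only for this illustration. It encodes one cardinality constraint (a gene has exactly one organism) and one identity constraint (gene ids are unique).

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class StrictValidationSketch {

    // Hypothetical schema (NOT a BOXml module). The organism element keeps the
    // XML-Schema default of exactly one occurrence (cardinality constraint),
    // and xs:key makes gene ids unique within genes (identity constraint).
    static final String SCHEMA =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='genes'>"
        +   "<xs:complexType><xs:sequence>"
        +     "<xs:element name='gene' maxOccurs='unbounded'>"
        +       "<xs:complexType>"
        +         "<xs:sequence>"
        +           "<xs:element name='organism' type='xs:string'/>"
        +         "</xs:sequence>"
        +         "<xs:attribute name='id' type='xs:string' use='required'/>"
        +       "</xs:complexType>"
        +     "</xs:element>"
        +   "</xs:sequence></xs:complexType>"
        +   "<xs:key name='geneId'>"
        +     "<xs:selector xpath='gene'/><xs:field xpath='@id'/>"
        +   "</xs:key>"
        + "</xs:element>"
        + "</xs:schema>";

    // Returns true if the document conforms to SCHEMA, false otherwise.
    static boolean isValid(String xml) {
        Schema schema;
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            schema = factory.newSchema(new StreamSource(new StringReader(SCHEMA)));
        } catch (SAXException e) {
            // A broken schema is a programming error, not an invalid document.
            throw new IllegalStateException("schema does not compile", e);
        }
        try {
            // The default error handler throws on the first validation error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String ok = "<genes><gene id='g1'><organism>Escherichia coli</organism></gene></genes>";
        String twoOrganisms = "<genes><gene id='g1'><organism>A</organism>"
                            + "<organism>B</organism></gene></genes>";
        System.out.println("conforming document accepted: " + isValid(ok));
        System.out.println("gene with two organisms accepted: " + isValid(twoOrganisms));
    }
}
```

In this sketch the misspelled element "geene", a gene with two organisms and two genes sharing the same id are all rejected by the same generic validator, without any code specific to the model.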

The primary objective of BOX is therefore to provide an open core of well-defined UML and XML specifications for the dissemination of biological data, together with data manipulation tools. The current release of this core library deals with metabolic data and genome annotation data and is composed of:

General design and guidelines documents that make explicit the BOX model and the conventions used for the XML-Schema implementation.
Biological data model specifications together with their XML-Schema implementation and documentation.

Architecture

The BOX package is composed of three distinct components: BOXml, BOXtk and BOXweb.

BOXml

BOXml is primarily a core library of XML-Schema specifications. Its first goal is therefore to provide XML-Schema components (rather than definitive format specifications) to XML designers. To this end, BOXml is organised as modules (15 modules in BOX version 1.3) that may be reused either in BOX or in other XML-oriented projects.

module	description
BoxCore	BOX core types and elements
BoxChemical	types and elements representing chemical compounds and reactions
BoxChemicalKb	root element representing a KB of chemical compounds and reactions
BoxExtChemicalKb	extension of BoxChemicalKb with additional external chemical references
BoxKeggKb	extension of BoxExtChemicalKb for representing KEGG entries (sample)
BoxOrganism	types and elements for representing organisms and taxonomies
BoxOrganismKb	root element for representing a KB of organisms
BoxPathway	types and elements for representing biochemical pathways
BoxPathwayKb	root element representing a KB of biochemical pathways
BoxReplicon	types and elements representing replicons (e.g. chromosomes)
BoxLocation	types and elements representing sequence feature locations
BoxFeature	types and elements representing sequence features
BoxSequence	types and elements representing sequence data
BoxBioSequence	types and elements representing biosequences (i.e. sequence data + related biological information)
BoxBioSequenceKb	types and elements representing a KB of biosequences (i.e. sequence data + related biological information + features)

BOXtk

The second step of the project was to provide a Java programming toolkit to simplify the manipulation and display of BOXml data. The BOXtk toolkit provides the following functionalities:

Transformation tools. BOXtk provides several tools to perform transformations from already existing formats, either XML (NCBIml (http://www.ncbi.nih.gov/IEB/ToolBox/XML) or BSML (http://bsml.org)) or flat files (EMBL (http://www.ebi.ac.uk/embl/Documentation)). Transformations from XML are performed using XSLT stylesheets, and non-XML formats are handled by ad hoc JavaCC (https://javacc.dev.java.net) parsers.
Querying tools. The BOXtk API provides a simple way to query BOXml documents through XQuery.
Java API to BOXml entities. These classes are generated by JAXB (http://java.sun.com/xml/jaxb) from the schemas and provide the entry point for further manipulation of BOXml data within Java programs.
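
As an illustration of the XSLT-based transformation step, here is a minimal Java sketch that uses only the standard javax.xml.transform API. It is not BOXtk code: the class XsltTransformSketch, the source elements entry and seq, the accession attribute and the target elements bioSequence and sequence are all hypothetical names chosen for this example, and the stylesheet is not one of the BOXtk stylesheets.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XsltTransformSketch {

    // Hypothetical stylesheet: maps an <entry> of an imaginary source format
    // to an imaginary <bioSequence> element. Element names are assumptions.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/entry'>"
        +   "<bioSequence id='{@accession}'>"
        +     "<sequence><xsl:value-of select='seq'/></sequence>"
        +   "</bioSequence>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies STYLESHEET to the given XML document and returns the result.
    static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String source = "<entry accession='AB000001'><seq>acgt</seq></entry>";
        System.out.println(transform(source));
    }
}
```

The result of such a transformation could then be checked against the target schema with a generic validator before it is used further, in keeping with the strictness requirement stated in the Context section.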

BOXweb

Finally, the BOXweb component was designed to give the ORIEL partners and the wider community simple web access to BOX resources. To this end, BOXweb provides access to BOXtk services through SOAP/WSDL. It also serves as an introductory guide for developers who want to embed BOXtk in their own web services.

Availability

All documentation and packages are available for download at: http://oriel.inrialpes.fr:8080/box
This website also provides online demonstrations of BOX components.

Contact: Alain.Viari@inria.fr