What sub-problems are we addressing?
Why DSpace should embrace RDF.
How, specifically, should DSpace use RDF, both initially and longer-term?
How does this compare with a closed-world, relational or object-relational database approach?
Can’t we just support RDF as an export format for our chosen DB?
What future research opportunities are enabled if this path is chosen?
What specific tools and components would this require?
Isn't this risky? How will we deal with those risks?
This note proposes a graph-oriented schema-description mechanism, together with a persistent triple store holding those schema descriptions and selected metadata, as a key storage mechanism within DSpace alongside a PostgreSQL O/R store.
Specifically, the proposal suggests that DSpace:
· Rely on established (PostgreSQL) O/R database technology for scalability, transactions, and efficient query
· Use RDF-Schema + DAML/OIL to explicitly capture the DSpace ontology
· Store the resulting ontology-defining RDF in a persistent PostgreSQL-backed triple store
· Use this ontology to drive the creation of closed-world logical (XML-Schema) and physical (PostgreSQL tables) schemas
· Store ad-hoc assertions in the triple store
· Store instance metadata initially only in PostgreSQL, but eventually in the triple store as well. Export instance metadata using standard protocols (e.g. OAI).
· Hold as a goal the use of views generated from the triple store to create re-factored PostgreSQL tables and load them with appropriate instance metadata.
The following needs within DSpace define the scope of discussion in this note.
1. Need for information transparency and longevity.
In the long run, DSpace will be defined more by its corpus of information than by the specific software system that stores, provides access to, and distributes that information. The DSpace corpus will outlast the DSpace system. External agents need transparent mechanisms to access all of the corpus, to understand its semantics, and (eventually) to migrate it to subsequent system(s).
The DSpace corpus is defined by more than just the information that was originally submitted. It also includes metadata that is generated over time to aid in storing, administering, accessing, and distributing the content. Examples of such metadata include how the information is currently structured, how it has been transformed and what person or agent transformed it, how it has been used, how it has been described over time, and what other information is related to it.
2. Need for active management of many schemas, over a period of time.
The DSpace corpus will be characterized by many schemas, most of which will be simple. It will include many different types of holdings, digital media, and content formats. The system will need to generate, understand, and use metadata to manage structural, administrative, contextual, and descriptive information about the currently available types of holdings, digital media, and content formats, as well as about the holdings themselves.
Librarians and DSpace administrators will need to maintain a set of logical schemas that describe the metadata used within the system. These schemas may have ontological relationships to one another, which they will also need to maintain.
The particular schemas needed to manage the DSpace corpus will change over time. That is, as new types of content arrive and the needs of the designated community shift, new schemas will come to be required, and other schemas will cease to serve the purpose for which they were originally intended.
System software is likely to become dependent, at least to some degree, upon the current set of schemas, which strengthens the need for the schemas to be actively managed. The relationships between the schemas and the system software need to be well understood, as does the impact upon the software of changes to the schemas.
3. Need to accept information and begin creating instance metadata before schemas are fully understood. That is, a need to enable, and be resilient to, evolving schemas.
It is not possible to fully anticipate the metadata needs associated with new types of holdings, new ways to use holdings, or new functionality within the archive.
While implementers and the designated community are exploring these needs, the archive needs to be able to begin accepting initial content.
In practice, schemas are often emergent. An initial attempt is made to define a schema, but the real needs are discovered through capture and attempted use of a corpus of content.
4. Need for external and ad-hoc assertion capabilities.
Because of its institutional affiliation, the DSpace system will be the system-of-record for maintaining the DSpace information corpus. But other humans and systems will want to refer to the corpus and make statements (assertions) about it, including references to its schemas and instance metadata. This is problematic if the schemas are changing with time.
The set of schemas maintained by DSpace administrators is market-driven. That is, the schemas reflect the information needs that librarians have found to satisfy most of the designated community of consumers, as well as the information needs of the system itself, given the features the market demands it contain. But corner cases abound. The schemas within the system will not cover the information needs of the entire market.
Since schemas are often emergent, it is important that information that doesn’t fit into current schemas not be lost to the system (and its administrators). Individuals contributing information that does not fit nicely into current schemas need a way to enter it in an ad-hoc manner.
5. Need for a reasonable-cost way to periodically re-factor schemas and/or dependent software.
Eventually, market demands may cause schemas to shift so substantially that they must be radically reorganized into new logical groupings. When this happens, the underlying physical implementations will likely need to be revamped as well. This is undeniably a long-term cost that DSpace host institutions must anticipate. From a system design perspective, strategies to minimize these long-term costs are important.
· RDF-Schema + DAML/OIL allow schemas and ontologies to be specified and processed in a way that is transparent, exportable, machine-processable, and (asserted to be) standard.
· A DSpace ontology captured in RDF-Schema + DAML/OIL would provide an explicit definition of what information is used within the system, and how that information is related and organized. As such, the ontology could be actively managed.
· RDF is particularly tolerant of “mostly structured” data, and of data whose structure is emergent.
· Use of RDF-Schema + DAML/OIL would enable both external and ad-hoc assertions about DSpace schemas and their instances. These could be enabled without necessarily relying on a triple store for the bulk of DSpace instance metadata.
· Structured-only approaches on RDBMS or ORDBMS substrates carry significant long-term risk induced by schema rigidity.
· RDF enables future work on ways to automatically produce re-factored physical views of data whose structure has evolved over time. This is an important problem to solve for the future of information and services on the web.
· Adopting an RDF-oriented technology stack within DSpace enables future research on the DSpace platform in areas such as provenance, inferencing for service discovery, automatic composition of transformations, graph-oriented query mechanisms, scalability of triple-oriented data stores, etc.
· Rely on existing (PostgreSQL) O/R database functionality for scalability, transactions, and efficient query.
For the foreseeable future, de-risk by not relying on a triple store for primary storage. There is a series of steps the project could take as a triple-store implementation is established, using each step to build sufficient confidence to enable the next. First, the triple store would store only information about the DSpace ontology and schema. Next, ad-hoc assertions could be added. Then, the triple store could mirror the primary store. Finally, a triple-store-only implementation might be possible.
· Use RDF-Schema + DAML/OIL to explicitly capture the DSpace ontology.
Publish and version the DSpace schema. This allows a canonical representation of the DSpace schema and instance metadata outside the system, which further enables external parties to make whatever statements they desire about any information resource in the DSpace corpus.
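To make this concrete, here is a minimal sketch, in Java, of publishing one small fragment of such a schema using the open-source Jena RDF toolkit. The namespace URI, class, and property names are invented placeholders rather than an agreed DSpace schema, and the API usage is illustrative:

    // A minimal sketch of expressing one fragment of the DSpace ontology in
    // RDF-Schema via the Jena model API.  The namespace, class, and property
    // names are hypothetical placeholders, not an agreed DSpace schema.
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Property;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.vocabulary.RDF;
    import org.apache.jena.vocabulary.RDFS;

    public class SchemaSketch {
        // Hypothetical, versioned namespace for the published DSpace schema.
        static final String NS = "http://dspace.org/schema/0.1#";

        public static void main(String[] args) {
            Model schema = ModelFactory.createDefaultModel();

            // Declare an "Item" class and a "bitstreamFormat" property on it.
            Resource item = schema.createResource(NS + "Item")
                                  .addProperty(RDF.type, RDFS.Class);
            Property fmt = schema.createProperty(NS, "bitstreamFormat");
            schema.add(fmt, RDF.type, RDF.Property);
            schema.add(fmt, RDFS.domain, item);

            // Serialize, so external parties can fetch and cite the schema.
            schema.write(System.out, "RDF/XML");
        }
    }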
· Leverage existing RDF-schema definitions for standard descriptive-metadata formats (e.g. Dublin Core, MARC).
These exist, so using them seems to make sense.
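For example, here is a sketch of describing an item with the Dublin Core vocabulary that ships with Jena, rather than inventing DSpace-specific properties (the item URI and literal values are invented):

    // Sketch: describe an item with the existing Dublin Core vocabulary
    // (Jena's DC_11 class).  The item URI and values are invented examples.
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.vocabulary.DC_11;

    public class DublinCoreSketch {
        public static void main(String[] args) {
            Model m = ModelFactory.createDefaultModel();
            m.createResource("http://dspace.org/item/4711")  // hypothetical item URI
             .addProperty(DC_11.title, "Some Working Paper")
             .addProperty(DC_11.creator, "A. Author");
            m.write(System.out, "RDF/XML");
        }
    }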
· Store the resulting ontology-defining RDF in a persistent PostgreSQL-backed triple store.
Establish a triple store as another storage mechanism (besides relational tables) within DSpace. For manageability, a PostgreSQL-backed implementation would seem to simplify things, since we already use PostgreSQL to store instance metadata in relational tables.
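If no off-the-shelf PostgreSQL-backed triple store proves suitable, a bare-bones fallback is a three-column table driven over JDBC. The table layout and connection details below are assumptions for illustration only:

    // Bare-bones sketch of PostgreSQL-backed triple persistence over plain
    // JDBC: one row per (subject, predicate, object) assertion.  Table name
    // and connection details are assumptions.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class TripleStoreSketch {
        public static void main(String[] args) throws Exception {
            Connection c = DriverManager.getConnection(
                "jdbc:postgresql://localhost/dspace", "dspace", "secret");
            try (Statement s = c.createStatement()) {
                s.execute("CREATE TABLE IF NOT EXISTS triple ("
                        + " subject   TEXT NOT NULL,"
                        + " predicate TEXT NOT NULL,"
                        + " object    TEXT NOT NULL)");
            }
            // Store one schema-level assertion: the Item class has a label.
            try (PreparedStatement ps = c.prepareStatement(
                    "INSERT INTO triple (subject, predicate, object) VALUES (?, ?, ?)")) {
                ps.setString(1, "http://dspace.org/schema/0.1#Item");
                ps.setString(2, "http://www.w3.org/2000/01/rdf-schema#label");
                ps.setString(3, "Item");
                ps.executeUpdate();
            }
            c.close();
        }
    }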
· Use this ontology to drive the creation of closed-world logical (XML-Schema) and physical (PostgreSQL tables) schemas.
That is, create tools that, from the canonical RDF-Schema + DAML/OIL definitions, can generate an XML-Schema definition (.xsd) file that defines valid logical instances of the schema in XML, create table definitions in PostgreSQL, and perhaps also generate Java objects that correspond to the logical schema, display a UI to capture an instance, and so on.
This is similar to Peter Breton’s existing proposal for a simple O/R mapping layer within DSpace, except that instead of the logical definition being an XML-Schema (with perhaps only a subset of XML-Schema features supported), this proposal considers the source to be RDF-Schema. Implicit in the success of this strategy is keeping the schemas quite simple. In fact, from a simple DSpace-proprietary schema configuration we might be able to generate all three artifacts (RDF-Schema, XML-Schema, and relational tables).
Succeeding with this approach requires resolving the low-level data-typing issues that were consciously omitted from the initial RDF-Schema specification, with the expectation that they would be drawn from work on XML-Schema. Resolving this issue is a deliverable of the current semantic web working group.
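A sketch of what the table-generation half of such a tool might look like, walking one class of the schema with Jena and emitting PostgreSQL DDL; the type mapping is deliberately naive (everything becomes TEXT) precisely because the data-typing question is unsettled:

    // Sketch: walk every property whose rdfs:domain is a given schema class
    // and emit PostgreSQL DDL for a corresponding table.
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.rdf.model.Statement;
    import org.apache.jena.rdf.model.StmtIterator;
    import org.apache.jena.vocabulary.RDFS;

    public class TableGenSketch {
        static String toDdl(Model schema, Resource clazz, String tableName) {
            StringBuilder ddl = new StringBuilder("CREATE TABLE " + tableName + " (\n");
            ddl.append("  id SERIAL PRIMARY KEY");
            StmtIterator it = schema.listStatements(null, RDFS.domain, clazz);
            while (it.hasNext()) {
                Statement st = it.next();
                // One column per declared property of the class.
                ddl.append(",\n  ").append(st.getSubject().getLocalName()).append(" TEXT");
            }
            return ddl.append("\n);").toString();
        }
    }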
· Store ad-hoc assertions in the triple store.
Since the triple store would exist, as would a definition of the types of information objects that DSpace deals with, individuals could contribute assertions about the corpus as they saw fit, and we could store them with little extra effort.
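For example, a third party might assert a relationship between two corpus items for which the current closed-world schema has no column; a sketch, with an invented property and item URIs:

    // Sketch: an ad-hoc assertion that the current schema cannot hold.
    // The property and item URIs are invented examples.
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Property;

    public class AdHocAssertionSketch {
        public static void main(String[] args) {
            Model assertions = ModelFactory.createDefaultModel();
            Property supersedes =
                assertions.createProperty("http://example.org/terms#", "supersedes");
            assertions.createResource("http://dspace.org/item/4711")
                      .addProperty(supersedes,
                                   assertions.createResource("http://dspace.org/item/4710"));
            // 'assertions' would then be persisted into the triple store.
        }
    }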
· Store instance metadata initially only in PostgreSQL, but eventually in the triple store as well. Export instance metadata using standard protocols (e.g. OAI).
Relational tables would remain the primary metadata store for the foreseeable future. As the tool path was demonstrated and we built confidence, we could add an extra spigot for incoming metadata and store it in the triple store as well as in relational tables.
Instance metadata would be exported via standard protocols (e.g. OAI protocol for metadata harvesting), with schema definitions for export deriving from the RDF-schema definitions.
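Here is a sketch of that extra spigot, with the relational table remaining authoritative and the same fact mirrored as a triple (the table and column names are assumptions):

    // Sketch of the extra spigot: each incoming metadata value is written to
    // the authoritative relational store and mirrored into the triple store.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.vocabulary.DC_11;

    public class DualWriteSketch {
        static void recordTitle(Connection db, Model triples, int itemId,
                                String itemUri, String title) throws Exception {
            // 1. Primary store: the relational table remains authoritative.
            try (PreparedStatement ps = db.prepareStatement(
                    "UPDATE item SET title = ? WHERE item_id = ?")) {
                ps.setString(1, title);
                ps.setInt(2, itemId);
                ps.executeUpdate();
            }
            // 2. Mirror the same fact as a triple for later schema-evolution work.
            triples.createResource(itemUri).addProperty(DC_11.title, title);
        }
    }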
· Hold as a goal the use of views generated from the triple store to create re-factored PostgreSQL tables and load them with appropriate instance metadata.
A long-term goal is a solution to the “schema rigidity” problem. With instance metadata in a triple store, and schema definitions in RDF-Schema, an appropriate toolset ought to be able to evolve the schema and create a new set of views from the triple store, which could be used to load a new set of relational tables corresponding to the evolved schema. Then the old tables could be decommissioned. This might not perform particularly well, but it would be needed only infrequently. The benefit sought is a lower cost to re-factor and to get the new set of tables up and running.
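A sketch of the load half of this idea, selecting one property’s assertions from the triple store and loading them into a freshly generated table (the new table and its columns are hypothetical):

    // Sketch: select one property's assertions from the triple store and
    // load them into a new table generated for the evolved schema.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.RDFNode;
    import org.apache.jena.rdf.model.Statement;
    import org.apache.jena.rdf.model.StmtIterator;
    import org.apache.jena.vocabulary.DC_11;

    public class RefactorSketch {
        static void loadNewTable(Model triples, Connection db) throws Exception {
            StmtIterator it = triples.listStatements(null, DC_11.title, (RDFNode) null);
            try (PreparedStatement ps = db.prepareStatement(
                    "INSERT INTO item_v2 (uri, title) VALUES (?, ?)")) {
                while (it.hasNext()) {
                    Statement st = it.next();
                    ps.setString(1, st.getSubject().getURI());
                    ps.setString(2, st.getString());  // literal object's lexical form
                    ps.executeUpdate();
                }
            }
        }
    }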
There is much in common. In fact, DSpace currently runs on PostgreSQL relational tables, with some support for initial setup and no explicit support for adding new schemas while the system is operating. The proposal augments the primary relational store for instance metadata with a rich definition of the schemas themselves, so that the schema definitions can be managed independently from the instance metadata. The hope is that this will enable some automated support for updating relational tables and instance metadata as the structured schemas evolve in a controlled manner.
The proposal does not initially (and may never) rely on the triple store for transactional capability, or for query. It does not initially rely on the triple-store to store massive amounts of instance metadata. In this respect, many of our eggs are firmly in the relational DB basket.
Further, since the initial focus is not on query performance or large-scale capacity in the triple store, the proposal focuses on triple-store implementations that are RDBMS-backed, and in particular that use PostgreSQL to provide persistence.
The proposal does not advocate that DSpace adopt a semi-structured data model. It proposes that DSpace retain a closed-world, fully structured data model. It further proposes that DSpace use some tools from the semi-structured world to define the ontology and schema used by the system, with an eye towards lowering the cost to DSpace administrators of maintaining and evolving that ontology and schema in a controlled manner. Finally, it proposes that DSpace implement a fairly independent facility to capture ad-hoc assertions that do not fit into the current, closed-world, fully-structured DSpace schema.
I don’t think it means anything to “export RDF” unless there is an RDF-Schema available that defines what it is that we are exporting. The “concepts” used by DSpace need to be published so that exported instances can be interpreted.
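For illustration, the exported instance below is interpretable only because its property URI resolves into the published (here hypothetical) DSpace RDF-Schema:

    // Sketch: an exported instance whose meaning depends on the published
    // schema namespace (the same hypothetical namespace used earlier).
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    public class ExportSketch {
        public static void main(String[] args) {
            Model export = ModelFactory.createDefaultModel();
            export.setNsPrefix("ds", "http://dspace.org/schema/0.1#");
            export.createResource("http://dspace.org/item/4711")
                  .addProperty(export.createProperty(
                          "http://dspace.org/schema/0.1#", "bitstreamFormat"),
                      "application/pdf");
            export.write(System.out, "RDF/XML");
        }
    }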
Here is a brainstorm:
· Usage of DSpace as a platform for research in areas such as provenance, inferencing for service discovery, and automatic composition of transformations.
· Research involving relationships among DSpace schemas, query translation, etc.
· Research into triple-oriented storage: query expression, query efficiency, transactions, persistence mechanisms, scalability.
This is an exercise that we need to complete in the process of evaluating this proposal, or in the first steps as it moves forward.
There are definitely risks that need to be managed. Here are some that I have thought of. I welcome your thoughts about preventive measures and/or contingencies to help us manage these risks.
· We discover large missing pieces in the tool chain.
· The tool chain is complete but is found to be immature and unstable.
· We find that we could establish a complete and stable tool chain, but that the resources required to do so exceed those available within the Cambridge DSpace team.
· RDF, RDF-Schema, and the rest of the W3C stack do not become standard; the industry moves in a different direction.
· The goal of supporting “emergent schemas” and schema evolution turns out, from an organizational, usage, and adoption perspective, to be misguided.