Andy's comments thus.

These comments are part of the process of preparing the "relevant technologies" document for the technical plenary 23-24 July 2003.

Some structure might help the plenary discussions:

There is a Semantic Web platform layer which covers:

ontology (vocabulary, schema etc etc)
- registry
- ontology versioning and metadata upgrade (? or is this in the "content" layer?)
- community creation and modification of registered ontology
storage and metadata
- One or more units
- History, metadata, content
- Needs to be able to support "digital objects" if the upper layers use such
distribution
- SIMILE is a federation of content/metadata servers
- Unclear there is any requirement to do more than that!
Provenance
- It does not appear much in the text but we may need to track the origins of content and metadata in SIMILE at some level to be decided.

There is a Content Layer which covers:

Data models specific to content in SIMILE or the library domain
validation of incoming metadata WRT an ontology
Note metadata can be tightly tied to content
Content storage
?? Object editting and manipulation
Mutliple presentations

There is a Domain Layer (Application Layer?) which covers:

Processes specific to the library domain e.g. additional verification; quality control

There is a Client Layer which covers what SIMILE clients do. The server is presumably rather neutral to the nature of the clients.

In looking through the document, I can identify some significant areas for prototyping and investigation:

Ontology registry and lifecycle, including how vocabularies are associated with metadata

Platform level: start small and grow. Continuous prototyping.

Content standards & metadata standards (inc compound objects?) and systems. FEDORA, SCAM, etc etc

As to (2), one approach is to use a continuous prototyping development style, whereby each cycle of the platform development implements just enough to get some usage working. This is a style from "Extreme Programming". This avoids the "mega-design" effect whereby the platform is expected to do everything. Instead, the objective is to get something, how ever small, working so feedback from the components on top of the platform are based on a concrete system, not a paper design. Each cycle should be short.

SIMILE Relevant Technologies

Mick Bass mailto:mick.bass@hp.com

Mark H. Butler mailto:mark-h.butler@hp.com

June 16, 2003

Printable PDF Version: http://web.mit.edu/simile/www/documents/relevantTechnologies/technologies.pdf

Issue List: http://web.mit.edu/simile/www/documents/researchDrivers/rd_issues.html

Public, Archived, Feedback To: mailto:www-rdf-dspace@w3.org

RDF Issue List: http://web.mit.edu/simile/www/documents/researchDrivers/rd_issues.rdf

RDF Bibliography: http://web.mit.edu/simile/www/documents/researchDrivers/simileBibliography.rdf

1 Introduction

This isn't about relevant technologies so much as being about the problem space. For many huge areas, SIMILE's appraoch will be to survey existing techniques.

Missing:

Provenance (of content and of metadata)
Copyright and licensing
Charging (or at least explicitly making out of scope).

The SIMILE team, working with the SIMILE PIs, has identified a number of motivating problems for SIMILE. These motivating problems are summarized in the mindmap shown in Figure 1 and will be detailed in more depth in the subsequent text.

**Figure 1:** Motivating Problems

2 Content Services

Currently one of the roles of DSpace, and hence SIMILE, is to act as a content repository. Key problems here include how to leverage that content better:

Content augmentation: Content augmentation is improving the functional presentation of an existing content asset. For example one important requirement is turning citations embedded in a document into hyperlinks to other internal and external assets.
Content transformation: Content transformation involves transforming an asset from one form into another, for example the conversion of Microsoft Word files to Adobe PDF or TIFF images to JPG.

Content Augmentation : subcase of content transformation

I don't see the difference here between content change and metadata change.

3 Metadata

Metadata contains information about content, either from human input or from extraction or generation processes so key problems around metadata include:

Metadata augmentation: Metadata augmentation refers to human-corrected and machine added elements in an instance of a metadata schema. It is the process of adding to or modifying the metadata of a long-term electronic record without degrading the evidentiary status of the metadata [vic].
Metadata extraction: Metadata extraction refers to the extraction and codification of metadata either from other existing metadata and content. An example here is the extraction of embedded track information from an MP3 file.
Dynamic Metadata: There is a distinction between extracting dynamic metadata, and copying the metadata out of an asset. Dynamic metadata may change over time, and must be verified by crosscheck with the source for that data for example RSS feeds. Note Haystack allows users to do event subscriptions on changes to an underlying RDF statement. If you abstract beyond the statement to where the statement came from, then the statement should be updated any time the data source is updated. Copying out metadata on the other hand is more useful for relatively static assets. Print analogues, such as PDF files, may be stable enough to copy out the metadata only when a new revision of the document is produced.
Data mining: Data mining involves looking for recurrent relationships between records, rather than a given record. It can either occur on the raw content, or after metadata extraction from the content and typically involves machine-learning methods based on clustering or classification.
Out of scope for SIMILE?

Good example of an external (non-core) service that fits into a metadata generation service model whereby plugin services can add/refine metadata from the base inputs. Other example include the augmentation/transformation of metadata and content.
Metadata generation: Metadata generation refers to when metadata is created rather than extracted. This may occur using data mining methods as outlined above, i.e. performing metadata extraction from MEDLINE records and then data mining that information. Alternatively it could be done via an annotation service i.e. live input from humans.
Metadata presentation: Metadata presentation refers to the presentation of metadata to the user in a form that is both meaningful and easy for them to explore.

May need something about provenance of metadata : this would require support from the base platform, or at least an idea that the base platform is sufficient.

3.1 Instance data and schemas

Some reorganistion might help here:

New 3.1 :: Current 3.1
New 3.1.2 :: Content
- Naming
- There isn't much else on this in the document at the moment but trsnformation and delivery feature here.
New 3.1.3 :: Metadata
- Validation, syntactic and semantic (I prefer the phrase "domain-specific validation" to "semantic validation")
New 3.1.4 :: Vocabularies and Ontologies
- diversity (current 3.2)
- vocabulary evolution (lifecycle current 3.3)
- including open/closes discussion
- Ontology transformation

Instance data: Instance data refers to information in our model that describes instances in the external world. In symbolic AI, a distinction is made between assertional and terminological information, i.e. ABox and TBox respectively. Instance data is assertional information.
Schema: The term schema has several different meanings. In general terms, a schema is the set of descriptors we are using in our model to describe the external world. More specifically we may potentially deal with both XML and RDF schema languages. XML schema languages describe XML document structure and support validation. RDF Schema on the other hand describes the structure of a specific RDF vocabulary i.e. declares inheritance relationships for classes and properties, provides mechanisms for labelling and commenting, and provides domain and range constraints that constrain the subject and object of properties.
Controlled Vocabulary: A controlled vocabulary is finite set of descriptors. It may optionally provide a thesaurus of descriptors, synonyms, preferred usage terms, relationships among terms and aids to selecting the best terms [FLS].
Ontology: In general terms, an ontology is an explicit specification of a conceptualization. The term is borrowed from philosophy, where an ontology is a systematic account of existence. When the knowledge of a domain is represented in a declarative formalism, the set of objects that can be represented is called the universe of discourse. This set of objects, and the describable relationships among them, are reflected in the representational vocabulary with which a knowledge-based program represents knowledge. Thus, in the context of AI, we can describe the ontology of a program by defining a set of representational terms. In such an ontology, definitions associate the names of entities in the universe of discourse (e.g., classes, relations, functions, or other objects) with human-readable text describing what the names mean, and formal axioms that constrain the interpretation and well-formed use of these terms. Formally, an ontology is the statement of a logical theory [Gru]. However as we are using ``ontology languages'', ontology may have a very specific meaning that is determined by the particular ontology language in use i.e. an ontology is what the ontology language allows us to describe.

3.2 Diversity Of Vocabularies

One of the drivers for SIMILE is coping with schema diversity, which occurs due to differences in the intended uses of instance data. For example, consider the differences between schemas used by services or agents to describe services that can operate on content or metadata by transforming or enhancing it in some way with preservation schemas that describe how digital resources are preserved i.e. their component parts, their inter-relationship and the changes they have undergone. Other possible applications of schemas include technical schemas, presentation schemas and policy schemas describing the type of operations that can be performed on the content and the metadata.

Another driver of schema diversity is the community-driven forking, i.e. where different schemas have the same basic purposes but have evolved within or serve different communities. For example both BibTex and Dublin Core are applied to a similiar subject area, but they have differences because these standards are aimed at different applications. Other examples here include IMS [ims], MARC [MAR] and MODS [mod].

There was going to be a comparison of various metadata standards - this would be a useful input into scoping SIMILE work in the metadata and vocabulary areas to have a brief comparision here, not just links.

3.3 Information Lifecycle

One key task involved in the creation of metadata is creating schemas and ontologies i.e. formalisms that capture that metadata. There are two axes that define metadata: the schema in use i.e. the metadata structure and the controlled vocabularies in use. Sometimes the schema and the vocabularies are controlled by a single standard but in other instances they evolve independently. Schemas have a specific lifecycle with distinct stages:

Schema Discovery: When a user wants to encode metadata about resources and they have not encoded information about resources of that type before, they have two options: create a new schema or re-use an existing schema. Although users tend towards creating new schemas, users should be encouraged to re-use existing schemas whenever possible. Existing schemas should be extendable so that where necessary the users can extend or enhance them, as this supports both reuse and accuracy of information capture. One key issue in the schema lifecycle is how to guide users toward discovering existing schemas suitable for their needs in an effortless way. Another option is to create new schemas, but then relate those schemas to other schemas at a later time. For more details on this, see Section 3.5.
Schema creation: Even if the users consider schema re-use, there will almost certainly be situations which require either the extension of existing schemas or the creation of new schemas. Creating new schemas is a non-trivial task, so users need to be guided so they use best practices to use when creating schemas [cre], [NM].
Schema evolution: During the lifetime of the schema, it may be necessary to make changes to the schema structure or the vocabularies in use. This may involve subdivision, aggregation, addition or removal of properties or terms.
Schema maintenance: Apart from schema evolution, schemas also change because they contained errors e.g. property names are ambiguous, the restrictions on properties are too lax, there is no controlled vocabulary, there are disagreements about the property or the class hierarchy etc. Hence it may be necessary to introduce changes to the schema to correct these errors.
Schema versioning: As the schema evolves and undergoes maintenance, it may be important to manage these changes so that the schema changes in an orderly, controlled way. This necessitates the creation of different versions of the schema so we know what changes are present in different instances of the schema at different times. When versioning schemas, there are different schema design decisions which can have different impacts on legacy metadata and also on search functionalities, so it may be valuable for the user to have some guidance about how to version schemas. It may also be useful for them to receive some assistance about how to update existing metadata or existing queries that operate on the system.
Versioning instance data: As well as versioning schemas, another problem is versioning instance data and content data. For example with instance data, there may be different versions of the data that correspond to different versions of the schema. Alternatively, there may be different versions of content data, so there may be different versions of instance data as a result. Alternatively the version of schema used in a collection of instance data may vary across the collection. Dealing with multiple versions may have an impact on data consistency and introduce merging issues.

See comment in 3.1

This is the vocabulary lifecycle. The metadata and content lifecycle is covered by the history store; can that be reused here in anyway?

3.4 Open schemas versus closed schemas

The property-centric approach taken by RDF and RDFS opens up the possibility of incremental additions to schemas which do not invalidate current data or schemas, allowing different versions to coexist. This approach is known as ``open schemas'' and is the opposite of ``closed schemas'' which typically adopts a more controlled approach to schema evolution with discrete versioning steps. Therefore one open question is whether the additional benefits of supporting a more continuous evolution via open schemas can overcome the associated additional challenges. There are several aspects of the Semantic Web stack that seem to support such evolvable, open vocabularies.

First, as already noted, Semantic Web languages like RDFS and OWL enable you to describe the classes and properties that make up some vocabulary and the relationship between them, rather than syntactically validate a record. Thus they are not really schema languages but rather vocabulary definition languages. You can do data validation against them but you have flexibility in how do that. A good explanation of this is given in the RDF primer [rdf]. This allows you to describe the features of your vocabulary and yet separate that from the policies of how data is be validated against that vocabulary i.e. whether to allow partial records.

Second, these languages tend to be property-centric. This means you can add a definition of a new property without necessarily invalidating any of your current property or class descriptions. Contrast this to, say, object oriented languages where once you have described the slots on an object you can not add another slot without creating a new sub-class.

Thirdly, types i.e. classes in RDF do not need to be exclusive. That means that a resource can be an instance of several overlapping classes at the same time, in particular it can be both an instance of a 1.0 version of some class and a 1.1 version of the class, at the same time.

The terms "open" schema is confusing:

The second paragragh is correct. In RDFS, the allocation of properties is not "open" as described. "Open" means that a single RDFS document does not necessary completely define the vocabulary; there can be other defintions. But the URI mechanism for properties means that the namespace is controlled by the vocabulary author so spurious properties should not arise in the same namespace without the namespace controllers permission.
I don't see why "property centric" and "incremental additions" are connected.
Using the term "schema" for RDFS is correct but confusing. The working group is downplaying the work "schema" in favour of vocabulary so "vocabulary" is a better term. In SIMILE, we are likely to use RDFS defintions for checking/validation, but that is not the presumed purpose of an RDFS vocabulary.

However not all the issues around open schemas are currently solved. For example extending instance data without providing a concrete schema is often undesirable and as the section on processing models highlights, mechanisms for finding schemas for a given piece of instance data are not yet standardised.

Not sure of the significance of this text at this point.

3.5 Relationships and interoperability

One of the key technologies of the Semantic Web is using either schema languages or ontology languages to perform inferencing i.e. automatically deduce additional assertions from existing assertions. A basic form of inferencing just deals with class and property hierarchies as defined by RDFS i.e. relationships with transitive closure. For example we might define that the class dolphin is a subclass of the class mammal. If the dolphin class also had a property that indicated that dolphins can swim, then we could search for resources on swimming mammals and we could identify dolphin as meeting our criteria. Using inferencing in this way is referred to as inferencing search and it can be used to interoperate between different schema: imagine we have several heterogeneous schemas and a standardized schema. If we can define class and property relationships between the heterogeneous schemas and the standardized schema, then we can use inferencing search to search metadata in the hetereogenous formats using terms from the standardized schema.

Ontology languages like OWL allows a richer set of inferences than RDFS based on cardinality constraints, transitivity relationships, equivalence of classes and individuals, functional relationships, class combinations using unions and intersections and disjoint relationships between classes. Of these relationships, equivalence is particularly important for for interoperability. There are three types of equivalence that are relevant:

Property equivalence: Property equivalence is when a property defined in one schema is equivalent to a property defined in another. This might be used for mapping between two different schemas that use different property names.
Class equivalence: Class equivalence is when two classes defined in different schemas are equivalent.
Value equivalence: Value equivalence is when a property with a specific value in one schema is equivalent to a property with a specific value in another. This might be used for mapping between different controlled vocabularies.
Instance equivalence: Instance equivalence is when two records refer to the same thing. This might be used to merge two versions of the same record.

Equivalence also has transitive closure so can be considered part of basic inferencing search. The other relationships do not necessarily have transitive closure, therefore inferencing using these relationships may be NP-hard or even undecidable in terms of computational complexity. Hence these relationships may not be considered as part of basic inferencing search capability.

There are two types of issues that can occur for both schemas and vocabularies: the first is how to deal with different versions of the schema and vocabulary and the second is how to map between different schemas and vocabularies. Here are some examples:

Mapping between different versions of the same schema: One schema uses a property called authorName whereas another uses two properties called authorFamilyname and authorForename that are both subproperties of authorName i.e. subdivision has occurred. It is easy to map between the divided properties to the undivided properties, but it may or may not be possible to map the opposite way. For example if authorName has been encoded authorForename authorFamilyname then the mapping is trivial, but if it has been encoded inconsistently, it may not be possible. In this case we are faced with having to review all the instance data using this property, or using automated tools to try to identify the inconsistent records in order to simplify the reviewing task.
Mapping between different versions of a vocabulary: Here version of a vocabulary might use two values called East Germany and West Germany whereas a later version of the vocabulary uses Germany i.e. an aggregation has occurred. As in the previous case, it may be able to solve this problem automatically i.e. mapping East Germany to Germany. However, when reversing the mapping it may be necessary to review all instances. Generally reviewing instances due to vocabulary change is less time consuming than reviewing instances due to schema change, as vocabulary changes may only entail a subset of entries whereas schema changes may entail all entries.
Mapping between different schemas: This is more difficult than the schema versioning case since different versions of the same schema are likely to maintain the same ontological commitment but this is not necessarily true for different schemas. For example consider the difference between a schema that describes events as first class and a schema that describes entities as first class. Clearly, just as when mapping between different schema versions, there will be situations when it will be possible to perform automatic mapping and situations when there is insufficient information or the information is sufficiently ambiguous to require human intervention. However, arguably, due to the difference in conceptualizations, mapping between schemas is more likely to require the latter than mapping between different vocabulary versions.
Mapping between different vocabularies: Mapping between different vocabularies is hard, due to the size of some of the vocabularies. It is not uncommon to encounter vocabularies with around over 200,000 terms. In order to map between two vocabularies, it is necessary to create a thesaurus that maps individual vocabulary items in one vocabulary to the other. There are other possible complexities here: for example subdivisions when a term in one vocabulary maps onto multiple terms in the other will require human intervention to disambiguate.

As a Semantic Web platform, SIMILE will use inference. There are significant unknown engineering issues of scaleablility and performance to be dealt with.

The identified equivalence relationships are a subset of OWL.

3.6 Semantic Validation

Semantic Validation of metadata involves checking whether it has a structure that conforms to the schema in use and checking whether it uses values that correspond to the particular data types or controlled vocabularies used.

In addition to checking whether the structure conforms to the schema, there may be other possible validation rules: for example instances of a certain class may be required to have certain properties; or if an instance has a certain property there may be restrictions on the other properties it can have etc. Such validation rules can either be enforced when metadata is entered into the system or during an ingesting stage if entry has occurred elsewhere.

It is desirable that Semantic Validation is performed on human entered data to guard against errors and inconsistencies. However there are limits to the type of validation that can be performed: for example with a controlled vocabulary, we can validate a property value conforms to the vocabulary but we cannot validate that it accurately reflects the real world. Therefore Semantic Validation should be performed on a ``best-effort'' basis to guard against errors and inconsistencies.

All true but should we fold all validation (syntactic, semantic, domain-specific) together into a single section

3.7 Merging

When different information sources are merged, whether they are different library catalogues of contact information from phones, PDAs or Outlook, there is always a danger of duplicate or conflicting information. There are two problems here: first how to identify the records to be merged, and second how to merge heterogeneous information sources.

In the second problem it is possible to distinguish a number of different situations: firstly you may have two very different pieces of information about the same object, for example when two different schemas have been used to describe the same piece of information. Secondly you may have two very similar pieces of information for example different versions of the same instance metadata. In either case, you may have duplicate information, which it may be possible to resolve automatically i.e. when one record is older than the other. Other times it will require human intervention to determine the correct merged record.

Why is "merging" separated out from transformation, augmentation and indeed modification? These seem more general.

3.8 Naming

Need (domain-specific) survey/catalog of existing and emerging approaches (handles, URNs/DDDS, other digital object schemes) in the area of SIMILE's use. A new, general mechanism is outside the scope of SIMILE.

Semantic web platform: Separate out what a semantic web platform needs to provide for the domain-specific approach(es) choosen. Validate URI/resource model; va;lidate the identifying properties approaches.

This section has useful discussion but as part of the research problem scoping, the discussion could move to a separate note.

How much content naming

Naming is the general problem of being able to refer to a specific resource. Here are some examples:

A film class has identified scenes in a movie as representative of a particular technique. The film class must have a way to name that region of the movie, even though it is smaller than the smallest asset that the repository tracks. For film it might be minutes, seconds, and frames bracketing the work.
An annotation annotates a specific ``contributor.author'' element within an instance of Dublin Core. By what mechanism does the reference occur?
A DC:Subject element contains a value drawn from a particular version of a controlled vocabulary. How is the vocabulary, and the value within it, referred to?
How can collections of instance metadata be referred to e.g. all of the OCLC-originated annotations to instances of Dublin Core for resource R?
Unique names are typically doled out by centralized authorities e.g. social security numbers and domain names. How does one locally come up with a name for a global resource? For example, suppose two libraries cataloged photographs of the Eiffel Tower. How do they name ``The Eiffel Tower'' in such a way that they are identified as being about the same thing?

There are two issues here. One is "how are things named" and the obvious answer is "URIs". The second one is "Should there be canonical URIs that can be deduced from what you are looking for"? It is proposed that the answer to the second is no. URIs should be opaque, and perhaps random to avoid collisions. The process of "figuring out the right URI for something" is a type of search/retrieval problem. Instead of squeezing this search/retrieval into a specialized "figure out the URL" task, incorporate it in the standard search framework. Any information used to define a canonical URL can instead be used as metadata on the object, and any knowledge of how to construct the URL can then be turned into a specification of its metadata.

One way that two parties can independently come up with names for a given resource is to use the MD5 hash of a collection of bits. This only applies to static resources that are, in some sense, entirely bits. Still, there are a lot of interesting resources that could be viewed this way, including audio CDs, DVDs, PDF-published works, email addresses, a (reified) RDF triple, and less formally email messages and digital photographs. Dynamic content and non-digital resources, like the Effiel Tower, cannot be named in this way. The nice thing about MD5 URLs is that they provide a canonical naming rule that reduces the odds of getting multiple names for the same object, reducing need for inference about equivalence.

One problem with MD5 sums is that the contents of the URL become immutably linked to the URL itself. Invariant documents have lots of nice features; distribution, cache control, and cache verification become trivial, but on the down side there is no consistent address for the top of tree of a document history. If you want to be able to modify your document after publishing its MD5-sum URL, then you will need other mechanisms to deal with this.

Quite : hence hashing is one-of-many ways of identifying things. It is not distinguished in the external service.

The other problem is when we are using URLs to describe documents and their subcomponents i.e. identifying resources smaller than the atomic document. Doing this with a URL is arguably convenient, in that it permanently binds the smaller object to its containing object, giving you the semantics that if you are looking for the smaller object it is a good subgoal to look for the containing object.

But what if the contained object is inside two distinct objects? Which URL is right? What if someone doesn't know the object is contained? They will give it a third URL.

Consider a DVD archive that contains the theatrical release of "The Lord of the Rings". The URL for this sample asset for the sake of argument is 'http://simile.org/the-lord-of-the-rings-theatrical-release.dvd'.

Now suppose I have created a DVD player that will read metadata describing any movie and use it to modify the way that movie is played back. For example, my DVD player can read metadata describing scenes that depict violence, and remove them during playback of the movie.

Obviously the metadata read by the DVD player will have to include data that identifies the parts of the overal movie that represent the selected content. Using a URL to represent the content is insufficient - we can't create new URL's for every possible subregion of a movie, and even if we did so, such an approach wouldn't help in finding an playing back parts of the movie that do not correspond to that URL.

Naming, as is being described in section 3.2.7, has nothing to do with the URL for the asset. The purpose of naming is to create a linkage between the metadata and the movie subregion.

Stepping out of our example, the purpose of Naming in this document is to represent other assets in ways that URLs cannot. Such linkages are neccessarily specific to the type of data being indexed so they cannot be generalized to a single technology, but that doesn't mean that we can't create a pattern around them.

While using URLs with semantics is one option, an alternative way to specify a particular subpart of the movie is with a blob of RDF eg, there is a resource foo (no semantics) and assertions "foo fragment-of the-lord-of-the-rings", "foo start-offset 300", and "foo end-offset 500". Whatever semantics I intend to place in the URL, I can instead, without any loss of expressive power, place in a blob of RDF statements. This leaves me with URLs containing no semantics at all, which has a consistency I like.

There are many different ways to represent the subgraph in question. You have broken it up into three statements (and an implied statement of the schema type), another implementation might use more statements or fewer. In addition there are many other types of documents that could be named, in whole or in part.

The point of the Naming discussion is to map those statements to their meaning, where the meaning is a subindex into a document. This makes Name a specialization of Class.

The issue here is not so much whether or not URNs are appropriate for each of the names, but rather:

by what mechanism are the names generated and assigned? which of the URNs are URI's, and which are URL's? How can I tell?

Here "A" could be a URL, but if I wish it to be location-independent I may assign a URN and use some mapping service (PURLs, Handles). "A" is a URL - do an http:get
"B" is probably a URN, not useful to attempt to resolve it. I must map to some query on the contents of the graph represented by contents of "A".
"C" could be either a URN or a URL. How do I find the schema? Not sure

3.9 Processing Models

This section does not reflect all the comments made on the mailing list. A web system needs a local model for associating ontologies with data so that higher level queries and access are supported.

This section has useful discussion but as part of the research problem scoping, the discussion could move to a separate note.

Separate up the platform issues from the domain issues.
Fold into the suggested "vocabularies/ontologies" section

Elsewhere there has been discussion of a schema-registry (c.f. KAON). This registry would support registration, modification/version, withdrawing (not necessarily deleting) of vocabularies/ontologies.

There is an problem of associating the right vocabularies with the metadata which is touched on but not highlighted.

One of the promises of the semantic web is that

if person A writes his data in one way, and person B writes her data in another way, as long as they have both used Semantic Web tools, then we can leverage those tools to merge data from A and data from B declaratively i.e. without having to rewrite the software used by A or by B, and without necessitating them to change their individual data sets.

One possible enabling technology for realising this is automated discovery, i.e. some mechanism that a processor can use to automatically configure itself so it can process a document or model. There are a number of different possible processing models for the Semantic Web.

The rest of this section is about generalised discovery.

Schema discovery via namespace processing model

The processor gets a piece of RDF, inspects the namespaces and it tries to retrieve the schema from the namespace. If it can retrieve the schema, it processes it and is able to map the DSpace information into another schema that it is familiar with e.g. Dublin Core.

This processing model is quite controversial because general consensus is that namespaces do not indicate schemas [Jel], [tag], [Braa], [Bri]. Just because a piece of RDF defines a namespace with a URI that uses HTTP, this does not mean that HTTP can be used to retrieve a schema because RDF does not formally require this. If there is nothing at that URI, the only way the processor will determine this is via a timeout which will cause any requests that are invoking the processor to also fail. In addition it is not clear which resource you should have at the HTTP address e.g. an XML Schema, an RDF Schema, an OWL ontology etc. There have also been proposals about how to overcome this e.g.

Use a urn if a namespace just indicates identity, whereas use http if it indicates identity and points to additional resources.
The RDF graph could define ``processing instructions'' that indicate how to process it. CC/PP [ccp], an application of RDF, does this as it provides a method for subgraph inclusion called defaults. Of course, applications defining their own ``processing instructions'' would not be sufficient, as these processing instructions would need to be standardised in order to support automated discovery.

It is possible to higlight this with some other processing models:

Resource directory discovery via namespace processing model

The processor receives a piece of RDF and it inspects the namespace used. It tries to retrieve a document from the HTTP address indicated by the namespace, that indicates all the resources available for this vocabulary e.g. RDF Schema, XML Schema, XForms, XSLT, HTML documentation etc. This solves the problem of needing to know what type resource should be at the namespace URI as we can support many different types of resources. The processor then uses these resources to try to help process the RDF.

For more details of this processing model see [Brab].

Schema discovery via processing instruction processing model

The processor receives a piece of RDF, and it inspects the RDF model for RDF statements using a standardised processing model namespace. These statements give processing instructions about how to the process the model. The processor follows these instructions e.g. retrieves the relevant schemas. It then uses this information to process the RDF as outlined in the previous processing models.

The processing instruction processing model could be used in conjunction with a resource directory to leverage XML for the SW, as if we just add processing instructions to XML we can keep our data in XML, but the processing instruction points at a RDDL document that points to an XSLT stylesheet that converts the XML to RDF/XML, so the data is now SW compatible. Using RDDL it can also retrieve a large bunch of other resources.

Schema discovery via namespace with transport dependence processing model

When the processor receives a piece of RDF, it inspects the namespace used. If the namespace starts with HTTP, this indicates a resource is retrievable from that address. If it starts with another transport, e.g. URN, then it regards the namespace as simply defining identity. In the event of a retrievable resource it retrieves it and uses it to process as necessary.

One of the problems with starting to consider processing models is there is a big overlap between this area and general web architecture issues. For example, one proposed principle of good web design is ``cool URIs don't change'' [BL]. Although this seems good advice for web pages it can create problems when dealing with metadata. Imagine if we create a schema but it contains an error, but we do not find that out until after publication then we cannot fix it because we cannot change the contents of the URI. The only option is to correct and republish all the data and the schema associated with a new namespace. We may also want to update schemas even if they are correct, for example to provide interoperability between the schema and a newer schema. Here the problem is we can not add additional data to the schema once we've created it. Depending on how we see these constraints, we may want to adopt a processing model that uses some form of dereferencing as PURL does:

Schema discovery via dereferenced namespace

The processor recieves a piece of RDF, and inspects the namespace used. It queries this namespace with an intermediate server that stores the dereferences. The server could be identified via the namespace e.g. as in PURL or some other approach could be used. The dereference points to a particular schema, optionally on another server. This server could contain several dated versions of the schema, but the derefence just points to the most up to date one.

Then if we want to update the schema so it has additional information that maps it onto a newly released version of Dublin Core, we can do so because the contents of URIs never change, but the contents of the dereferenced URIs do.

Load schemas on startup processing model

In OWL, the processor loads an OWL ontology that can use includes to load other OWL ontologies. But there is no way to automatically load ontologies on demand, so ontologies have to be explicitly configured. This design decision is deliberate as you can not combine ontologies arbitarily as you need to do consistency checks first. Typically this is done at ontology creation time.

This processing model may be applicable to other processors apart from OWL processors: for example today CC/PP processors load a set of schemas at start-up time. Then when they receive RDF, it makes a best attempt to process it. If they recognise it via the startup schema, they process it. If not, they try to process it but at the end of the day if the schema is not recognised responsiblity passes to the application sitting on the processor. However it is fairly easy to reconfigure the processor to deal with new schemas, it's just a matter of changing some kind of configuration script. This allows whoever is configuring the processor to do some kind of ``quality control'' on the schemas.

One of the things that any web system, Semantic or otherwise, has to cope with is that the web is sufficiently large that it does not work at once. Any system that depends on others also has to take into account that there will be temporary problems. So the ground rules are ontologies may not be perfect and connectivity is not guaranteed so systems do the best they can in the circumstances. There is value in working with information from other systems but it has implications. In RDF, lack of a schema for a namespace does not stop the system doing anything with it. For example if a processor receives a piece of RDF that uses the WebLibraryTerms namespace, but that is not a namespace the system knows about. What does it do? It can decide to read from the namespace URL or it can choose not to. While good style says that the schema should be available at the namespace place indicated, it may not be. However it may be necessary to retrieve the schema to answer certain questions, as they may require inferencing. Therefore the answer to queries on the semantic web are not ``yes'' and ``no'', they are ``yes'' and ``do not know''. There is no global processing model or global consistency. There are local decisions on what to do about things.

Maybe some community using SIMILE does know something about WebLibraryTerms so it can do something useful with this information. The fact the server does not fully understand all the implications of the data is not important. Later, the community can ask the SIMILE system to install WebLibraryTerms so they can do their searches on the server side if it does not automatically read it or log the fact that an unknown namespace was used a lot and the admins have already decided to get it.

So a key question: how often do new schemas change? How often do unknown schemas turn up? If a new schema arises and is important to some community of SIMILE uses, they ask the system to use that schema. It may not happen immediately and it may involve a person doing some configuration, but it does deliver useful value to people in the presence of a less than perfect global Semantic Web.

Specifically for the history store, I would expect it to have a site cache of schema/ontologies, indexed by namespace. If some schema contain errors at a site, then they are not used. A cache is prudent because even if the namespace does reside at its namespace URL, using HTTP, it may be unreachable just at the moment it is needed. Schemas are slow changing things so using a cached copy seems sensible, and this can be a fixed version if the master copy is trivially broken. The use of the cache is a local choice.

If a new schema is encountered, say it has a property foo:articleTitle that is equivalent to dc:title, then until the system uses a rule that these are equivalent it treats them as different.

Is foo:articleTitle really, truly, exactly equivalent to dc:title? It depends. It depends on who is asking, it depends on what they want the information for. A good system admits these alternatives and does the best it can in the circumstances.

In many ways these issues are a consequence of being a large federation, rather than a centrally managed system. Because SIMILE has an ingest and validation process, I hope that is seen by other systems as a high quality source of information. If it is perceived as such it will get used; if it is not seen as such, it will not get used.

3.10 Classification

One important issue is classification, but it has several different axes:

Metadata versus original versus abstract object: Classification has different implications depending on whether we are classifying the original object i.e. declaring the format, the metadata i.e. declaring the schema in use, or the abstract object i.e. classifying the object type.
Explicit versus implicit: Sometimes classifications are explicit in the metadata, other times they may be determined via inference or via inspecting the content that the metadata refers to.
Subjective versus objective: When we classify things, some classifications seem objective whereas others seem subjective although in reality though there is a continuum between the two. For example consider the increasing subjectivity of the following classifications: ``The Matrix'' as a film, as a science fiction film, and as a good film.

Not sure this isn't covered in, for example, the vocabularies discussion.

4 Dissemination

The exact form is domain-specific as it delivers services to domain clients. The Semantic Web platform will need to provide the right facilities to relise the domain services.

Need input from Haystack group and MIT library as to the scope of this area for SIMILE.

Dissemination refers to how the content and the metadata is presented to users and to automated services that may then perform additional operations on this information.

**Figure 2:** Dissemination

4.1 To Humans

Current thinking is to have an ontology describing how metadata is supposed to be viewed e.g. what do people want to see, what is only interesting to agents, what uniquely defines object, etc for more details see [HQK] and [QKH].

4.2 To software / agents

Policy-compliant dissemination

5 Distributed Resources

This section has three classes of issues:

Multiple SIMILE systems, independent, in a federation
Multiple site single SIMILE system
Single site, multi-machine SIMILE system

as well as the client-server relationship which is about the relationship of producer and consumer on the (Semantic) Web.

Another key problem that SIMILE needs to deal with is distributed resources for example:

Distributed content assets when content is spread between a number of repositories or Websites.
Distributed metadata stores where the metadata is spread between a number of different metadata repositories but mechanisms need to exist to query those repositories in a coherent and organized way.
Mixtures of both distributed content and metadata services.

Some relevant issues here are:

How does the distributed system manage its metadata internally?
How does it perform dissemination?
How does distribution occur?
Where does the user functionality reside?
What transport mechanisms are used?
What is the higher-level distribution strategy?
How rich is the presentation given to the user and how is it provided e.g. smart versus thin client?

Note federation has a very specific meaning and implies no overlap between regions with distinct ownership. If you have to send your queries to different databases and reassemble results then that is a different distribution problem. Federation is a specific solution that revolves around knowing where data is located so duplication is not a problem. Any duplicate data is simply cache data. In the more general case, if you do not know where the data you want is, this is a much harder problem. Centralising data is a way of addressing this, or giving the impression of centralised data e.g. Google.

There are several different modes for distribution:

Interactions from client to the SIMILE system.
Interactions between machines making the SIMILE installation at one institution.
Peer-level ingest and publish. Interactions with the system in the form of a SIMILE installation at one institution with a SIMILE installation at another. In this interface each server owns its own data and is not responsible for the data in its peers.
Priveleged peer-level based on one or more of data replication, mirroring, and federation. In this interface each server cooperates to present an interface to a logically combined metadata store.
Client-level data cache. In this interface the client may gain temporary leases on data records before returning them with changes to the server, but the server owns all of the data.
UI View level cache. In this interface the client controller and model run within the application server that hosts the server. Cache control is through the standard web protocols for HTTP caching.

There are a couple of distinctions between using a system like SIMILE through a web browser and using it through a client like Haystack:

Haystack provides different views on data. It separates the data from the view. So in a way it is a possible successor to a web browser.
Haystack brings information closer to the user as it integrates information stored on the local machine with information on the server. HP has also explored this kind of architecture [hpl].

There is an open question here about whether SIMILE should consider distributed resources as they create a whole different scope and open up a host of extra problems. One way of avoiding this would be to devise a simple block-transfer protocol that supports getting all the metadata to a single location and dealing with it there. Since in the centralized scenario everything is hard, perhaps it is better to defer distributed search?

There is some disagreement about whether SIMILE a client server application or whether it should be a service or system that publishes information to the Semantic Web, without necessarily providing ways for the user to interact with that information. This is because the Semantic Web is not a number of closed worlds so information from SIMILE will be reused by other systems, whether portals, client-end applications or something else. This leads to questions like how does SIMILE use Semantic Web information from elsewhere such as dynamic information? The SIMILE client could do such integration or it could be made possible as part of the service architecture within SIMILE. SIMILE could choose just to be a ``leaf node'' in the Semantic Web providing information but not consuming it from other Semantic Web sources. This would still be valuable and should be the focus of initial demonstrators but in the long term it limits the ability to evolve as the way we use information changes.

Just using RDF as a transport format is not really utilizing RDF because the semantics are hidden in the internal processed representation and not necessarily preserved on converting into and out of RDF. As it is the internal semantics that matter you might as well use XML for a transport as you are relying on the converters to maintain the semantics across the Web.
SIMILE is not a standard web architecture application. That would make the functionality of the whole thing one of defining the future uses and building it. It may be as a set of RDF stores, with a variety of services clustering around which transform RDF, some services that use information from SIMILE and other sources in support of some community, and also some services that are concerned with presentation. Web browsers are one way of displaying information.

SIMILE is a semantic web platform.

5.1 Query Performance and Scaling

What factors most affect semi-structured query performance?
What does scaling mean in a distributed semi-structured query system? How should scaling be measured?
What factors most affect scaling of distributed query?
What is the appropriate scope of languages for semi-structured query?
How expressive do queries need to be to cover practical use cases?
What is the relationship between various possible query constructs, performance, and scaling?

Significant area for real systems

Different of scaling on one system from scaling the SIMILE federation.

5.2 Necessary and Sufficient Constructs

How expressive must models be to cover practical, useful, and sale-able use cases?
Which use cases can be adequately covered by fully structured technologies (e.g. XML Databases)?
Which use cases require the simplest levels of the Semantic Web technology stack (RDF, RDFS)?
Which use cases require richer modeling (DAML, WebOnt, Description Logic)?

Not distribution

5.3 Locus of distribution

How do individuals, communities, institutions & enterprises differ in their assignation of value to information? Is there an appropriate locus of distribution for various types of information?
What choices exist for repositories themselves be described, discovered, and opted in/out of the scope of the query? What are the performance and scaling implications of each choice?
What is the relationship between information produced by agents on behalf of the individual, community, institution, or enterprise about a particular asset, and dissemination or transformation services on that asset? What techniques can be used to discover such information? What are the implications of different choices about how to distribute such information across various repositories?

6 User Experience

Not a SIMILE/server area.

Recommend one or more separate documents from the client side and domain

side, including user experience.

There is an issue of vocabulary design/building/reusing. This is self-contained.

A key issue for dealing with semi-structured metadata stores using disparate, evolving ontologies is how to support end-user navigation and interaction in an intuitive way.

6.1 Discovery Aids

When users are creating metadata, generally there are existing ontologies that describe at least a subset of their problem domain. Therefore, one issue is how to assist users in discovering suitable ontologies for marking up metadata. One difficulty here is the complexity of some of the schema or ontology languages. Another difficulty is that are generally located at disparate locations, and sometime are competing with one another i.e. they address similiar domains. Therefore, it may be desirable to have a repository of schemas, that presents them to the user in a simplified way in order to help the user select a suitable schema. For example it may hide the syntax used to express the schema, such as RDFS or OWL from the user and instead present the schema graphically.

6.2 Simplify

Applying complex classification schemes on resources could negatively impact users' ability to search for resources. It is important to hide unnecessary detailS until userS need it. This may be done in several ways:

By providing default behaviors that allow users to carry out typical tasks while minimizing specialized knowledge, e.g. role-specific views on resources.
By using inferences to hide some of the complexities of schemas and vocabularies from users.

6.3 Avoid repetition

There are many tasks that may involve unnecessary repetitions, for example:

If collections have separate search interfaces it may be necessary for a user to repeat the same query multiple times on different systems.
It may be desirable to provide mechanism to allow users to update several records simultaneously as a set rather than individually when performing instance versioning.
If possible, tasks like merging of records or mapping between schemas and vocabularies should be automatic and only require user intervention when absolutely necessary.

6.4 Guide user

Users may like to receive guidance in a number of tasks during different stages of the information lifecycle. One solution is to use discovery aids as outlined above. Another way is to use techniques such as wizards that guide users step-by-step through complex tasks.

6.5 Pool users' expertise

Electronic retailers like Amazon use recommendation systems to assist users and to guide them to resources that they may be interested in. These systems work by analyzing what resources users search for (and in the case of Amazon, purchase), looking for similarities with other uses and then making recommendations based on the items other users have searched for.

There are some limitations with the current versions of such systems. Most notably they have no way for a user to denote the context for their search: therefore on Amazon a user may search for very different items if they are purchasing an item for a relative compared to when they are purchasing items for themselves. Therefore making recommendations based on the entire users history may not be as effective as making recommendations based on recent search terms from the user. Also there are potential privacy issues that need to be addressed when recording user behavior, whether it is occurring with or without their knowledge.

6.6 Policy Expression

6.7 Misc

What user interface paradigms and metaphors are useful to end users in interacting with such systems? In schema creation? In schema discovery and use?
What techniques can reduce the perceived end-user cost of participation in a collaborative information system?
How can a Semantic Web of schemas and metadata be effectively managed?
Given a particular schema, what techniques can be used to identify schemas that are likely related to it?
Given a particular agent or end-user, what techniques can be used to identify schemas that the user is likely to find valuable?
How can schema transformations and relationships be described in query-able form?
Can schema mapping techniques be demonstrated and provide end-user value in a real system?
Are there techniques or best practices (e.g. schema annotation) that can explicitly help balance convergence and specialization?

Bibliography

BL: Tim Berners-Lee.
Cool uris don't change.
http://www.w3.org/Provider/Style/URI.html.
Braa: Tim Bray.
Architectual theses on namespaces and namespace documents.
http://www.textuality.com/tag/Issue8.html.
Brab: Tim Bray.
Resource directory description language.
http://www.textuality.com/xml/rddl2.html.
Bri: Dan Brickley.
Namespace dereferencing.
http://lists.xml.org/archives/xml-dev/200012/msg00680.html.
ccp: Composite capabilities / preferences profiles.
http://www.w3c.org/mobile/ccpp/.
cre: Creating a controlled vocabulary.
http://www.boxesandarrows.com/archives/creating_a_controlled_vocabulary.php.
FLS: Karl Fast, Fred Leise, and Mike Steckel.
What is a controlled vocabulary?
http://www.boxesandarrows.com/archives/what_is_a_controlled_vocabulary.php.
Gru: Thomas Gruber.
What is an ontology?
http://www-ksl.stanford.edu/kst/what-is-an-ontology.html.
hpl: Hp labs technical report hpl-2002-328.
http://www.hpl.hp.com/techreports/2002/HPL-2002-328.html.
HQK: David F. Huynh, Dennis Quan, and David R. Karger.
User interaction experience for Semantic Web information.
http://haystack.lcs.mit.edu/papers/www2003-ui.pdf.
ims: IMS Instructional Media Services.
http://www.imsglobal.org/.
Jel: Rick Jelliffe.
Why doesn't this solve the namespace problem.
http://lists.xml.org/archives/xml-dev/200012/msg00741.html.
MAR: MARC Machine Readable Cataloguing.
http://www.loc.gov/marc/.
mod: Metadata Object Description Scheme.
http://www.loc.gov/standards/mods/.
NM: Natalya F. Noy and Deborah L. McGuinness.
Ontology development 101: A guide to creating your first ontology.
http://protege.stanford.edu/publications/ontology_development/ontology101-noy-mcguinness.html.
QKH: Dennis Quan, David R. Karger, and David F. Huynh.
RDF authoring environments for end users.
http://haystack.lcs.mit.edu/papers/swfat2003.pdf.
rdf: RDF primer: How to interpret schema.
http://www.w3.org/TR/rdf-primer/#interpretingschema.
tag: What should a namespace document look like?
http://www.w3.org/2001/tag/ilist#namespaceDocument-8.
vic: Victorian electronic records strategy.
http://www.prov.vic.gov.au/vers/published/final/finala6.pdf.

marbut 2003-06-16