Andy's comments are marked thus.
These comments are part of the process of preparing the "relevant technologies" document for the technical plenary 23-24 July 2003.
Some structure might help the plenary discussions:
There is a Semantic Web platform layer which covers:
There is a Content Layer which covers:
There is a Domain Layer (Application Layer?) which covers:
There is a Client Layer which covers what SIMILE clients do. The server is presumably rather neutral to the nature of the clients.
In looking through the document, I can identify some significant areas for prototyping and investigation:
As to (2), one approach is to use a continuous prototyping development style, whereby each cycle of the platform development implements just enough to get some usage working. This is a style from "Extreme Programming". It avoids the "mega-design" effect whereby the platform is expected to do everything. Instead, the objective is to get something working, however small, so that feedback from the components on top of the platform is based on a concrete system, not a paper design. Each cycle should be short.
Mick Bass mailto:mick.bass@hp.com
Mark H. Butler mailto:mark-h.butler@hp.com
June 16, 2003
This isn't about relevant technologies so much as about the problem space. For many huge areas, SIMILE's approach will be to survey existing techniques.
Missing:
The SIMILE team, working with the SIMILE PIs, has identified a number of motivating problems for SIMILE. These motivating problems are summarized in the mindmap shown in Figure 1 and will be detailed in more depth in the subsequent text.
Currently one of the roles of DSpace, and hence SIMILE, is to act as a content repository. Key problems here include how to leverage that content better:
Content Augmentation : subcase of content transformation
I don't see the difference here between content change and metadata change.
Metadata contains information about content, either from human input or from extraction or generation processes so key problems around metadata include:
This is a good example of where it is difficult to differentiate between content and metadata. Some metadata is tightly bound to the content it refers to, even embedded in the content bitstream itself (e.g. Adobe's XMP).
Good example of an external (non-core) service that fits into a metadata generation service model, whereby plugin services can add/refine metadata from the base inputs. Other examples include the augmentation/transformation of metadata and content.
May need something about provenance of metadata: this would require support from the base platform, or at least an argument that the base platform is sufficient.
Some reorganisation might help here:
One of the drivers for SIMILE is coping with schema diversity, which arises from differences in the intended uses of instance data. For example, consider the differences between schemas used by services or agents to describe services that can operate on content or metadata (transforming or enhancing it in some way) and preservation schemas that describe how digital resources are preserved, i.e. their component parts, their inter-relationships and the changes they have undergone. Other possible applications of schemas include technical schemas, presentation schemas and policy schemas describing the types of operations that can be performed on the content and the metadata.
Another driver of schema diversity is community-driven forking, i.e. where different schemas have the same basic purpose but have evolved within, or serve, different communities. For example, both BibTeX and Dublin Core are applied to a similar subject area, but they differ because these standards are aimed at different applications. Other examples here include IMS [ims], MARC [MAR] and MODS [mod].
There was going to be a comparison of various metadata standards. A brief comparison here, not just links, would be a useful input into scoping SIMILE work in the metadata and vocabulary areas.
One key task involved in the creation of metadata is creating schemas and ontologies i.e. formalisms that capture that metadata. There are two axes that define metadata: the schema in use i.e. the metadata structure and the controlled vocabularies in use. Sometimes the schema and the vocabularies are controlled by a single standard but in other instances they evolve independently. Schemas have a specific lifecycle with distinct stages:
See comment in 3.1
This is the vocabulary lifecycle. The metadata and content lifecycle is covered by the history store; can that be reused here in any way?
First, as already noted, Semantic Web languages like RDFS and OWL enable you to describe the classes and properties that make up some vocabulary and the relationships between them, rather than syntactically validate a record. Thus they are not really schema languages but rather vocabulary definition languages. You can do data validation against them, but you have flexibility in how you do that. A good explanation of this is given in the RDF primer [rdf]. This allows you to describe the features of your vocabulary and yet separate that from the policies for how data is to be validated against that vocabulary, e.g. whether to allow partial records.
Second, these languages tend to be property-centric. This means you can add a definition of a new property without necessarily invalidating any of your current property or class descriptions. Contrast this with, say, object-oriented languages, where once you have described the slots on an object you cannot add another slot without creating a new subclass.
Thirdly, types (i.e. classes) in RDF do not need to be exclusive. That means that a resource can be an instance of several overlapping classes at the same time; in particular it can be an instance of both a 1.0 version of some class and a 1.1 version of the class at the same time.
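A minimal sketch of the last two points, in Python using rdflib (the ex: namespace and class/property names are invented for illustration): a new property is added without touching the existing class definitions, and a single resource is typed as an instance of two versions of a class at once.

```python
from rdflib import Graph, Namespace, Literal, RDF, RDFS

EX = Namespace("http://example.org/vocab#")   # hypothetical vocabulary namespace
g = Graph()
g.bind("ex", EX)

# Original vocabulary: a class and a property.
g.add((EX.Document, RDF.type, RDFS.Class))
g.add((EX.title, RDF.type, RDF.Property))
g.add((EX.title, RDFS.domain, EX.Document))

# Property-centric extension: a new property is just another statement;
# nothing already said about ex:Document or ex:title has to change.
g.add((EX.abstract, RDF.type, RDF.Property))
g.add((EX.abstract, RDFS.domain, EX.Document))

# Non-exclusive types: one resource is an instance of both the 1.0 and
# 1.1 versions of a class at the same time.
doc = EX.item42
g.add((doc, RDF.type, EX.Document_v1_0))
g.add((doc, RDF.type, EX.Document_v1_1))
g.add((doc, EX.title, Literal("An example record")))

print(g.serialize(format="turtle"))
```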
The terms "open" schema is confusing:
However, not all the issues around open schemas are currently solved. For example, extending instance data without providing a concrete schema is often undesirable and, as the section on processing models highlights, mechanisms for finding schemas for a given piece of instance data are not yet standardised.
One of the key technologies of the Semantic Web is using either schema languages or ontology languages to perform inferencing, i.e. to automatically deduce additional assertions from existing assertions. A basic form of inferencing just deals with class and property hierarchies as defined by RDFS, i.e. relationships with transitive closure. For example, we might define that the class dolphin is a subclass of the class mammal. If the dolphin class also had a property indicating that dolphins can swim, then we could search for resources on swimming mammals and identify dolphin as meeting our criteria. Using inferencing in this way is referred to as inferencing search, and it can be used to interoperate between different schemas: imagine we have several heterogeneous schemas and a standardized schema. If we can define class and property relationships between the heterogeneous schemas and the standardized schema, then we can use inferencing search to search metadata in the heterogeneous formats using terms from the standardized schema.
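As an illustration of the dolphin/mammal example, here is a sketch in Python with rdflib: the subclass relationship is stated once, and a SPARQL property path (rdfs:subClassOf*) gives the transitive-closure behaviour of a basic inferencing search without a full reasoner. The namespace and the ex:locomotion property are made up for the example.

```python
from rdflib import Graph, Namespace, Literal, RDF, RDFS

EX = Namespace("http://example.org/zoo#")   # hypothetical namespace
g = Graph()
g.bind("ex", EX)

# Class hierarchy and a property on the subclass.
g.add((EX.Dolphin, RDFS.subClassOf, EX.Mammal))
g.add((EX.Dolphin, EX.locomotion, Literal("swims")))

# Instance data.
g.add((EX.flipper, RDF.type, EX.Dolphin))

# "Inferencing search": find instances of Mammal, following the
# subclass hierarchy transitively rather than requiring an explicit type.
query = """
SELECT ?x WHERE {
  ?x rdf:type/rdfs:subClassOf* ex:Mammal .
}
"""
for row in g.query(query, initNs={"rdf": RDF, "rdfs": RDFS, "ex": EX}):
    print(row.x)   # -> http://example.org/zoo#flipper
```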
Ontology languages like OWL allow a richer set of inferences than RDFS, based on cardinality constraints, transitivity relationships, equivalence of classes and individuals, functional relationships, class combinations using unions and intersections, and disjoint relationships between classes. Of these relationships, equivalence is particularly important for interoperability. There are three types of equivalence that are relevant:
Equivalence also has transitive closure, so it can be considered part of basic inferencing search. The other relationships do not necessarily have transitive closure; therefore inferencing using these relationships may be NP-hard or even undecidable in terms of computational complexity. Hence these relationships may not be considered part of the basic inferencing search capability.
There are two types of issues that can occur for both schemas and vocabularies: the first is how to deal with different versions of the schema and vocabulary and the second is how to map between different schemas and vocabularies. Here are some examples:
The identified equivalence relationships are a subset of OWL.
Semantic Validation of metadata involves checking whether it has a structure that conforms to the schema in use and checking whether it uses values that correspond to the particular data types or controlled vocabularies used.
In addition to checking whether the structure conforms to the schema, there may be other possible validation rules: for example, instances of a certain class may be required to have certain properties, or if an instance has a certain property there may be restrictions on the other properties it can have, etc. Such validation rules can either be enforced when metadata is entered into the system or during an ingest stage if entry has occurred elsewhere.
It is desirable that Semantic Validation is performed on human-entered data to guard against errors and inconsistencies. However, there are limits to the type of validation that can be performed: for example, with a controlled vocabulary we can validate that a property value conforms to the vocabulary, but we cannot validate that it accurately reflects the real world. Therefore Semantic Validation should be performed on a ``best-effort'' basis.
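A best-effort check of the structural kind described above could look like the following sketch (Python/rdflib; the required-property table, class and property names are invented for illustration, not part of any SIMILE design).

```python
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/vocab#")   # hypothetical namespace

# Locally configured policy: which properties an instance of a class must carry.
REQUIRED = {
    EX.Document: [EX.title, EX.creator],
}

def validate(g):
    """Return a list of human-readable problems; an empty list means no errors found."""
    problems = []
    for cls, required_props in REQUIRED.items():
        for instance in g.subjects(RDF.type, cls):
            for prop in required_props:
                if g.value(instance, prop) is None:
                    problems.append(f"{instance} is missing required property {prop}")
    return problems

# Usage: validate at ingest time and report, rather than reject outright (best effort).
g = Graph()
g.add((EX.item1, RDF.type, EX.Document))
print(validate(g))   # -> item1 is missing ex:title and ex:creator
```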
When different information sources are merged, whether they are different library catalogues or contact information from phones, PDAs or Outlook, there is always a danger of duplicate or conflicting information. There are two problems here: first, how to identify the records to be merged, and second, how to merge heterogeneous information sources.
In the second problem it is possible to distinguish a number of different situations: firstly, you may have two very different pieces of information about the same object, for example when two different schemas have been used to describe the same piece of information. Secondly, you may have two very similar pieces of information, for example different versions of the same instance metadata. In either case you may have duplicate information, which it may be possible to resolve automatically, e.g. when one record is older than the other. Other times it will require human intervention to determine the correct merged record.
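For the second problem, a very simple automatic resolution rule ("prefer the newer record, flag genuine conflicts for a human") might be sketched as follows; the record format and field names are hypothetical.

```python
from datetime import datetime

def merge_records(a, b):
    """Merge two records describing the same object.

    Returns the merged record plus a list of fields that need human review
    because the values conflict and neither record is clearly newer.
    """
    newer, older = (a, b) if a["modified"] >= b["modified"] else (b, a)
    merged = dict(older)
    conflicts = []
    for key, value in newer.items():
        if key in older and older[key] != value and newer["modified"] == older["modified"]:
            conflicts.append(key)        # same age, different values: needs a human
        merged[key] = value              # otherwise the newer value wins
    return merged, conflicts

a = {"title": "Dspace paper", "modified": datetime(2003, 5, 1)}
b = {"title": "DSpace paper", "creator": "Smith", "modified": datetime(2003, 6, 1)}
print(merge_records(a, b))   # newer title wins, creator is kept, no conflicts
```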
Need a (domain-specific) survey/catalog of existing and emerging approaches (handles, URNs/DDDS, other digital object schemes) in SIMILE's area of use. A new, general mechanism is outside the scope of SIMILE.
Semantic web platform: separate out what a semantic web platform needs to provide for the domain-specific approach(es) chosen. Validate the URI/resource model; validate the identifying-properties approaches.
This section has useful discussion but as part of the research problem scoping, the discussion could move to a separate note.
How much content naming
Naming is the general problem of being able to refer to a specific resource. Here are some examples:
There are two issues here. One is "how are things named?" and the obvious answer is "URIs". The second is "should there be canonical URIs that can be deduced from what you are looking for?" It is proposed that the answer to the second is no. URIs should be opaque, and perhaps random to avoid collisions. The process of "figuring out the right URI for something" is a type of search/retrieval problem. Instead of squeezing this search/retrieval into a specialized "figure out the URI" task, incorporate it in the standard search framework. Any information used to define a canonical URI can instead be used as metadata on the object, and any knowledge of how to construct the URI can then be turned into a specification of its metadata.
One way that two parties can independently come up with names for a given resource is to use the MD5 hash of a collection of bits. This only applies to static resources that are, in some sense, entirely bits. Still, there are a lot of interesting resources that could be viewed this way, including audio CDs, DVDs, PDF-published works, email addresses, a (reified) RDF triple, and, less formally, email messages and digital photographs. Dynamic content and non-digital resources, like the Eiffel Tower, cannot be named in this way. The nice thing about MD5 URLs is that they provide a canonical naming rule that reduces the odds of getting multiple names for the same object, reducing the need for inference about equivalence.
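As a sketch of the idea (Python standard library only; the urn:md5: scheme shown is illustrative, not a registered URN namespace): two parties hashing the same bits arrive at the same name independently.

```python
import hashlib

def md5_name(data):
    """Derive a content-based name from a static collection of bits.

    Any party holding the same bits computes the same name, so no
    coordination or registry is needed.  The 'urn:md5:' prefix is
    illustrative only.
    """
    return "urn:md5:" + hashlib.md5(data).hexdigest()

with open("thesis.pdf", "rb") as f:        # any static bitstream
    print(md5_name(f.read()))
```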
One problem with MD5 sums is that the contents of the URL become immutably linked to the URL itself. Invariant documents have lots of nice features (distribution, cache control, and cache verification become trivial), but on the down side there is no consistent address for the top of the tree of a document's history. If you want to be able to modify your document after publishing its MD5-sum URL, then you will need other mechanisms to deal with this.
Quite: hence hashing is one of many ways of identifying things. It is not distinguished in the external service.
The other problem is when we are using URLs to describe documents and their subcomponents, i.e. identifying resources smaller than the atomic document. Doing this with a URL is arguably convenient, in that it permanently binds the smaller object to its containing object, giving you the semantics that if you are looking for the smaller object, it is a good subgoal to look for the containing object.
But what if the contained object is inside two distinct objects? Which URL is right? What if someone doesn't know the object is contained? They will give it a third URL.
Consider a DVD archive that contains the theatrical release of "The Lord of the Rings". For the sake of argument, the URL for this sample asset is 'http://simile.org/the-lord-of-the-rings-theatrical-release.dvd'.
Now suppose I have created a DVD player that will read metadata describing any movie and use it to modify the way that movie is played back. For example, my DVD player can read metadata describing scenes that depict violence, and remove them during playback of the movie.
Obviously the metadata read by the DVD player will have to include data that identifies the parts of the overall movie that represent the selected content. Using a URL to represent the content is insufficient: we can't create new URLs for every possible subregion of a movie, and even if we did, such an approach wouldn't help in finding and playing back parts of the movie that do not correspond to that URL.
Naming, as is being described in section 3.2.7, has nothing to do with the URL for the asset. The purpose of naming is to create a linkage between the metadata and the movie subregion.
Stepping out of our example, the purpose of Naming in this document is to represent other assets in ways that URLs cannot. Such linkages are necessarily specific to the type of data being indexed, so they cannot be generalized to a single technology, but that doesn't mean that we can't create a pattern around them.
While using URLs with semantics is one option, an alternative way to specify a particular subpart of the movie is with a blob of RDF, e.g. there is a resource foo (no semantics) and assertions "foo fragment-of the-lord-of-the-rings", "foo start-offset 300", and "foo end-offset 500". Whatever semantics I intend to place in the URL, I can instead, without any loss of expressive power, place in a blob of RDF statements. This leaves me with URLs containing no semantics at all, which has a consistency I like.
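The blob of RDF suggested above might look like this sketch in Python/rdflib; the fragment-of, start-offset and end-offset property names and the namespace are invented for the example.

```python
from rdflib import Graph, Namespace, URIRef, Literal, BNode

EX = Namespace("http://example.org/naming#")   # hypothetical vocabulary
movie = URIRef("http://simile.org/the-lord-of-the-rings-theatrical-release.dvd")

g = Graph()
g.bind("ex", EX)

# 'foo' carries no semantics of its own; the three statements do the work.
foo = BNode()
g.add((foo, EX["fragment-of"], movie))
g.add((foo, EX["start-offset"], Literal(300)))
g.add((foo, EX["end-offset"], Literal(500)))

print(g.serialize(format="turtle"))
```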
There are many different ways to represent the subgraph in question. You have broken it up into three statements (and an implied statement of the schema type), another implementation might use more statements or fewer. In addition there are many other types of documents that could be named, in whole or in part.
The point of the Naming discussion is to map those statements to their meaning, where the meaning is a subindex into a document. This makes Name a specialization of Class.
The issue here is not so much whether or not URNs are appropriate for each of the names, but rather:
By what mechanism are the names generated and assigned? Which of the URNs are URIs, and which are URLs? How can I tell?
This section does not reflect all the comments made on the mailing list. A web system needs a local model for associating ontologies with data so that higher level queries and access are supported.
This section has useful discussion but as part of the research problem scoping, the discussion could move to a separate note.
Elsewhere there has been discussion of a schema registry (c.f. KAON). This registry would support registration, modification/versioning, and withdrawal (not necessarily deletion) of vocabularies/ontologies.
There is a problem of associating the right vocabularies with the metadata, which is touched on but not highlighted.
One of the promises of the semantic web is that
if person A writes his data in one way, and person B writes her data in another way, then as long as they have both used Semantic Web tools, we can leverage those tools to merge data from A and data from B declaratively, i.e. without having to rewrite the software used by A or by B, and without requiring them to change their individual data sets. One possible enabling technology for realising this is automated discovery, i.e. some mechanism that a processor can use to automatically configure itself so it can process a document or model. There are a number of different possible processing models for the Semantic Web.
The rest of this section is about generalised discovery.
The processor gets a piece of RDF, inspects the namespaces and tries to retrieve the schema from the namespace. If it can retrieve the schema, it processes it and is able to map the DSpace information into another schema it is familiar with, e.g. Dublin Core.
This processing model is quite controversial because the general consensus is that namespaces do not indicate schemas [Jel], [tag], [Braa], [Bri]. Just because a piece of RDF defines a namespace with a URI that uses HTTP, this does not mean that HTTP can be used to retrieve a schema, because RDF does not formally require this. If there is nothing at that URI, the only way the processor will determine this is via a timeout, which will cause any requests invoking the processor to also fail. In addition it is not clear which resource you should have at the HTTP address, e.g. an XML Schema, an RDF Schema, an OWL ontology etc. There have also been proposals for how to overcome this, e.g. use urn if a namespace just indicates identity, whereas use http if it indicates identity and points to additional resources.
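A sketch of what "schema discovery via namespace" looks like in practice, including the failure mode described above: the fetch is wrapped in a short timeout so that a namespace with nothing behind it does not hang the processor (Python, rdflib plus the standard library; the namespace URI is illustrative).

```python
import urllib.request
from rdflib import Graph

def fetch_schema(namespace_uri, timeout_seconds=5.0):
    """Try to dereference a namespace URI and parse whatever is there as RDF.

    Returns a Graph on success, or None if there is nothing retrievable,
    the request times out, or the document is not parseable RDF.
    """
    try:
        with urllib.request.urlopen(namespace_uri, timeout=timeout_seconds) as resp:
            data = resp.read()
        schema = Graph()
        schema.parse(data=data, format="xml")   # RDF/XML assumed; nothing guarantees this
        return schema
    except Exception:
        return None

schema = fetch_schema("http://example.org/weblibraryterms#")   # hypothetical namespace
if schema is None:
    print("no schema retrievable from the namespace; proceed without it")
```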
It is possible to highlight this with some other processing models:
Resource directory discovery via namespace processing model
The processor receives a piece of RDF and inspects the namespace used. It tries to retrieve a document from the HTTP address indicated by the namespace that lists all the resources available for this vocabulary, e.g. RDF Schema, XML Schema, XForms, XSLT, HTML documentation etc. This solves the problem of needing to know what type of resource should be at the namespace URI, as we can support many different types of resources. The processor then uses these resources to try to help process the RDF.
For more details of this processing model see [Brab].
Schema discovery via processing instruction processing model
The processor receives a piece of RDF and inspects the RDF model for RDF statements using a standardised processing-model namespace. These statements give processing instructions about how to process the model. The processor follows these instructions, e.g. retrieves the relevant schemas. It then uses this information to process the RDF as outlined in the previous processing models.
The processing instruction processing model could be used in conjunction with a resource directory to leverage XML for the Semantic Web: if we just add a processing instruction to an XML document, we can keep our data in XML, but the processing instruction points at an RDDL document that points to an XSLT stylesheet that converts the XML to RDF/XML, so the data is now Semantic Web compatible. Using RDDL it can also retrieve a large number of other resources.
Schema discovery via namespace with transport dependence processing model
When the processor receives a piece of RDF, it inspects the namespace used. If the namespace starts with http, this indicates a resource is retrievable from that address. If it starts with another scheme, e.g. urn, then it regards the namespace as simply defining identity. In the event of a retrievable resource, it retrieves it and uses it to process as necessary.
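A sketch of this transport-dependent rule (Python, reusing the hypothetical fetch_schema helper from the earlier sketch): the URI scheme decides whether the namespace is only an identifier or also a pointer to a retrievable resource.

```python
from urllib.parse import urlparse

def resolve_namespace(namespace_uri):
    """Apply the transport-dependent rule to a namespace URI.

    http(s) namespaces are treated as retrievable; anything else (urn:, etc.)
    is treated purely as an identifier.
    """
    scheme = urlparse(namespace_uri).scheme
    if scheme in ("http", "https"):
        return fetch_schema(namespace_uri)    # defined in the earlier sketch
    return None                               # identity only, nothing to retrieve

resolve_namespace("urn:example:weblibraryterms")          # identity only
resolve_namespace("http://example.org/weblibraryterms#")  # attempt retrieval
```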
One of the problems with starting to consider processing models is that there is a big overlap between this area and general web architecture issues. For example, one proposed principle of good web design is ``cool URIs don't change'' [BL]. Although this seems good advice for web pages, it can create problems when dealing with metadata. Imagine we create a schema that contains an error, but we do not find that out until after publication: we cannot fix it because we cannot change the contents of the URI. The only option is to correct and republish the schema, and all the data associated with it, under a new namespace. We may also want to update schemas even if they are correct, for example to provide interoperability between the schema and a newer schema. Here the problem is that we cannot add additional data to the schema once we've created it. Depending on how we see these constraints, we may want to adopt a processing model that uses some form of dereferencing, as PURL does:
Schema discovery via dereferenced namespace
The processor receives a piece of RDF and inspects the namespace used. It queries this namespace via an intermediate server that stores the dereferences. The server could be identified via the namespace, e.g. as in PURL, or some other approach could be used. The dereference points to a particular schema, optionally on another server. This server could contain several dated versions of the schema, but the dereference just points to the most up-to-date one.
Then, if we want to update the schema so it has additional information that maps it onto a newly released version of Dublin Core, we can do so, because the contents of the URIs never change but the contents of the dereferenced URIs do.
Load schemas on startup processing model
In OWL, the processor loads an OWL ontology that can use imports (owl:imports) to load other OWL ontologies. But there is no way to automatically load ontologies on demand, so ontologies have to be explicitly configured. This design decision is deliberate, as you cannot combine ontologies arbitrarily: you need to do consistency checks first. Typically this is done at ontology creation time.
This processing model may be applicable to processors other than OWL processors: for example, today CC/PP processors load a set of schemas at start-up time. Then, when a processor receives RDF, it makes a best attempt to process it. If it recognises the RDF via the start-up schemas, it processes it. If not, it still tries to process it, but ultimately, if the schema is not recognised, responsibility passes to the application sitting on top of the processor. However, it is fairly easy to reconfigure the processor to deal with new schemas; it is just a matter of changing some kind of configuration script. This allows whoever is configuring the processor to do some kind of ``quality control'' on the schemas.
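A sketch of the "load schemas on startup" model: the set of known schemas is listed in a configuration file, loaded once at start-up, and anything arriving under an unlisted namespace is handed on to the application (Python/rdflib; the configuration format and file names are invented).

```python
import json
from rdflib import Graph

def load_startup_schemas(config_path):
    """Load the configured schemas once at start-up.

    The config is assumed to map namespace URIs to local schema files, e.g.
    {"http://purl.org/dc/elements/1.1/": "schemas/dcelements.rdf"}.
    """
    with open(config_path) as f:
        config = json.load(f)
    known = {}
    for namespace, path in config.items():
        schema = Graph()
        schema.parse(path)                 # format guessed from the file extension
        known[namespace] = schema
    return known

known_schemas = load_startup_schemas("processor-config.json")

def handle(namespace, data):
    if namespace in known_schemas:
        ...   # process with the start-up schema
    else:
        ...   # best effort; responsibility passes to the application
```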
One of the things that any web system, Semantic or otherwise, has to cope with is that the web is sufficiently large that not all of it works at once. Any system that depends on others also has to take into account that there will be temporary problems. So the ground rules are: ontologies may not be perfect and connectivity is not guaranteed, so systems do the best they can in the circumstances. There is value in working with information from other systems, but it has implications. In RDF, lack of a schema for a namespace does not stop the system doing anything with it. For example, suppose a processor receives a piece of RDF that uses the WebLibraryTerms namespace, but that is not a namespace the system knows about. What does it do? It can decide to read from the namespace URL or it can choose not to. While good style says that the schema should be available at the place indicated by the namespace, it may not be. However, it may be necessary to retrieve the schema to answer certain questions, as they may require inferencing. Therefore the answers to queries on the Semantic Web are not ``yes'' and ``no'', they are ``yes'' and ``do not know''. There is no global processing model or global consistency. There are local decisions on what to do about things.
Maybe some community using SIMILE does know something about WebLibraryTerms, so it can do something useful with this information. The fact that the server does not fully understand all the implications of the data is not important. Later, the community can ask the SIMILE system to install WebLibraryTerms so that their searches can be done on the server side, if the system does not read it automatically, or if it has logged that an unknown namespace was used a lot and the admins have already decided to get it.
So a key question: how often do new schemas change? How often do unknown schemas turn up? If a new schema arises and is important to some community of SIMILE users, they ask the system to use that schema. It may not happen immediately and it may involve a person doing some configuration, but it does deliver useful value to people in the presence of a less-than-perfect global Semantic Web.
Specifically for the history store, I would expect it to have a site cache of schemas/ontologies, indexed by namespace. If some schemas contain errors, then at that site they are not used. A cache is prudent because even if the schema does reside at its namespace URL, over HTTP it may be unreachable just at the moment it is needed. Schemas are slow-changing things, so using a cached copy seems sensible, and this can be a fixed version if the master copy is trivially broken. The use of the cache is a local choice.
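A sketch of such a site cache (Python, reusing the hypothetical fetch_schema helper from the earlier sketch): schemas are looked up by namespace, fetched at most once, and a locally pinned copy overrides a broken or unreachable master.

```python
from rdflib import Graph

class SchemaCache:
    """Site-local cache of schemas/ontologies, indexed by namespace URI.

    Local overrides ('pinned' copies) take precedence, so a site can keep a
    fixed, corrected version even if the copy at the namespace URL is broken
    or unreachable.  Which namespaces to pin is a local choice.
    """
    def __init__(self, pinned=None):
        self._pinned = pinned or {}    # namespace -> local file path
        self._cache = {}               # namespace -> Graph (or None if unavailable)

    def get(self, namespace):
        if namespace in self._cache:
            return self._cache[namespace]
        if namespace in self._pinned:
            schema = Graph()
            schema.parse(self._pinned[namespace])
        else:
            schema = fetch_schema(namespace)   # from the earlier sketch; may return None
        self._cache[namespace] = schema
        return schema

cache = SchemaCache(pinned={"http://example.org/weblibraryterms#": "schemas/weblibraryterms.rdf"})
```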
If a new schema is encountered, say it has a property foo:articleTitle that is equivalent to dc:title, then until the system uses a rule that these are equivalent it treats them as different.
Is foo:articleTitle really, truly, exactly equivalent to dc:title? It depends. It depends on who is asking, it depends on what they want the information for. A good system admits these alternatives and does the best it can in the circumstances.
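A sketch of what "using a rule that these are equivalent" could amount to in practice (Python/rdflib): an owl:equivalentProperty statement is read as a locally trusted mapping, and values are copied across so that queries phrased against dc:title also see foo:articleTitle data. Whether to trust the mapping at all remains a local decision, as argued above; the foo: namespace is hypothetical.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import DC, OWL

FOO = Namespace("http://example.org/foo#")   # hypothetical new schema

g = Graph()
g.add((FOO.articleTitle, OWL.equivalentProperty, DC.title))
g.add((FOO.article1, FOO.articleTitle, Literal("Schema evolution in practice")))

def apply_equivalences(g):
    """Copy values across locally accepted owl:equivalentProperty pairs (both directions)."""
    for p1, _, p2 in g.triples((None, OWL.equivalentProperty, None)):
        for s, _, o in list(g.triples((None, p1, None))):
            g.add((s, p2, o))
        for s, _, o in list(g.triples((None, p2, None))):
            g.add((s, p1, o))

apply_equivalences(g)
print(g.value(FOO.article1, DC.title))   # -> "Schema evolution in practice"
```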
In many ways these issues are a consequence of being a large federation, rather than a centrally managed system. Because SIMILE has an ingest and validation process, I hope that is seen by other systems as a high quality source of information. If it is perceived as such it will get used; if it is not seen as such, it will not get used.
One important issue is classification, but it has several different axes:
Not sure this isn't covered in, for example, the vocabularies discussion.
The exact form is domain-specific, as it delivers services to domain clients. The Semantic Web platform will need to provide the right facilities to realise the domain services.
Need input from Haystack group and MIT library as to the scope of this area for SIMILE.
Dissemination refers to how the content and the metadata are presented to users and to automated services that may then perform additional operations on this information.
Current thinking is to have an ontology describing how metadata is supposed to be viewed, e.g. what do people want to see, what is only interesting to agents, what uniquely defines an object, etc. For more details see [HQK] and [QKH].
Policy-compliant dissemination
This section has three classes of issues:
as well as the client-server relationship which is about the relationship of producer and consumer on the (Semantic) Web.
Another key problem that SIMILE needs to deal with is distributed resources for example:
Some relevant issues here are:
Note that federation has a very specific meaning and implies no overlap between regions with distinct ownership. If you have to send your queries to different databases and reassemble the results, then that is a different distribution problem. Federation is a specific solution that revolves around knowing where data is located, so duplication is not a problem: any duplicate data is simply cache data. In the more general case, where you do not know where the data you want is, the problem is much harder. Centralising data is a way of addressing this, or at least giving the impression of centralised data, e.g. Google.
There are several different modes for distribution:
There are a couple of distinctions between using a system like SIMILE through a web browser and using it through a client like Haystack:
There is an open question here about whether SIMILE should consider distributed resources, as they create a whole different scope and open up a host of extra problems. One way of avoiding this would be to devise a simple block-transfer protocol that supports getting all the metadata to a single location and dealing with it there. Since even the centralized scenario is hard, perhaps it is better to defer distributed search?
There is some disagreement about whether SIMILE is a client-server application or whether it should be a service or system that publishes information to the Semantic Web, without necessarily providing ways for the user to interact with that information. This is because the Semantic Web is not a number of closed worlds, so information from SIMILE will be reused by other systems, whether portals, client-end applications or something else. This leads to questions like: how does SIMILE use Semantic Web information from elsewhere, such as dynamic information? The SIMILE client could do such integration, or it could be made possible as part of the service architecture within SIMILE. SIMILE could choose just to be a ``leaf node'' in the Semantic Web, providing information but not consuming it from other Semantic Web sources. This would still be valuable and should be the focus of initial demonstrators, but in the long term it limits the ability to evolve as the way we use information changes.
SIMILE is a semantic web platform.
Significant area for real systems
Scaling on one system is different from scaling the SIMILE federation.
Not distribution
Recommend one or more separate documents from the client side and domain side, including user experience.
There is an issue of vocabulary design/building/reusing. This is self-contained.
A key issue for dealing with semi-structured metadata stores using disparate, evolving ontologies is how to support end-user navigation and interaction in an intuitive way.
When users are creating metadata, generally there are existing ontologies that describe at least a subset of their problem domain. Therefore, one issue is how to assist users in discovering suitable ontologies for marking up metadata. One difficulty here is the complexity of some of the schema or ontology languages. Another difficulty is that ontologies are generally located at disparate locations, and sometimes compete with one another, i.e. they address similar domains. Therefore, it may be desirable to have a repository of schemas that presents them to the user in a simplified way in order to help the user select a suitable schema. For example, it may hide the syntax used to express the schema, such as RDFS or OWL, from the user and instead present the schema graphically.
Applying complex classification schemes to resources could negatively impact users' ability to search for resources. It is important to hide unnecessary detail until users need it. This may be done in several ways:
There are many tasks that may involve unnecessary repetitions, for example:
Users may like to receive guidance in a number of tasks during different stages of the information lifecycle. One solution is to use discovery aids as outlined above. Another way is to use techniques such as wizards that guide users step-by-step through complex tasks.
Electronic retailers like Amazon use recommendation systems to assist users and to guide them to resources that they may be interested in. These systems work by analyzing what resources users search for (and in the case of Amazon, purchase), looking for similarities with other users and then making recommendations based on the items those other users have searched for.
There are some limitations with the current versions of such systems. Most notably, they have no way for a user to denote the context for their search: on Amazon, for example, a user may search for very different items when purchasing for a relative compared to when purchasing for themselves. Therefore making recommendations based on the entire user history may not be as effective as making recommendations based on the user's recent search terms. There are also potential privacy issues that need to be addressed when recording user behavior, whether it occurs with or without the user's knowledge.