- From: Peter Ansell <ansell.peter@gmail.com>
- Date: Thu, 25 Oct 2007 12:11:52 +1000
- To: "Jonathan Rees" <jar@creativecommons.org>
- Cc: public-semweb-lifesci@w3.org, p.roe@qut.edu.au, j.hogan@qut.edu.au
On 24/10/2007, Jonathan Rees <jar@creativecommons.org> wrote:
> On Oct 21, 2007, at 7:44 PM, Peter Ansell wrote:
<snip>
> After I do this, I
> would be grateful if you would recast your message above as proposed
> recommendations (what you would do if others would) that meet the
> document requirements.
>
> Best
> Jonathan
>

As requested, I have attempted to recast the gist of my statements as recommendations. These don't quite map one-to-one onto the requirements, but they do attempt to reconcile the difficulties which made the issue necessary to discuss.

As a preamble, I should say that the scientific and medical information being referred to is assumed to be realistic, although not guaranteed to be accurate or complete, so there will always be a certain amount of ambiguity between a mathematical-style logic proof and the results of analysing a changing knowledge base. So while proofs, and replicability within certain time boundaries, are necessary, these should not hinder improvements to the way scientists or doctors will actually use the system.

1. Resources should be both malleable within organisational boundaries and subject to a versioning contract, possibly based on one or more of the following:
  # some known aspect of the metadata document, e.g. dc:version
  # the resource identifier which is used to identify the metadata document, e.g. lsid:version, NCBI accession numbers
  # another widely known contract, such as the version of the database which was used to derive the data

2.
The controlling organisation for a resource should be identifiable in some way from the resource identifier, reflecting the fact that organisations control and are responsible for the data they provide. The relationship of the given resource to resources from other organisations need not be as visible, since this would cross organisational boundaries.
  # This enables a user to decide which organisations to include or give priority to in their use of information, without having to retrieve metadata documents in order to determine this aspect of their query

3. Multiple controlling organisations may have their data and metadata in a common store, resulting in a possible ambiguity between stores and organisations. Hence, the resource identifier should allow for one or more of the following:
  # Character-based transformations of identifying strings at resolution time which do not change the meaning or nature of the original identifier. These are acceptable because they are essentially under the control of the user, and therefore the controlling organisation for the resource is not responsible for the user's independent decision in this case.
  # A known resolver service which, by some routing mechanism, provides access to a store which contains the resource metadata as a minimum, although the data may also be accessible through the same resolver service in a different, possibly local, store
  # Queries which resolve to multiple totally equivalent stores, which may then also be accessed using local caches in order to make the distributed knowledge system usable given the demands of interactive users for reasonably fast access to data

4. Essentially, one always wants to know what one is getting before buying it. In other words, one will always want to know what the characteristics of an item are before expending resources to gain access to it.
Resources, represented in some way by their identifiers, should have associated data which informs a potential consumer of the nature of the resource. Given that essential economic principle, a user is more likely to want the information retrieved by default to contain metadata. The default identifier for the resource, i.e. the identifier for the resource at least containing the metadata, and possibly the data, is the one used in the metadata of other resources when referring to the resource.

5. The retrieval of a metadata document should not discriminate between publicly available and privately available metadata. If private metadata is needed, it should be referenced from the public metadata document through a unique identifier which is not used as the identifier on other metadata documents to reference the resource.

6. The bits representing the data source for a resource do not need a publicly available URI, although this would encourage reuse and referencing of resources which are known to be equivalent. Users should however provide references to the location of the data in the metadata document, possibly using non-URI-like strings which indicate a manual process of negotiation for access through a different mechanism, such as a SOAP web service or snail mail. Knowledge of a specific access point to a store which contains the original data may require manual negotiation; for example, data on a GRID-based distributed data store may first require authentication and possibly a contribution towards the costs of the system.

7. The aim of having a systematic method of identification for resources is to create relationships between them; hence, resource metadata will inevitably contain references which indicate how one resource relates to another. Other organisations may augment these relationships with their own statements about the way their data relates to the resource as it was originally published by the controlling organisation.
For example, these statements may use rdfs:seeAlso as an RDF predicate for low-level relationships, or they may extend to owl:sameAs for relationships which assert that the actual resources are identical, with only the organisations making the descriptions being different. This separation, although possibly not ideal as it requires more negotiation to determine the characteristics of the resource, is necessary in order to provide a definite mechanism for extending and building on the current knowledge source.

8. Resolving broken references cannot, regrettably, be consistently guaranteed within the scope of a system which presumes the existence of, and access to, the metadata using a chain of methods. By chain of methods, we mean a process such as the following:
  # Look at the resource identifier and attempt to load the metadata (and/or data) from a local cache. The methods for performing this part may include hash values or other types of indexing optimisations, which are used at the risk of the user. (Fastest)
  # Retrieve the metadata from the currently existing controlling organisation, as indicated in the identifier. This is the only guaranteed way to retrieve accurate metadata.
  # Have a human attempt to determine what the controlling organisation may be at the current time and attempt to retrieve the metadata using some mechanism, which may be emailing the domain authority for the new organisation. Given that these mechanisms are not perfect, there is no guarantee that access to the data will be provided.

9. The entire set of human information is constantly evolving, and although reality can be assumed not to change in response to our assertions about it, our classifications of it and knowledge about it will. Hence, no guarantees are possible about the consistency of, or the set of information elements included in, a metadata document.
A metadata document exists, one, to document the essential details of the resource in question; two, to detail the method which was used to record the details about the resource; and three, to detail the relationship of the resource, or the metadata document, to other resources or other metadata documents which are related to the resource in question. There will always need to be a distinct level of knowledge, based on the actual metadata details and published specifications, as to what the relationship is. For example, a resource-to-resource relationship would take the form of an RDF statement (uri, predicate, uri), where the meaning is known from the published standard. A relationship between a resource and another metadata document about the same resource may take the form (predicate IS owl:sameAs), where the person interpreting the resource would know from the existence of this assertion to aggregate the two metadata documents as describing a single resource.

The semantics of versioning the degree of knowledge which we have about a resource should also be determined at this level, likely, in the RDF case, using the Dublin Core versioning vocabulary. There is no necessity to push the semantics of versioning into the general area of identifier resolution, although if a major breakthrough is made where the data changes, the controlling organisation would find it necessary to create an alternative identifier for the resource to accommodate people who used the previously known data. If another organisation were to publish alternative data about what is essentially the same resource, but which has been observed or detailed from a different perspective, then their "version" should use a unique identifier reflecting their organisation's input into the process.
Any future publication and peer acceptance of their version will inevitably change the way third parties use the data in their own research, so third parties should have specific knowledge of the transfer, possibly through interrelationships between metadata documents, such as supercededBy. The essential aspect of this change is that the system need not be restricted by past uses of the information inside it. This gives another reason for the existence of constantly changing metadata documents, which reflect changes in knowledge without always having to reflect these changes in the identifier used to reference the metadata document.

END

As is typical of my style, what I said is overly verbose and needs editing and clarification, but it fits what I see as the issues so far, without restricting a user to a particular mechanism.

Cheers,

Peter
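
P.S. As a rough illustration of the chain of methods in recommendation 8, here is a minimal Python sketch. Every name in it (resolve_metadata, fetch_from_authority, ask_human, the in-memory cache and the example identifier) is hypothetical and not part of any existing library or of the proposal itself. It also assumes that each fetching method simply returns None when it cannot find the metadata.

```python
from typing import Callable, Dict, Optional

# Hypothetical types for illustration: metadata is a flat dictionary.
Metadata = Dict[str, str]
Fetcher = Callable[[str], Optional[Metadata]]


def resolve_metadata(
    identifier: str,
    cache: Dict[str, Metadata],
    fetch_from_authority: Fetcher,
    ask_human: Fetcher,
) -> Optional[Metadata]:
    """Try each method in the chain in turn; return None if all of them fail."""
    # Step 1: local cache (fastest, but staleness is at the user's own risk).
    if identifier in cache:
        return cache[identifier]

    # Step 2: the controlling organisation indicated by the identifier.
    # Step 3: a manual lookup, which carries no guarantee of success.
    for fetch in (fetch_from_authority, ask_human):
        metadata = fetch(identifier)  # assumed to return None on failure
        if metadata is not None:
            cache[identifier] = metadata  # remember the answer for next time
            return metadata
    return None


# Usage: simulate a broken reference where only the manual step succeeds.
def authority_down(identifier: str) -> Optional[Metadata]:
    return None


def manual_lookup(identifier: str) -> Optional[Metadata]:
    return {"identifier": identifier, "source": "manual"}


cache: Dict[str, Metadata] = {}
result = resolve_metadata(
    "urn:example:org-a:record1", cache, authority_down, manual_lookup
)
```

Caching the result of the manual step is a choice made only for this sketch. Since that step offers the weakest guarantee, a real system might prefer to mark such entries as unverified.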
Received on Thursday, 25 October 2007 02:12:03 UTC