Re: what would change for me?

On 24/10/2007, Jonathan Rees <jar@creativecommons.org> wrote:
> On Oct 21, 2007, at 7:44 PM, Peter Ansell wrote:
<snip>
> After I do this, I
> would be grateful if you would recast your message above as proposed
> recommendations (what you would do if others would) that meet the
> document requirements.
>
> Best
> Jonathan
>


As requested, I have attempted to recast the gist of my statements
as recommendations:

These don't quite map one-to-one onto the requirements, but they do
attempt to reconcile the difficulties which made the issue necessary
to discuss:

As a preamble to this, I should say that the science and medical
information being referred to is assumed to be realistic, although not
guaranteed to be accurate or complete, so there will always be a
certain amount of ambiguity between a mathematical style logic proof,
and the results of analysing a changing knowledge base. So while
proofs, and replicability within certain time boundaries are
necessary, these should not hinder improvements to the way scientists
or doctors will actually use the system.

1. Resources should be both malleable within organisational boundaries
and subject to a versioning contract based possibly on one or more of
the following conditions:
   # Some known aspect of the metadata document, eg dc:version
   # A resource identifier which is used to identify the metadata
document, eg lsid:version, NCBI accession numbers
   # Another widely known contract, such as the version of the
database which was used to derive the data
2. The controlling organisation for a resource should be identifiable
in some way from the resource identifier in order to reflect the fact
that organisations control and are responsible for the data they
provide, although the relationship of the given resource to resources
from other organisations need not be as visible, because this would
cross organisational boundaries
   # This enables a user to decide which organisations to include or
give priority to in their use of information, without specifically
having to retrieve metadata documents in order to determine this
aspect of their query
3. Multiple controlling organisations may have their data and metadata
in a common store resulting in a possible ambiguity between stores and
organisations. Hence, the resource identifier should allow for one or
more of the following:
   # Character-based transformations of identifying strings at
resolution time, which do not change the meaning or nature of the
original identifier, are acceptable: they are essentially under the
control of the user, so the controlling organisation for the resource
is not responsible for the user's independent decision in this case.
   # A known resolver service which by some routing mechanism provides
access to a store which contains the resource metadata as a minimum,
although the data may also be accessible through the same resolver
service in a different, possibly local, store
   # Queries may resolve to multiple totally equivalent stores, which
may then also be accessed using local caches in order to make the
distributed knowledge system usable given the demands of interactive
users for reasonably fast access to data
4. Essentially, one always wants to know what they are getting before
they buy it. In other words, one will always want to know what the
characteristics of an item are before they expend resources in order
to gain access to it. Resources, represented in some way by their
identifiers, should have associated data which informs a potential
consumer of the nature of the resource. Given that essential economic
principle, a user is more likely to want the default information
retrieved to contain metadata. The default identifier for the
resource, ie, the one which resolves to at least the metadata and
possibly the data, is the one used in the metadata of other resources
when referring to the resource.
5. The retrieval of a metadata document should not discriminate
between publicly available and privately available metadata. If
private metadata is needed, it should be referenced on the public
metadata document through a unique identifier which is not used as the
identifier on other metadata documents to reference the resource.
6. The bits representing the data source for a resource do not need
a publicly available URI, although this would encourage reuse and
referencing of resources which are known to be equivalent. Users
should however provide references to the location of the data in the
metadata document, possibly using non-URI-like strings which indicate
a manual process of negotiation for access through a different
mechanism, such as a SOAP web service or snail-mail etc. Knowledge of
a specific access point to a store which contains the original data
may require manual negotiation, eg, data on a GRID based distributed
data store may first require authentication and possibly a
contribution towards the costs of the system.
7. The aim of having a systematic method of identification for
resources is to create relationships between them, hence, resource
metadata will inevitably contain references which indicate how one
resource relates to another. Other organisations may augment these
relationships with their own statements about the way their data
relates to the resource as it was originally published by the
controlling organisation. For example, these statements may take the
form of rdfs:seeAlso as an RDF predicate for low level relationships,
or they may extend to owl:sameAs for relationships which specify that
the actual resources are identical and only the organisations making
the descriptions differ. This separation, although possibly not ideal
as it requires more negotiation to determine the characteristics of
the resource, is necessary in order to provide a determined mechanism
for extending and building on the current knowledge source.
8. Broken references, while regrettable, cannot be guaranteed to
resolve consistently within the scope of a system which presumes the
existence of, and access to, the metadata using a chain of methods. By
chain of methods, we mean a process such as the following:
   # Look at the resource identifier and attempt to load the metadata
(and/or data) from a local cache. The methods for performing this part
may include hash values or other types of indexing optimisations which
are performed at the risk of the user. (Fastest)
   # The only guaranteed way to retrieve accurate metadata is to
source it from the currently existing controlling organisation, as
indicated in the identifier.
   # Have a human attempt to determine what the controlling
organisation may be at the current time and attempt to retrieve the
metadata using some mechanism, which may be emailing the domain
authority for the new organisation. Given these mechanisms are not
perfect, there is no guarantee that access to the data will be
provided.
9. The entire set of human information is constantly evolving and
although reality can be assumed not to change in response to our
assertions about it, our classifications and knowledge about it will.
Hence, there are no guarantees possible about the consistency and the
set of information elements included in a metadata document. A
metadata document exists to, one, document the essential details of
the resource in question, two, to detail the method which was used to
record the details about the resource, and three, to detail the
relationship of the resource, or the metadata document, to other
resources or other metadata documents which are related to the
resource in question. There will always need to be a distinct level of
knowledge based on the actual metadata details and published
specifications as to what the relationship is. For example, a
resource-to-resource relationship would take the form of the RDF
statement (uri, predicate, uri), where the meaning is known based on
the published standard. A relationship between a resource and another
metadata document about the same resource may take the form of the
same statement with owl:sameAs as the predicate, where the person
interpreting the resource would know, from the existence of this
assertion, to aggregate the two metadata documents as describing a
single resource.
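
The chain of methods in recommendation 8 could be sketched roughly as
follows. This is only an illustration under stated assumptions: the
"<organisation>:<local-id>" identifier layout, the function names, and
the in-memory dictionaries standing in for a cache and for each
organisation's store are all hypothetical and not part of any
proposal here.

```python
from typing import Callable, Dict, Optional, Tuple

# A fetcher returns the metadata for an identifier, or None if unavailable.
Fetcher = Callable[[str], Optional[dict]]


def controlling_organisation(identifier: str) -> str:
    """Read the organisation from the identifier (recommendation 2).

    Assumes a hypothetical "<organisation>:<local-id>" layout.
    """
    org, sep, local_id = identifier.partition(":")
    if not sep or not org or not local_id:
        raise ValueError(f"cannot identify organisation in {identifier!r}")
    return org


def resolve_metadata(
    identifier: str,
    cache: Dict[str, dict],
    org_fetchers: Dict[str, Fetcher],
) -> Tuple[Optional[dict], str]:
    """Try the local cache, then the controlling organisation.

    Returns (metadata, source). A source of "manual" means the chain
    failed and a human has to track the organisation down.
    """
    # Step 1: local cache, fastest, used at the user's own risk.
    if identifier in cache:
        return cache[identifier], "cache"

    # Step 2: the controlling organisation named in the identifier.
    fetch = org_fetchers.get(controlling_organisation(identifier))
    if fetch is not None:
        metadata = fetch(identifier)
        if metadata is not None:
            cache[identifier] = metadata  # remember it for next time
            return metadata, "organisation"

    # Step 3: nothing automatic is left; no guarantee of success.
    return None, "manual"


# Hypothetical organisation store and an empty local cache.
store = {"exampleorg:123": {"dc:title": "Sample record"}}
fetchers = {"exampleorg": store.get}
local_cache: Dict[str, dict] = {}

print(resolve_metadata("exampleorg:123", local_cache, fetchers))
print(resolve_metadata("exampleorg:123", local_cache, fetchers))
print(resolve_metadata("defunct:9", local_cache, fetchers))
```

In this sketch the first call is answered by the organisation, the
second by the local cache, and the third falls through to the manual
step because no fetcher is known for that organisation.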

The semantics of versioning the degree of knowledge which we have
about a resource should also be determined at this level, likely in
the RDF case using the Dublin Core versioning vocabulary. There is no
necessity to push the issue of semantics of versioning into the
general area of identifier resolution, although if a major
breakthrough is made where the data changes, the controlling
organisation would find it necessary to create an alternative
identifier for the resource to accommodate people who used the
previously known data. If another organisation were to publish
alternative data about what is essentially the same resource, but
which has been observed or detailed from a different perspective, then
their "version" should utilise a unique identifier reflecting their
organisation's input into the process. Any future publication and
peer-acceptance of their version will inevitably change the way third
parties utilise the data in their own research, so they should have
specific knowledge of the transfer, possibly using interrelationships
between metadata documents, such as supercededBy etc. The essential
aspect of this change is that it is not necessary for the system to be
restricted by past uses of the information inside it, giving another
reason for the existence of constantly changing metadata documents,
reflecting changes in knowledge, without having to always reflect
these changes in the identifier used to reference the metadata
document.
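
As a small sketch of the versioning idea above: a metadata document
can carry its own version and change freely, while a new identifier
is only minted for a major change and linked from the old one. The
dictionary layout, the "version" key, and the identifiers below are
assumptions made for illustration; "supercededBy" is simply the
relationship name used in the text, not a predicate from a published
vocabulary that I can vouch for.

```python
from typing import Dict, List


def follow_superseded(identifier: str, documents: Dict[str, dict]) -> List[str]:
    """Return the identifiers from `identifier` to the latest known one.

    Follows the hypothetical "supercededBy" link in each metadata
    document and stops at a missing document, a missing link, or a
    cycle.
    """
    chain = [identifier]
    seen = {identifier}
    while True:
        doc = documents.get(chain[-1])
        successor = doc.get("supercededBy") if doc else None
        if successor is None or successor in seen:
            return chain
        chain.append(successor)
        seen.add(successor)


# The first document has been revised to version 3 under the same
# identifier; only the major change produced a new identifier.
documents = {
    "orga:gene42": {"version": "3", "supercededBy": "orga:gene42b"},
    "orga:gene42b": {"version": "1"},
}

print(follow_superseded("orga:gene42", documents))
```

Someone holding the older identifier can still retrieve what they
originally used, and can see from the metadata that a newer
description exists.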


END

As is typical of my style, what I said is overly verbose and needs
editing and clarification, but it fits what I see as the issues so
far, without restricting a user to a particular mechanism.

Cheers,

Peter

Received on Thursday, 25 October 2007 02:12:03 UTC