PI Meeting 9-Apr-03

Present: Mick, MacKenzie, Eric, Rob, Mark (video), Kevin (video), John (phone)


MacKenzie:  Web site ingest: interesting, not necessarily in SIMILE scope

Mark: Is OCW an instance of that use case?

MacKenzie:  More about the OCW metadata; wants to archive OCW Web sites but not necessarily part of SIMILE use case

Using content management system for Web sites, also metadata (IMS etc) for learning objects.  DSpace will need to support IMS when it comes to archiving OCW site.

Eric:  Willing to put this lower on the priority list

Mick:  It's interesting because it forces you to address the various types of metadata involved in making a Web site preservable; also covers application interface side of things.  Obtaining + storing the metadata, also dissemination architecture.

Eric: Don't want to lose this use case, but we do need to prioritise, and this doesn't seem like a top priority.

Can break Web site use case into smaller parts.

Mark: Type and class issue?

Kevin: Type = how do you interpret assets; naming = how do you interpret metadata?

Mark:  Type = media type?

Mick:  One kind of type = implies binding between bits and interpretation/application for viewing.  Second = schema definition.  Third = set of rules to infer some sort of class.

Kevin: Class defines what needs to be in subgraph.  Maybe format is a better word than type.  Type is a link from an asset to some implementation.

MacKenzie: Media type doesn't always determine behaviour.  e.g. lots of things will be encoded in XML or RDF; behaviour will depend on what that XML or RDF represents

Kevin: Type means more than just MIME type
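
Illustration of the three senses of 'type' Mick lists, as a minimal Python sketch.  The names (Asset, learning_object_rule, infer_classes) and the sample values are assumptions made for illustration only; nothing here is an agreed SIMILE or DSpace model.  It also shows MacKenzie's point: two assets with the same media type can end up in different classes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Asset:
    name: str
    media_type: str                # sense 1: binds the bits to a viewer/interpretation
    schema: Optional[str] = None   # sense 2: schema definition the metadata follows
    metadata: Dict[str, str] = field(default_factory=dict)


# Sense 3: a rule inspects an asset and may infer a class for it.
ClassRule = Callable[[Asset], Optional[str]]


def learning_object_rule(asset: Asset) -> Optional[str]:
    # Hypothetical rule: IMS metadata with a title implies a learning object.
    if asset.schema == "IMS" and "title" in asset.metadata:
        return "LearningObject"
    return None


def infer_classes(asset: Asset, rules: List[ClassRule]) -> List[str]:
    # Apply every rule and keep the classes that were actually inferred.
    return [c for c in (rule(asset) for rule in rules) if c is not None]


# Same media type, different interpretation.
syllabus = Asset("syllabus.xml", "application/xml", "IMS", {"title": "Intro"})
config = Asset("config.xml", "application/xml")

print(infer_classes(syllabus, [learning_object_rule]))
print(infer_classes(config, [learning_object_rule]))
```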

Mick: Move towards glossary?

Mick displays the 3 types of er, type

Eric forwards definition of type and format

Mark: Genre seems a subjective categorisation, class an objective categorisation

MacKenzie: It is subjective, about how the formats are interpreted

Kevin: My definition of type also includes interpretation.

Mick: there do seem to be four concepts; we need a glossary and specific terms

Eric: Now I understand Kevin's terminology, I'm happy to drop this for now

Rob: Suggest Mick write up proposal for glossary for email discussion

Eric: App developers do understand 'blah is an instance of type X'.  They might not understand things of multiple types.

Kevin: If type is subjective not objective, does that make a difference?  The fact that something is in a type is objective; putting something into a type might be the subjective part of the process

Mark: 3 axes: subjective + objective; metadata + original object; explicit or implicit

Mick: Let's go along with Rob's suggestion; leave it for now.

Eric: This isn't really relevant to the actual use case document; more ubiquitous an issue than that.  Maybe this belongs in 'lessons learned' document.

Mick: On to distribution.

Rob: Perhaps a section in the 'motivating problem' section enumerating the various possibilities of distribution (and technical architecture)

Kevin: I am agnostic on smart vs thin client; the choice should be informed by the use case

Mick: Architectural decisions should not preclude either

Kevin: Designing for both might make more work

Eric: Would like to be as close to Web architecture as possible.  There are separate issues here; how the system manages its metadata internally, and a dissemination mechanism.

Kevin: Genesis is just a store, doesn't have the application part of SIMILE.  If you use Genesis you will just be able to do trusted peer level distributions, or anonymous read-only distribution.  Not suitable for a client API.

Rob: Two conversations going on:  How distribution occurs, where functionality for the users resides.

Mick: Three!  Transport mechanisms, higher-level distribution strategy, richness of presentation.  Is there some reason that we have to choose one particular approach to smart/thin client now?

MacKenzie: Too early to have this conversation?  Smart/thin client depends on use case.

Kevin: David said it would take a month to get Haystack to be a back-end accessible by Web UI.  But this would lose some of the rich functionality.

(David joins)

David: These three things are separate.

Mick: David started with 'we need to be careful about distribution'.

David: Maybe we can use the word 'federation'.  We don't even know what to do when we have all the information in front of us.  To talk about distributed queries is adding extra complexity too soon.

Kevin:  Federation has a specific meaning.  Federation implies no overlap between regions with distinct ownership.

David:  We might also wish to simply pretend that three databases are one for some purpose.

Kevin:  If you have to send your queries to different databases and reassemble results that's a different distribution problem.  Federation is a specific solution that revolves around knowing where data is located--duplication etc. not a problem.

David:  Word agnostic.

Kevin:  Federation distribution is fairly simple.  Any duplicate data is simply cache data.  In the more general case, if you don't know where the data you want is, this is a much harder problem.  Centralising data is a way of addressing this.

David:  Many solutions simulate having all the data in one place, which separates distribution problem from the problem of actually using the metadata.
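
Illustration of the two cases being contrasted, as a minimal Python sketch.  Store, federated_query and broadcast_query are hypothetical names, and identifying ownership by an identifier prefix is an assumption made for the example.  In the federated case the owner of the data is known, so a query goes to one store and any copy elsewhere is just cache.  In the general case the location is unknown, so every store is asked and the results are merged, which is one way of pretending several databases are one.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


class Store:
    """A toy metadata store holding (subject, property, value) triples."""

    def __init__(self, name: str, triples: Iterable[Triple]):
        self.name = name
        self.triples: Set[Triple] = set(triples)

    def query(self, subject: str) -> Set[Triple]:
        return {t for t in self.triples if t[0] == subject}


def federated_query(subject: str, ownership: Dict[str, Store]) -> Set[Triple]:
    # Federation: the identifier prefix says which store owns the data,
    # so exactly one store is consulted.
    prefix = subject.split(":", 1)[0]
    return ownership[prefix].query(subject)


def broadcast_query(subject: str, stores: Iterable[Store]) -> Set[Triple]:
    # General case: location unknown, so ask every store and merge.
    # Set union collapses duplicate (cached) triples.
    results: Set[Triple] = set()
    for store in stores:
        results |= store.query(subject)
    return results


dspace = Store("dspace", [("dspace:1", "title", "A")])
# The second store owns ocw:7 and also holds a cached copy of dspace:1.
ocw = Store("ocw", [("ocw:7", "title", "B"), ("dspace:1", "title", "A")])
ownership = {"dspace": dspace, "ocw": ocw}

print(federated_query("dspace:1", ownership))
print(broadcast_query("dspace:1", [dspace, ocw]))
```

The broadcast version costs one request per store for every query, which is part of why the general case is the harder problem.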

Mick:  Distributed data sources will be a big problem, and a postulated strength of RDF is distribution.  First let's build systems with all the data in one place.  Then we choose one method of distribution from Rob's proposed list of mechanisms for a use case we've already addressed; take an incremental approach.

Eric:  This is a problem that SIMILE really needs to tackle.

Recap of Website ingest conversation

Mick: OK, change topic.  Too many use cases.  Perhaps make a matrix, use cases on one axis and aspects on the other, and then choose which use cases best represent our priorities.  (Shows two mind maps)
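
Illustration of the matrix idea, as a minimal Python sketch.  The use case and aspect names are taken from these notes, but the marks in the matrix are placeholders for illustration, not decisions recorded in the meeting.  The selection step is a simple greedy cover, which is an assumption about how one might choose; Mick did not propose a specific method.

```python
from typing import Dict, List, Set, Tuple

# Rows are use cases, values are the aspects each one exercises.
# Placeholder marks only.
matrix: Dict[str, Set[str]] = {
    "learning object support": {"multiple schemas", "distribution"},
    "visual images": {"multiple schemas"},
    "OCLC authority control": {"annotation"},
    "Web site ingest": {"dissemination", "multiple schemas"},
}


def pick_use_cases(
    matrix: Dict[str, Set[str]], wanted: Set[str]
) -> Tuple[List[str], Set[str]]:
    """Greedily pick use cases until the wanted aspects are covered.

    Returns the chosen use cases and any aspects nothing covers.
    """
    remaining = set(wanted)
    chosen: List[str] = []
    while remaining:
        # Sorting the names makes ties resolve the same way every run.
        best = max(sorted(matrix), key=lambda uc: len(matrix[uc] & remaining))
        gained = matrix[best] & remaining
        if not gained:
            break  # no use case covers what is left
        chosen.append(best)
        remaining -= gained
    return chosen, remaining


priorities = {"multiple schemas", "distribution", "annotation"}
chosen, uncovered = pick_use_cases(matrix, priorities)
print(chosen, uncovered)
```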

MacKenzie: Some use cases fit together nicely; maybe should do that first.  The ordering could be better.  Citation extraction not really a serious proposal.

David:  Citation extraction has two parts; extracting the metadata, and using that metadata once you have it.  SIMILE seems more about the latter.

Eric: MEDLINE is annotation that's machine derived as opposed to human derived.  Maybe rename it?

MEDLINE maybe close to annotations as content?

David: 'Mining' to me is about extracting metadata, not necessarily analysing metadata.  Define 'annotation'?  To me it means humans adding notes.  Mining is a special case of annotation since it's metadata generated by some computational analysis

Eric/MacKenzie: To us annotation means just extending existing metadata

MacKenzie: OCLC authority control use case is annotation.

David: Mining should be 'mining from unstructured information'

MacKenzie:  OCW, visual images + biomedical images are closely related and under 'multiple schemas'

Kevin: Isn't dissemination under 'distribution'?  Biomedical images under 'distribution'?

MacKenzie: Simple use case; image metadata doesn't come from other systems, users simply want to enter metadata that isn't Dublin Core.  OCW is not a single use case; it consists of many.  'IMS support' is really the use case.

David: What comes under the scope of 'distribution'?  Distributed aspect to this but we don't want to tackle the whole problem.

Kevin: We are trying to work out what the distribution requirements are from the use cases.

MacKenzie: Let's call the use case 'learning object support' instead of OCW

Mick: Not concerned with 'round tripping'? e.g. paper in DSpace -> used in class -> OCW -> back to DSpace with more metadata

MacKenzie: It's de-duplication really; not a big problem.  SIMILE more about semantic use of metadata.

Mick: Make it safe to think about all aspects of the problem

MacKenzie: Two use cases: one is multiple schemas, one is the open courseware process, which includes distribution.  The first is the priority for me.

Eric: There is multiple schema support, and evolution.


ACTIONS

* Mick to propose terminology and the sorts of type
* Different ways of creating the illusion that all the data is in one place (achieving distribution) - ALL