Minutes from PI meeting 9th April

Hi Folks

Summary of main points

- The team tried to split the use cases up into four categories:
a) annotations: metadata generation & attachment
b) multiple schemas
c) distribution
d) dissemination
The categories are the motivating problems, so this would provide a matrix
which indicated which use cases touch which motivating problems. 

- David Karger pointed out distribution was not particularly useful as a
category, as all use cases involved an element of distribution.

- The team discussed the problem of aggregating metadata from various
sources, in particular whether it was possible to bound or simplify the
problem in order to construct a constrained prototype. 

- There was also some discussion about the importance of provenance of
digital information, and its relationship to the history system. 

- There was some discussion about the use cases, and the need to convert
existing corpora of metadata to RDF.

- Eric Miller highlighted the need for a technology that helps navigate RDF
and creates UIs to allow the user to edit and visualize RDF. Particular
problems include how to navigate over very large vocabularies. 

Transcript:

USE CASE PRIORITIZATION

Mick Bass: Use case prioritization. What we want to do is some prototypes
that work on their own, then we connect them together, and then we can start
to think about architectures.

I want to drive forward to a prioritization here. 

Kevin Smathers: Some candidates here include the history system we talked
about, where history can be shared amongst the sites.

David Karger: Once we start talking about sharing amongst sites...

Kevin Smathers: It requires a different type of distribution from the other
use cases. The only thing you can't solve with publish and ingest is where
you have shared ownership.

Eric Miller: I would view ownership as more metadata.

Mick Bass: The question is which use cases have aspects of these various
sorts of things.

David Karger: Well, any use case has aspects of distribution if we want it
to. That makes distribution useless as a category.

Mick Bass: So if we want to invest in distribution, it will not be difficult
to do so. 

Eric Miller: It strikes me that, like the other arcs, annotation is the arc
and system usage data is the use case.

Mick Bass: I want to talk about each use case that we haven't discussed. I
think there has been a lot of discussion around augmentation vs extraction
etc. Some of the use cases center on annotations. We've talked about website
ingest already.

Eric Miller: This is the rdf-diff idea, that I think would be a killer.

David Karger: I think we want to let people in the library categorise
metadata.

MacKenzie Smith: It's called OAI harvesting.

Eric Miller: Grumble.

MacKenzie Smith: Being able to define virtual collections that are a mix of
local and external resources. 

Eric Miller: The sending of differences across the wire makes a lot of
sense. 
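The rdf-diff idea under discussion, sending only the differences between two
versions of a graph across the wire, can be sketched at the triple level.
This is a minimal illustration, with graphs modelled as plain Python sets of
(subject, predicate, object) string tuples rather than via a real RDF
library, and example triples invented for the sketch:

```python
# Minimal sketch of "rdf-diff": compare two RDF graphs at the triple
# level and ship only the differences across the wire. Graphs are
# modelled as sets of (subject, predicate, object) tuples; a real
# implementation would also have to deal with blank nodes.

def rdf_diff(old_graph, new_graph):
    """Return (added, removed) triple sets between two graph versions."""
    added = new_graph - old_graph
    removed = old_graph - new_graph
    return added, removed

def apply_diff(graph, added, removed):
    """Apply a diff to a graph, yielding the updated version."""
    return (graph - removed) | added

old = {
    ("ex:item1", "dc:title", "Draft title"),
    ("ex:item1", "dc:creator", "Smith"),
}
new = {
    ("ex:item1", "dc:title", "Final title"),
    ("ex:item1", "dc:creator", "Smith"),
}

added, removed = rdf_diff(old, new)
# Only the changed triples need to travel:
# added   -> {("ex:item1", "dc:title", "Final title")}
# removed -> {("ex:item1", "dc:title", "Draft title")}
assert apply_diff(old, added, removed) == new
```

The receiving side only needs `apply_diff` and the two small change sets,
which is the appeal of the idea for distributed catalogues.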

David Karger: We could call this external references, it could just be we
are referring to something somewhere else.

MacKenzie Smith: It's not that you want it to be a proxy for information
management, you need it to be a proxy for information retrieval. I don't
want to manage the object, but I do want my users to be able to find out
about it.

Mick Bass: Maybe lower priority.

MacKenzie Smith: Have you figured out which of the motivating problems it
might be?

Mick Bass: So it's dissemination. I've got this metadata about the thing, I
want to present it in a way that is presentable. What are the constraints
for it to be part of this collection?

Eric Miller: I think that may be a constraint we can't put up with. 

David Karger: What is the downside of things being external references?

Eric Miller: Link integrity. Persistence.

David Karger: Yes, but having the name doesn't mean I can get the named
object.

Eric Miller: Yes, but at the end of the day we need to provide compelling
apps to the user. That said, the web is decentralized.

Mick Bass: So if I want to provide this virtual collection, with a minimum
quality of service...

Kevin Smathers: The lowest possible bar is where Google is.

Eric Miller: Yes, it provides a cache also. This was an add-on due to the
frustration that people expressed because they couldn't find stuff.

Eric Miller: Do you consider history to be usage metadata?

Mick Bass: The history is one form of annotation. 

MacKenzie Smith: The provenance thing that people need for authentication
over time. 

Kevin Smathers: It also comes under distribution due to the scenario we were
discussing.

Mick Bass: The mindmap is a first cut at a matrix approach.

David Karger: My inclination would be to pick something from each (EM
agrees).

MacKenzie Smith: For multiple schemas we need two at least.

Mick Bass: The history system is tracking two things. 

Kevin Smathers: Question: is the image you have created now substantially
the same as the motivating problems diagram?

It doesn't have use cases attached to it.

Mick Bass: One of the pieces of feedback was about relating the two.

David Karger: What I'm not seeing in these categories is how people are
going to make use of this information.

It's the case that in many of the use cases we have discussed there is a
human interaction element.

Eric Miller: We should put a note to that effect, and use it in the
prioritization of use cases.

Eric Miller: My other question, to MacKenzie: some of this stuff is only
going to be exciting if we have large corpora of content. Do you have a
sense of how much?

MacKenzie Smith: It depends on the timescale. 

Eric Miller: Next three months.

MacKenzie Smith: OCW have just started their workflow to do 500 courses for
the fall. There are about 50 learning objects per course. For the VRA stuff
we have images, but not the descriptive data.

Eric Miller: My preference is on the image side of things. Images have more
visual impact. The learning objects are interesting; they are like museum
records.

David Karger: Quantity of records is not the issue, it's about having more
than one schema.

Eric Miller: I want both. I'm trying to get a sense of what kind of
collections we are talking about.

MacKenzie Smith: We don't have user interfaces to create the metadata in
some of these use cases. 

Eric Miller: They can export them in RDF. 

MacKenzie Smith: If we need test data, we can get it. 

Eric Miller: I want to see some structure on what development we will do
over the next six months.

Mick Bass: One of the reasons I put up schema registration, vocabulary and
search is that it's a prerequisite for some of these other use cases.

Eric Miller: Can I ask what we mean by multiple schemas?

David Karger: The easy bit is letting different communities use different
schemas. The harder bit is interoperating between different schemas.

David Karger: There's value for the library in supporting a specific schema,
but that's separate from being able to search across schemas.

MacKenzie Smith: So you're saying that just developing that core
infrastructure would take a while, and enable these other use cases.

Mick Bass: The question is how do I search schemas, navigate schemas, etc.

Eric Miller: Lots of people are doing this already. SIMILE could leverage
those. It's really low-hanging fruit.

MacKenzie Smith: Right now, the need is to support schemas like VRA. That's
important; it's not as important as creating new schemas. It's about asking:
does the schema exist in the system? Show me an editor that lets us enter
stuff into the schema. The registries I've seen are different to that,
they're like Lego: you put together schemas from other schemas.

Eric Miller: If we qualify schemas by putting RDF in front of them, then
they support the ability to mix and match things. But VRA is a DTD, not even
an XML schema. Some of these other schemas are HTML documents or
comma-delimited files. All these approaches have different ways of
representing. What we need is representations of these schemas in an RDF
Schema language, and some tools for the next people to come down the pipe to
use these things.
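The re-expression Eric describes can be illustrated with a tiny fragment: a
DTD-style record type and element rewritten as an RDF Schema class and
property. This is a hypothetical sketch; the names (`ex:Work`, `ex:culture`)
are invented for illustration and are not the actual VRA Core mapping:

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/vra-sketch#> .

# A VRA-style record type, re-expressed as an RDF Schema class
ex:Work a rdfs:Class ;
    rdfs:label "Work" ;
    rdfs:comment "A created object described by a VRA-style record." .

# A DTD element becomes a property with an explicit domain and range,
# so tools can discover it and mix it with properties from other schemas
ex:culture a rdf:Property ;
    rdfs:label "Culture" ;
    rdfs:domain ex:Work ;
    rdfs:range rdfs:Literal .
```

Once every legacy schema is expressed this way, generic tools can navigate
and combine them without knowing about DTDs or ad-hoc file formats.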

David Karger: This is the next layer. We want a way for someone in the
library to install a schema, so that it is just represented.

Eric Miller: I can name six different tools that try to solve this problem.
(To MacKenzie) The problem you describe is the much harder problem.

David Karger: Having the ability to navigate schemas is in itself not
useful. We need to make it work too.

Eric Miller: Don't we want to see the schema through this view?

David Karger: The easier challenge is I want to put stuff through this...

Eric Miller: There's only one system that does that, and they went out of
business.

Mick Bass: I agree with the prioritization that being able to surf the
schemas is different from surfing the instance data. So the system should
address "here's my schema, here's the instance data, show it to me". So the
system needs to do some of the things under schema registration.

I want to make a plug. I'd like to talk about the history system, although
I'm not sure it should be called the history system; perhaps the provenance
system.

David Karger: The history of the data, or the history of the use of the
system?

Mick Bass: That's a policy question, not a technology question. The plug is
we should take up the history system: as we edit instances, we should
reflect those edits in the repository.

David Karger: This is provenance for the content.

MacKenzie Smith: So it's no different to now.

Mick Bass: The part that needs some work here is how you visualise it, how
you navigate it, how you annotate it. 

MacKenzie Smith: Isn't this just a schema?

Eric Miller: We ended up using our own vocabulary.

David Karger: This is related to the thin client / fat client that we've
discussed on email. 

Mick Bass: So I'll return to this proposal for prioritization.

Support for multiple schemas (supports the first two use cases):
- Present the schema
- Allow editing of instances
- Allow search within that schema
- Allow dissemination of instances

History system:
- Reflect the new instance data in the history system
- Navigate and query from that perspective
- Navigate and query schemas (schema registry)

MacKenzie Smith: A lot of people want to know what the history of an item
is. It's really important for archivists. We need some way of seeing that
data. This comes up all the time; it's necessary to check that something was
deposited.

David Karger: When you have things that are URIs...

MacKenzie Smith: Does this mean every time you get something new, you get a
new URI?

Eric Miller: One policy we've advocated is giving new URIs. 

MacKenzie Smith: So we have to decide what constitutes a change. Whatever
the schema is for the history system, it's just another schema.

David Karger: Yes, but this is a complex hybrid of the other problems we
have, due to multiple vocabularies, multiple versions. To attack that early
on is too hard.

Eric Miller: Really, is that the case?

David Karger: To do this, we make a transformation on the thing. We can't
give it the URI of the resource; we give it a different one.

JE: What I hear MacKenzie being concerned about is people need to see the
version history of the resource.

MacKenzie Smith: We also have a problem where people accidentally corrupt
the metadata, so we need an audit trail over time.
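The audit-trail requirement MacKenzie describes can be sketched as an
append-only log of metadata edits. A minimal illustration, with all field
and item names invented for the sketch:

```python
import datetime

# Minimal sketch of an append-only audit trail for metadata edits.
# Each entry records who changed what, when, and the before/after
# values, so accidental corruption can be detected and rolled back.

class AuditTrail:
    def __init__(self):
        self._log = []  # append-only list of edit records

    def record_edit(self, item, field, old_value, new_value, who, when=None):
        self._log.append({
            "item": item,
            "field": field,
            "old": old_value,
            "new": new_value,
            "who": who,
            "when": when or datetime.datetime.utcnow().isoformat(),
        })

    def history(self, item):
        """All recorded edits for one item, in order."""
        return [e for e in self._log if e["item"] == item]

    def last_good_value(self, item, field, before_user):
        """Value a field held before a given (possibly careless) editor touched it."""
        for e in self.history(item):
            if e["field"] == field and e["who"] == before_user:
                return e["old"]
        return None

trail = AuditTrail()
# A careless edit corrupts the title; the trail still holds the old value.
trail.record_edit("item42", "dc:title", "Correct title", "Corupted ttle",
                  "careless_user")
assert trail.last_good_value("item42", "dc:title", "careless_user") == "Correct title"
```

In the terms of the discussion, the log itself is "just another schema", but
the policy questions (what constitutes a change, who may edit) sit outside
the data structure.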

David Karger: Yes, but schemas change over time; how does that get
represented? I think this is an advanced form of the problem.

Mick Bass: So if the general case is solving provenance, how do you tease
that apart into some things you can actually tackle?

MacKenzie Smith: There are some policy issues we need to address urgently to
be able to do this.

Rob Tansley: Yes, but it could take us a while to build a demonstrator here.
It would take longer than the other use cases.

Mick Bass: Yes, but there are lots of ways to pragmatically down-scope it.

?: Well, once Jason has cleaned up the metadata, we can show a way of
navigating around that.

Eric Miller: So that's almost done. A while ago, I got some of this DSpace
history data. Isn't this in RDF?

Mick Bass: Yes but a broken schema.

Eric Miller: Yes, but once it's in RDF, we can use some RDF navigation
tools. So navigating around those is pretty straightforward. Doing
customised views will take a little longer.

Mick Bass: Yes, but there are some use cases that will come up, e.g. my
schema changed, which instances are affected, etc.

?: Yes, that's not really an audit trail.

MacKenzie Smith: If I need to, I can come up with some real scenarios of how
to do this.

Mick Bass: In the RDBMS community...

MacKenzie Smith: There's a journal; it's the same sort of thing. We need to
be able to do this in a legal way.

Rob Tansley: It sounds like we need some more information about the use case
here.

Mick Bass: So we need more crisply articulated examples of the use of
history data.

MacKenzie Smith: We have a lot more work to do on all of the use cases. So
where I would like to put effort is in the IMS / VRA use cases. 

Kevin Smathers: Have you asked MacKenzie whether the shared catalogue idea
is realistic?

The basic idea is it's a fair amount of work to maintain catalogue
information for the assets you store in DSpace. As assets become more
common, it may be more widely distributed. In that case it is useful to
share the cataloguing information back.

History is primarily a distributed use case. 

MacKenzie Smith: In our world there is always a copy of record. Their data
is the official metadata. Other people can amend it and give it back. I
think what you are saying is if two groups make changes, we need to
aggregate the changes. There is always one copy that is the official master
copy.

Kevin Smathers: So should there be ways of getting updates back to the root
copy?

MacKenzie Smith: Usually that is less of a concern than me being able to
cascade updates to other people who are using that information.

Kevin Smathers: In any case, where you would notice there had been changes
in provenance would be in the instance data. Importing data would allow
librarians who are interested to add metadata to their own collections.

MacKenzie Smith: I can't think of a use case where we want to do that.

Eric Miller: You've pretty much described OCLC's business model for the last
30 years.

Mick Bass: In a narrow domain e.g. MARC.

MacKenzie Smith: OCLC is the record of record. If I make a change, I don't
give it back to OCLC.

Eric Miller: It's a push model, not a pull model. Certainly bioinformatics
is a community that shares data like that, and requires those updates.

Rob Tansley: We need to turn the history system into a use case rather than
the name for a part of the system.


DISCUSSION OF THE PROTOTYPE DOCUMENT

Mick Bass: We want to review this, try to come up with some additional
prototypes.

Eric Miller: One of the decomposed components might be an RDF debugger, i.e.
loading a graph, and at any point jumping on a node and viewing the
properties on that node, i.e. navigating a piece of RDF. Part of this is
just taking them and putting them in DSpace. The next thing is being able to
extend those, and have relationships added, i.e. edit the graph as you go.
Some basic level infrastructure that will be helpful for adding, editing and
tuning the data. These are generic components.
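The RDF-debugger idea Eric describes (load a graph, land on a node, view its
properties, follow an arc) can be sketched in a few lines. This is a minimal
illustration over plain triples with invented example data; a real tool
would sit on an RDF store such as Jena:

```python
# Minimal sketch of RDF navigation: given a graph of triples, "jump on"
# a node and list its outgoing properties, then follow an arc to a
# neighbouring node.

def properties_of(graph, node):
    """All (predicate, object) pairs leaving a node."""
    return [(p, o) for (s, p, o) in graph if s == node]

def follow(graph, node, predicate):
    """Objects reachable from node via predicate."""
    return [o for (s, p, o) in graph if s == node and p == predicate]

graph = {
    ("ex:course1", "rdf:type", "ex:Course"),
    ("ex:course1", "dc:title", "Intro to Metadata"),
    ("ex:course1", "ex:contains", "ex:lo7"),
    ("ex:lo7", "dc:title", "Lecture notes"),
}

# Jump on ex:course1 and inspect it, then follow the "contains" arc.
assert ("dc:title", "Intro to Metadata") in properties_of(graph, "ex:course1")
assert follow(graph, "ex:course1", "ex:contains") == ["ex:lo7"]
```

Editing the graph as you go is then just adding or removing tuples; the
harder parts Eric raises (customised views, large graphs) are UI and scale
questions layered on top of this core.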

We also need tools for scraping, i.e. whatever we choose, we need to do some
transformations, e.g. in Perl or XSLT.
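The scraping and transformation step being discussed, turning legacy records
into RDF-style statements, might look like this in outline. A sketch
assuming a simple comma-delimited input; the column names and prefixes are
invented for illustration:

```python
import csv
import io

# Sketch of legacy-data ingest: turn comma-delimited records into
# subject/predicate/object statements. Real inputs are messier (invalid
# XML, ad-hoc HTML), so a production pipeline would add cleanup passes.

def csv_to_triples(text, id_column, base="ex:"):
    triples = set()
    for row in csv.DictReader(io.StringIO(text)):
        subject = base + row[id_column]
        for column, value in row.items():
            if column != id_column and value:
                # Each non-empty cell becomes one statement about the record
                triples.add((subject, "legacy:" + column, value.strip()))
    return triples

legacy = "id,title,creator\n42,Some image, Jane Doe \n"
triples = csv_to_triples(legacy, "id")
assert ("ex:42", "legacy:title", "Some image") in triples
assert ("ex:42", "legacy:creator", "Jane Doe") in triples
```

The interesting (and not generally solvable) part is the last step Eric
mentions: reverse engineering real semantics, i.e. mapping `legacy:creator`
onto an agreed vocabulary rather than a made-up prefix.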

Kevin Smathers: Which parts do you want to transform?

Eric Miller: I'm talking about ingest really. We are going to be naive if we
think all the data will be in standard format. 

Kevin Smathers: We need a strong design for the ingest system.

Eric Miller: Correct. I see that as base level for the structure we've just
described.

Kevin Smathers: I wouldn't think you want to do things ad hoc. It's such an
important part of the problem.

Mick Bass: This is reinforced by the first activity Mark proposed.

Eric Miller: Once we look at the content, we look at the kind of
transformations needed. The reality is, even when a community has used XML,
90% of those groups had invalid XML. Unfortunate things like that cause
problems. We always get poor data. So it's partly data manipulation, partly
transformation. We are talking about reverse engineering the semantics
behind the structure.

Mick Bass: These are tools to facilitate the creation of instances from
legacy data.

Eric Miller: As much as I'd like to see a formal set of tools, I don't think
it's workable. All the use cases talk about getting data. There is a chicken
and egg problem.

Rob Tansley: I think it is a big feasibility study for RDF semantic web
techniques as a whole. So how you get the data from other forms into RDF is
important.

Kevin Smathers: It's not generally solvable, but you might be able to make a
tool that helped people.

Eric Miller: The other thing is a vocabulary is a way of rendering the
content in a schema. What we want to do is annotate the schema with
rendering characteristics.

Kevin Smathers: So what is Haystack?

Eric Miller: If you extract Haystack from the client, then it solves that
piece?

MacKenzie Smith: There are more generalised versions of that problem.

Eric Miller: Haystack has the potential to do that.

Mark Butler: This is a general problem - we want declarative MVC approaches.
Other examples include XUL and XForms.

Eric Miller: I'd throw RDFlib and Redfoot into the mix as well. They try to
render to the web. The guy who has done that has done a great job of
rendering RDF to the web. Another one he's done is RDFSchema.info, which
harvests schemas and talks about the popularity of properties in the
instance data. This is just an example of how you can use RDF to annotate
schemas.

Kevin Smathers: Where's the syndicated data on Redfoot.net? What's the other
URL?

Eric Miller: rdfschema.info. I'll take an action item to put out a bit more
information on this, on rendering RDF based on schemas and instance data.

So generic components here are a scalable RDFStore, RDFQuery, basically a
lot of what Jena provides, e.g. graph APIs. Without going into a specific
use case, those are the general things that jump out at me. Once we have
general functional requirements, for me it's leveraging Jena, Joseki, etc.
I'm assuming we are not going to spend lots of time evaluating others; we
are just going to use those.

Mick Bass: Combining those in interesting ways, i.e. getting a schema and
providing an instance editor for it.

Eric Miller: My thought is we are going to create an editing vocabulary and
tailor that vocabulary to produce a variety of interfaces to the user. So if
we have a title property, we might associate it with a literal string form.
So constraints on elements are necessary ones that are done at the schema
level. So you still need that bit of information that says how you move
from the schema declaration to how you present to the user.
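The editing vocabulary Eric describes, schema-level annotations that say how
each property should be presented for editing, can be sketched as a lookup
from property to UI hint. Everything here (property names, widget names, the
`ex:AAT` vocabulary reference) is hypothetical:

```python
# Sketch of an "editing vocabulary": annotations declared at the schema
# level that tell a generic UI how to present each property for editing.

EDITING_VOCAB = {
    # property           -> UI hint declared alongside the schema
    "dc:title":   {"widget": "text-field", "required": True},
    "dc:date":    {"widget": "date-picker", "required": False},
    "ex:culture": {"widget": "controlled-vocab-selector",
                   "vocabulary": "ex:AAT", "required": False},
}

def widget_for(prop):
    """Pick an editing widget for a property, falling back to free text."""
    return EDITING_VOCAB.get(prop, {"widget": "text-field", "required": False})

assert widget_for("dc:title")["widget"] == "text-field"
assert widget_for("ex:culture")["vocabulary"] == "ex:AAT"
```

The point of the design is the fallback: a generic editor still works for
properties nobody annotated, while annotated ones get richer widgets such as
the controlled-vocabulary selector discussed below for large term lists.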

Mick Bass: MacKenzie, in VRA and IMS, will we need to have these mechanisms?

MacKenzie Smith: You don't have to, but we would recommend you have those. I
disagree with David's comments on instance validation. We may want to
override it, but we want validation.

Mick Bass: There is an interesting prototype demonstrator here. There is
probably a demonstrator that says: show me a selector that allows me to
choose one or more values from a controlled vocabulary. This uses
annotations to generate a UI.

Eric Miller: This is what we did with Cork. There is a finite set of
actions.

Mick Bass: So Getty has 0.25 million terms. How do you pick one?

Eric Miller: From the Cork standpoint, you got a search interface and
dragged it back.

MacKenzie Smith: Eventually we will have a local version of the AAT. 

Kevin Smathers: There is one very large open database that is an authority
file for DMOZ.

Eric Miller: DMOZ is running with one person, and a machine that is about to
die. So what I want, for any topic, is to be able to go broader or narrower
on that.

Mick Bass: There are two paths: something might be local, or it might be
remote, and how you interface to it.

MacKenzie Smith: You can license this whole thing, it costs us money but we
can do it.

Eric Miller: As for the RDF debugger, I don't think I've seen one written in
Jena.

Mark Butler: What about BrownSauce?

Eric Miller: Yes, that's right, that must be built on Jena. 

Mark Butler: I have a reservation here. What you seem to be talking about is
very much like XForms, or rather RDF-Forms. I worry that it is dangerous to
try to reinvent things; we would be better off leveraging them. For many use
cases, RDF/XML and XML are effectively two competing formats for data. We
shouldn't need to reinvent all the tools that have been created for XML for
RDF; we should re-use them. I know that's hard to do at the moment, but I
think it is possible to fix that. The XForms folks have done a lot of good
work; we should use it.

Eric Miller: Well maybe you know better than me here Mark, but my
understanding was that XForms is not being used.

Mark Butler: I don't know. I think it is being used, and there is interest
in it, but people are implementing subsets of the spec. There was a lot of
interest in the mobile community in using it, but I know some people were
concerned about needing XML Schema processors. They thought this was too
much for mobile devices. However, if we layer RDF on top, at least in its
current form, footprint problems get worse.

Dr Mark H. Butler
Research Scientist                HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/

Received on Friday, 11 April 2003 08:41:32 UTC