- From: Butler, Mark <Mark_Butler@hplb.hpl.hp.com>
- Date: Fri, 11 Apr 2003 13:41:14 +0100
- To: www-rdf-dspace@w3.org
Hi Folks,

Summary of main points:

- The team tried to split the use cases up into four categories: a) annotations: metadata generation & attachment; b) multiple schemas; c) distribution; d) dissemination. The categories are the motivating problems, so this would provide a matrix indicating which use cases touch which motivating problems.
- David Karger pointed out that distribution was not particularly useful as a category, as all use cases involved an element of distribution.
- The team discussed the problem of aggregating metadata from various sources, in particular whether it was possible to bound or simplify the problem in order to construct a constrained prototype.
- There was also some discussion about the importance of the provenance of digital information, and its relationship to the history system.
- There was some discussion about the use cases, and the need to convert existing corpora of metadata to RDF.
- Eric Miller highlighted the need for a technology that helps navigate RDF and creates UIs to allow the user to edit and visualize RDF. Particular problems include how to navigate over very large vocabularies.

Transcript:

USE CASE PRIORITIZATION

Mick Bass: Use case prioritization. What we want to do is some prototypes that work on their own, then we connect those together, and then we can start to think about architectures. I want to drive forward to a prioritization here.

Kevin Smathers: Some candidates here include the history system we talked about, where history can be shared amongst the sites.

David Karger: Once we start talking about sharing amongst sites...

Kevin Smathers: It requires a different type of distribution from the other use cases. The only thing you can't solve with publish and ingest is where you have shared ownership.

Eric Miller: I would view ownership as more metadata.

Mick Bass: The question is which use cases have aspects of these various sorts of things.

David Karger: Well, any use case has aspects of distribution if we want it to. That makes distribution useless as a category.

Mick Bass: So if we want to invest in distribution, it will not be difficult to do so.

Eric Miller: It strikes me that, like the other arcs, annotation is the arc and system usage data is the use case.

Mick Bass: I want to talk about each use case that we haven't discussed. I think there has been a lot of discussion around augmentation vs. extraction etc. Some of the use cases center on annotations. We've talked about website ingest already.

Eric Miller: This is the rdf-diff idea, which I think would be a killer.

David Karger: I think we want to let people in the library categorise metadata.

MacKenzie Smith: It's called OAI harvesting.

Eric Miller: Grumble.

MacKenzie Smith: Being able to define virtual collections that are a mix of local and external resources.

Eric Miller: The sending of differences across the wire makes a lot of sense.

David Karger: We could call this external references; it could just be that we are referring to something somewhere else.

MacKenzie Smith: It's not that you want it to be a proxy for information management; you need it to be a proxy for information retrieval. I don't want to manage the object, but I do want my users to be able to find out about it.

Mick Bass: Maybe lower priority.

MacKenzie Smith: Have you figured out which of the motivating problems it might be?

Mick Bass: So is dissemination. I've got this metadata about the thing; I want to present it in a way that is presentable. What are the constraints for it to be part of this collection?
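[Aside: the "rdf-diff" idea raised above, sending only the differences between versions of a record across the wire, can be made concrete with a minimal sketch. It assumes rdflib (which comes up later in this discussion); the file names are hypothetical, and this is an illustration rather than an agreed design.]

```python
# Minimal sketch of the rdf-diff idea: compute the triples removed from
# and added to a metadata record between two versions, so only the
# differences need to be sent across the wire.
from rdflib import Graph
from rdflib.compare import to_isomorphic, graph_diff

old = Graph().parse("record-v1.rdf")  # earlier version of the record (hypothetical file)
new = Graph().parse("record-v2.rdf")  # later version of the record (hypothetical file)

# graph_diff returns (triples in both, triples only in first, triples only in second)
in_both, removed, added = graph_diff(to_isomorphic(old), to_isomorphic(new))

print("Removed triples:")
for triple in removed:
    print(" ", triple)
print("Added triples:")
for triple in added:
    print(" ", triple)
```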
Eric Miller: I think that may be a constraint we can't put up with.

David Karger: What is the downside of things being external references?

Eric Miller: Link integrity. Persistence.

David Karger: Yes, but having the name doesn't mean I can get the named object.

Eric Miller: Yes, but at the end of the day we need to provide compelling apps to the user. That said, the web is decentralized.

Mick Bass: So if I want to provide this virtual collection, with a minimum quality of service...

Kevin Smathers: The lowest possible bar is where Google is.

Eric Miller: Yes, it provides a cache also. This was an add-on due to the frustration that people expressed because they couldn't find stuff.

Eric Miller: Do you consider history to be usage metadata?

Mick Bass: The history is one form of annotation.

MacKenzie Smith: The provenance thing that people need for authentication over time.

Kevin Smathers: It also comes under distribution, due to the scenario we were discussing.

Mick Bass: The mindmap is a first cut at a matrix approach.

David Karger: My inclination would be to pick something from each. (Eric Miller agrees.)

MacKenzie Smith: For multiple schemas we need two at least.

Mick Bass: The history system is tracking two things.

Kevin Smathers: Question: is the image you have created now substantially the same as the motivating problems diagram? It doesn't have use cases attached to it.

Mick Bass: One of the pieces of feedback was about relating the two.

David Karger: What I'm not seeing in these categories is how people are going to make use of this information. In many of the use cases we have discussed there is a human interaction element.

Eric Miller: We should put a note to that effect, and use it in the prioritization of use cases.

Eric Miller: My other question, to MacKenzie: some of this stuff is only going to be exciting if we have large corpora of content. Do you have a sense of how much?

MacKenzie Smith: It depends on the timescale.

Eric Miller: Next three months.

MacKenzie Smith: OCW have just started their workflow to do 500 courses for the fall. There are about 50 learning objects to the course. For the VRA stuff we have images, but not the descriptive data.

Eric Miller: My preference is on the image side of things. It has more visual impact. The learning objects are interesting; they are like museum records.

David Karger: Quantity of records is not the issue; it's about having more than one schema.

Eric Miller: I want both. I'm trying to get a sense of what kind of collections we are talking about.

MacKenzie Smith: We don't have user interfaces to create the metadata in some of these use cases.

Eric Miller: They can export them in RDF.

MacKenzie Smith: If we need test data, we can get it.

Eric Miller: I want to see some structure on what development we will do in the next six months.

Mick Bass: One of the reasons I included schema registration, vocabulary and search is that they're a prerequisite for some of these other use cases.

Eric Miller: Can I ask what we mean by multiple schemas?

David Karger: The easy bit is allowing different communities to use different schemas. The harder bit is interoperating between different schemas.

David Karger: There's value for the library in supporting a specific schema, but that's separate from being able to search across schemas.

MacKenzie Smith: So you're saying that just developing that core infrastructure would take a while, and enable these other use cases.

Mick Bass: The question is how do I search schemas, navigate schemas, etc.
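[Aside: as a minimal illustration of "search schemas, navigate schemas", the sketch below reads an RDF Schema file and lists the classes and properties it declares, which is roughly the minimum a schema registry needs in order to drive a browsing UI. It assumes rdflib and a hypothetical RDFS file; as Eric Miller notes below, a vocabulary like VRA would first need an RDFS rendering.]

```python
# Sketch: enumerate what an installed schema declares, using rdflib.
# The schema file name is hypothetical.
from rdflib import Graph, RDF, RDFS

schema = Graph().parse("vra-schema.rdfs", format="xml")  # hypothetical RDFS rendering

print("Classes:")
for cls in schema.subjects(RDF.type, RDFS.Class):
    print(" ", cls, schema.value(cls, RDFS.label))

print("Properties:")
for prop in schema.subjects(RDF.type, RDF.Property):
    print(" ", prop,
          "| label:", schema.value(prop, RDFS.label),
          "| domain:", schema.value(prop, RDFS.domain))
```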
Eric Miller: Lots of people are doing this already. SIMILE could leverage those. It's really low-hanging fruit.

MacKenzie Smith: Right now, the need is to support schemas like VRA. That's important; creating new schemas is not as important. It's about asking: does the schema exist in the system? Show me an editor that lets us enter stuff into the schema. The registries I've seen are different to that; they're like Lego - you put together schemas from other schemas.

Eric Miller: If we qualify schemas by putting RDF in front of them, then they support the ability to mix and match things. But VRA is a DTD, not even an XML schema. Some of these other schemas are HTML documents or comma-delimited files. All these approaches have different ways of representing. What we need is representations of these schemas in an RDF Schema language, and some tools for the next people to come down the pipe to use these things.

David Karger: This is the next layer. We want a way for someone in the library to install a schema, so that it is just represented.

Eric Miller: I can name six different tools that try to solve this problem. (To MacKenzie Smith) The problem you describe is the much harder problem.

David Karger: Having the ability to navigate a schema is in itself not useful. We need to make it work too.

Eric Miller: Don't we want to see the schema through this view?

David Karger: The easier challenge is: I want to put stuff through this.

Eric Miller: There's only one system that does that, and they went out of business.

Mick Bass: I agree with the prioritization that being able to surf the schemas is different from surfing the instance data. So the system should address "here's my schema, here's the instance data, show it to me". The system needs to do some of the things under schema registration. I want to make a plug. I'd like to talk about the history system, although I'm not sure it should be called the history system. Perhaps the provenance system.

David Karger: The history of the data, or the history of the use of the system?

Mick Bass: That's a policy question, not a technology question. The plug is that we should take up the history system: as we edit instances we should reflect those edits in the repository.

David Karger: This is provenance for the content.

MacKenzie Smith: So it's no different from now.

Mick Bass: The part that needs some work here is how you visualise it, how you navigate it, how you annotate it.

MacKenzie Smith: Isn't this just a schema?

Eric Miller: We ended up using our own vocabulary.

David Karger: This is related to the thin client / fat client issue that we've discussed on email.

Mick Bass: So I'll return to this proposal for prioritization:
- Support for multiple schemas (supports the first two use cases):
  - Present the schema
  - Allow editing of instances
  - Allow search within that schema
  - Allow dissemination of instances
- History system:
  - Reflect the new instance data in the history system
  - Navigate and query from that perspective
- Navigate and query schemas (schema registry)

MacKenzie Smith: A lot of people want to know what the history of an item is. It's really important for archivists. We need some way of seeing that data. This comes up all the time; it's necessary to check that something was deposited.

David Karger: When you have things that are URIs...

MacKenzie Smith: Does this mean every time you get something new, you get a new URI?

Eric Miller: One policy we've advocated is giving new URIs.

MacKenzie Smith: So we have to decide what constitutes a change. Whatever the schema is for the history system, it's just another schema.
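[Aside: MacKenzie Smith's closing point, that the history system is "just another schema", can be sketched briefly: an edit event recorded as ordinary RDF. The history vocabulary, namespace and URIs below are invented for illustration; they are not an agreed SIMILE design.]

```python
# Sketch: one history/provenance event expressed as plain RDF triples.
from datetime import datetime, timezone
from rdflib import Graph, Literal, Namespace, RDF, URIRef

HIST = Namespace("http://example.org/history#")  # hypothetical history vocabulary
g = Graph()

event = URIRef("http://example.org/history/event/42")  # hypothetical event URI
item = URIRef("http://example.org/items/123")          # hypothetical edited item

g.add((event, RDF.type, HIST.Edit))
g.add((event, HIST.target, item))
g.add((event, HIST.agent, Literal("jsmith")))  # who made the change
g.add((event, HIST.timestamp, Literal(datetime.now(timezone.utc).isoformat())))

print(g.serialize(format="turtle"))
```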
David Karger: Yes, but this is a complex hybrid of the other problems we have, due to multiple vocabularies and multiple versions. To attack that early on is too hard.

Eric Miller: Really, is that the case?

David Karger: To do this, we make a transformation on the thing. We can't give it the URI of the resource; you give it a different one.

JE: What I hear MacKenzie being concerned about is that people need to see the version history of the resource.

MacKenzie Smith: We also have a problem where people accidentally corrupt the metadata, so we need an audit trail over time.

David Karger: Yes, but schemas change over time; how does that get represented? I think this is an advanced form of the problem.

Mick Bass: So if the general case is "solve provenance", then how do you tease that apart into some things you can actually tackle?

MacKenzie Smith: There are some policy issues we need to address urgently to be able to do this.

Rob Tansley: Yes, but it could take us a while to build a demonstrator here. It would take longer than the other use cases.

Mick Bass: Yes, but there are lots of ways to pragmatically down-scope it.

?: Well, once Jason has cleaned up the metadata, we can show a way of navigating around that.

Eric Miller: So that's almost done. A while ago, I got some of this DSpace history data. Isn't this in RDF?

Mick Bass: Yes, but with a broken schema.

Eric Miller: Yes, but once it's in RDF, we can use some RDF navigation tools. So navigating around those is pretty straightforward. Doing customised views will take a little longer.

Mick Bass: Yes, but there are some use cases that will come up, e.g. my schema changed, which instances are affected, etc.

?: Yes, that's not really an audit trail.

MacKenzie Smith: If I need to, I can come up with some real scenarios of how to do this.

Mick Bass: In the RDBMS community...

MacKenzie Smith: There's a journal; it's the same sort of thing. We need to be able to do this in a legal way.

Rob Tansley: It sounds like we need some more information about the use case here.

Mick Bass: So we need more crisply articulated examples of the use of history data.

MacKenzie Smith: We have a lot more work to do on all of the use cases. So where I would like to put effort is in the IMS / VRA use cases.

Kevin Smathers: Have you asked MacKenzie whether the shared catalogue idea is realistic? The basic idea is that it's a fair amount of work to maintain catalogue information for the assets you store in DSpace. As an asset becomes more common, its cataloguing may be more widely distributed. In that case it is useful to share the cataloguing information back. History is primarily a distributed use case.

MacKenzie Smith: In our world there is always a copy of record. Their data is the official metadata. Other people can amend it and give it back. I think what you are saying is that if two groups make changes, we need to aggregate the changes. There is always one copy that is the official master copy.

Kevin Smathers: So should there be ways of getting updates back to the root copy?

MacKenzie Smith: Usually that is less of a concern than being able to cascade updates to other people who are using that information.

Kevin Smathers: In any case, where you would notice there had been changes in provenance would be in the instance data. Importing data would allow librarians who are interested to add metadata to their own collections.

MacKenzie Smith: I can't think of a use case where we want to do that.

Eric Miller: You've pretty much described OCLC's business model for the last 30 years.

Mick Bass: In a narrow domain, e.g. MARC.
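[Aside: Mick Bass's "my schema changed, which instances are affected" example above is straightforward to sketch once the instance data is in RDF. The file name and property URI below are hypothetical.]

```python
# Sketch: find every resource in an instance graph that uses a given
# property, e.g. to assess the impact of a schema change. Assumes rdflib.
from rdflib import Graph, URIRef

g = Graph().parse("instances.rdf")  # hypothetical instance data
changed = URIRef("http://example.org/terms#creator")  # hypothetical changed property

# Collect every subject that has a value for the changed property.
affected = {s for s, o in g.subject_objects(changed)}
for resource in sorted(affected):
    print(resource)
```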
MacKenzie Smith: OCLC is the copy of record. If I make a change, I don't give it back to OCLC.

Eric Miller: It's a push model, not a pull model. Certainly bioinformatics is a community that shares data like that, and requires those updates.

Rob Tansley: We need to turn the history system into a use case rather than the name for a part of the system.

DISCUSSION OF THE PROTOTYPE DOCUMENT

Mick Bass: We want to review this, and try to come up with some additional prototypes.

Eric Miller: One of the decomposed components might be an RDF debugger, i.e. loading a graph, and at any point jumping on a node and viewing the properties on that node, i.e. navigating a piece of RDF. Part of this is just taking them and putting them in DSpace. The next thing is being able to extend those, and have relationships added, i.e. edit the graph as you go. Some basic level of infrastructure that will be helpful for adding, editing and tuning the data. These are generic components. We also need tools for scraping: whatever we choose, we need to do some transformations, e.g. with Perl or XSLT.

Kevin Smathers: Which parts do you want to transform?

Eric Miller: I'm talking about ingest really. We are going to be naive if we think all the data will be in a standard format.

Kevin Smathers: We need a strong design for the ingest system.

Eric Miller: Correct. I see that as the base level for the structure we've just described.

Kevin Smathers: I wouldn't think you want to do things ad hoc. It's such an important part of the problem.

Mick Bass: This is reinforced by the first activity Mark proposed.

Eric Miller: Once we look at the content, we look at the kind of transformations needed. The reality is that even if you do run into a situation where a community has used XML, 90% of those groups had invalid XML. Unfortunate things like that cause problems. We always get poor data. So it's partly data manipulation, partly transformation. We are talking about reverse engineering the semantics behind the structure.

Mick Bass: These are tools to facilitate the creation of instances from legacy data.

Eric Miller: As much as I'd like to see a formal set of tools, I don't think it's workable. All the use cases talk about getting data. There is a chicken-and-egg problem.

Rob Tansley: I think it is a big feasibility study for RDF and semantic web techniques as a whole. So how you get the data from other forms into RDF is important.

Kevin Smathers: It's not generally solvable, but you might be able to make a tool that helped people.

Eric Miller: The other thing is that a vocabulary is a way of rendering the content in a schema. What we want to do is annotate the schema with rendering characteristics.

Kevin Smathers: So what is Haystack?

Eric Miller: If you extract Haystack from the client, then it solves that piece?

MacKenzie Smith: There are more generalised versions of that problem.

Eric Miller: Haystack has the potential to do that.

Mark Butler: This is a general problem: we want declarative MVC approaches. Other examples include XUL and XForms.

Eric Miller: I'd throw RDFlib and Redfoot into the mix as well. They try to render to the web. The guy who has done that has done a great job of rendering RDF to the web. Another one he's done is RDFSchema.info, which harvests schemas and reports on the popularity of properties in the instance data. This is just an example of how you can use RDF to annotate schemas.

Kevin Smathers: Where's the syndicated data on Redfoot.net? What's the other URL?

Eric Miller: rdfschema.info.
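[Aside: the ingest discussion above, about transforming legacy comma-delimited exports into RDF, can be sketched briefly. The CSV layout, vocabulary and file names below are hypothetical; as Eric Miller says, real sources would need per-source cleanup.]

```python
# Sketch: turn a comma-delimited legacy export into RDF triples.
import csv
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/terms#")  # hypothetical vocabulary
g = Graph()

with open("legacy-records.csv", newline="") as f:
    # Assumes a header row with id, title and creator columns.
    for row in csv.DictReader(f):
        item = URIRef("http://example.org/items/" + row["id"])
        g.add((item, EX["title"], Literal(row["title"])))
        g.add((item, EX["creator"], Literal(row["creator"])))

g.serialize(destination="legacy-records.rdf", format="xml")
```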
Eric Miller: I'll take an action item to put out a bit more information on this, on rendering RDF based on schemas and instance data. So the generic components here are a scalable RDF store, RDF query, basically a lot of what Jena provides, e.g. graph APIs. Without going into a specific use case, those are the general things that jump out at me once we have general functional requirements. For me that means leveraging Jena, Joseki etc. I'm assuming we are not going to spend lots of time evaluating others; we are just going to use those.

Mick Bass: Combining those in interesting ways, i.e. getting a schema and providing an instance editor for it.

Eric Miller: My thought is that we will create an editing vocabulary and tailor that vocabulary to produce a variety of interfaces for the user. So if we have a title property, we might associate it with a literal string form. Constraints on elements are the necessary ones that are done at the schema level. So you still need that bit of information that says how you move from the schema declaration to how you present to the user.

Mick Bass: MacKenzie, in VRA and IMS, will we need to have these mechanisms?

MacKenzie Smith: You don't have to, but we would recommend you have them. I disagree with David's comments on instance validation. We may want to override it, but we want validation.

Mick Bass: There is an interesting prototype demonstrator here. There is probably a demonstrator that says: show me a selector that allows me to choose one or more values from a controlled vocabulary. This uses annotations to generate a UI.

Eric Miller: This is what we did with Cork. There is a finite set of actions.

Mick Bass: So Getty has 0.25 million terms. How do you pick one?

Eric Miller: From the Cork standpoint, you got a search interface and dragged it back.

MacKenzie Smith: Eventually we will have a local version of the AAT.

Kevin Smathers: There is one very large open database that is an authority file: DMOZ.

Eric Miller: DMOZ is running with one person, and a machine that is about to die. What I want for any topic is to be able to go broader or narrower on it.

Mick Bass: There are two paths: something might be local, or it might be remote and you have to work out how you interface to it.

MacKenzie Smith: You can license this whole thing; it costs us money but we can do it.

Eric Miller: As for the RDF debugger, I don't think I've seen one written in Jena.

Mark Butler: What about BrownSauce?

Eric Miller: Yes, that's right, that must be built on Jena.

Mark Butler: I have a reservation here. What you seem to be talking about is very much like XForms, or rather RDF-Forms. I worry that it is dangerous to try to reinvent things; we would be better off leveraging them. For many use cases, RDF/XML and XML are effectively two competing formats for data. We shouldn't need to reinvent for RDF all the tools that have been created for XML; we should re-use them. I know that's hard to do at the moment, but I think it is possible to fix that. The XForms folks have done a lot of good work; we should use it.

Eric Miller: Well, maybe you know better than me here Mark, but my understanding was that XForms is not being used.

Mark Butler: I don't know. I think it is being used, and there is interest in it, but people are implementing subsets of the spec. There was a lot of interest in the mobile community in using it, but I know some people were concerned about needing XML Schema processors. They thought this was too much for mobile devices. However, if we layer RDF on top, at least in its current form, footprint problems get worse.
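[Aside: Eric Miller's "editing vocabulary" idea above, e.g. associating a title property with a literal string form, can be sketched as schema annotations that a form generator reads. The ui: vocabulary and widget names below are invented for illustration; nothing like them was agreed in the meeting.]

```python
# Sketch: annotate schema properties with presentation hints, then let
# a form generator pick a widget from the hint. Assumes rdflib.
from rdflib import Graph, Literal, Namespace

UI = Namespace("http://example.org/ui#")  # hypothetical editing vocabulary
DC = Namespace("http://purl.org/dc/elements/1.1/")

hints = Graph()
hints.add((DC["title"], UI["widget"], Literal("text-field")))
hints.add((DC["subject"], UI["widget"], Literal("controlled-vocabulary-picker")))

def widget_for(prop):
    """Return the declared widget hint for a property, or a default."""
    hint = hints.value(prop, UI["widget"])
    return str(hint) if hint is not None else "text-field"

print(widget_for(DC["title"]))    # -> text-field
print(widget_for(DC["subject"]))  # -> controlled-vocabulary-picker
```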
Dr Mark H. Butler
Research Scientist
HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/
Received on Friday, 11 April 2003 08:41:32 UTC