Yahoo! Messenger: Conference robert_tansley-3442 started.
Yahoo! Messenger: dstuve has joined the conference.
Yahoo! Messenger: peter_breton has joined the conference.
mbass6: hello?
robert_tansley: could easily script that
peter_breton: Hi mick
robert_tansley: Hi Mick
mbass6: sorry 10 mins late.
robert_tansley: That's odd, I didn't get a message saying you'd joined, Mick
mbass6: Well, I didn't. I just showed up at my desk and the window was open, w/ conference started.
peter_breton: In vanilla CVS you could script it easily -- but here at MIT it ain't so easy
mbass6: Peter: can you email me log of YIM so far? --
peter_breton: We just started
robert_tansley: Mick: have you seen the latest diagrams I sent? It's probably best if you give those a quick glance
robert_tansley: We haven't discussed anything substantive yet.
mbass6: Have diagrams, have examined. Looks cool.
peter_breton: Ah, NOW I get the "joined the conference" message
peter_breton: Actually I see TWO of you, Mick!
mbass6: I just found the "join my conference window" and pressed the join button.
mbass6: weird.
peter_breton: Is one of them your evil twin?
mbass6: It's a turing test!!
mbass6: Will the real mick please type now...
peter_breton: Maybe one of them is Mini-Mick
peter_breton:
mbass6: So tell us about your picture, Robert... --
robert_tansley: well, the basic idea is to have a "bundle metadata" bit-stream (or DBMS row, etc)
robert_tansley: which maps bit-stream IDs to filenames
robert_tansley: those two diagrams are two views of how it could fit in (there are obviously many more permutations)
robert_tansley: Dave: that mail you sent me yesterday - did it just go to me?
robert_tansley: (the HTML tree one)
dstuve: Yes - just trying out an idea.
dstuve: I take it you liked it?
dstuve:
robert_tansley: well it's an approach
robert_tansley: not sure about making it a "special kind of bit-stream" though
peter_breton: Could you give us a brief recap?
robert_tansley: We can use policy to solve the problem of external references. I say we put our foot down and say something like "If you want to archive HTML, it must be self contained." (No external references, all relative links.)
robert_tansley: [incoming spam]
robert_tansley: Typing is difficult, and especially so if you try and consider a tree of HTML parts as a single document. Why not have a type of 'HTML tree'? We can have a special bitstream that is not a file but more like a directory or tree of files, and we can give it a tree type. We can then reference sub-parts to the bitstream like it was a directory. (Easy and powerful paradigm.) Example: HTML tree with a page and a gif submitted; bitstream72 created, of type html-tree or folder; you could then reference bitstream72/index.html or bitstream72/images/test.gif, etc.
peter_breton: Ah, yes, you did say this in person yesterday
peter_breton: I disagree on the first part -- the "must be self contained"
dstuve: Basically it's a folder object of some sort - much like Robert has drawn with the 'bundle' object.
peter_breton: The second part -- the directory type -- is interesting
mbass6: [folder] I see this as contextual metadata about the local environment from which the bitstreams were submitted.
mbass6: >> we need to get the semaphore going --
robert_tansley: the second diagram shows how you could do away with the bundle object - I'm not sure whether that's wise
robert_tansley: --
peter_breton: >>
peter_breton: 1) Regarding the policy, I think we should be flexible >>
peter_breton: What we want to say is more like, "DSpace 1.0 provides the following level of support for HTML documents" >>
peter_breton: This may amount to the same thing in practice >>
dstuve: peter, let's talk about the bundle object instead.
peter_breton: Are we all on the same page with the policy then?
>>
peter_breton: --
mbass6: >> I'd like to suggest that we run with the "flexible policy" for now, as Peter suggests, and move on to the bundle object --
robert_tansley: <<
robert_tansley: The bundle metadata could contain a flag "refers to external content", which, when present, could prompt a dissemination service to say "some of the links in this are out of DSpace's control" or somesuch. --
robert_tansley: <<
robert_tansley: A question is, is that bundle object specific to HTML, or general to multi-file formats? >>
robert_tansley: Perhaps that example of an e-mail with embedded MIME objects would be worth working through (but probably offline) --
mbass6: >>
mbass6: Peter recently suggested that we address the "what extra info gets stored" directly with schemas that we construct. >>
mbass6: It seems that Robert's suggestion to include "HTML Bundle Metadata" lines up with this. >>
mbass6: We would define a schema about what info we grab upon submission. >>
mbass6: That would be independent from any inferences downstream services might be able to make. --
dstuve: <<
dstuve: Bundling could handle the case of compound documents of all sorts, like 50-page theses where each page is a tiff
dstuve: --
robert_tansley: <<
robert_tansley: Sure - would we need a separate bundle metadata schema in each case? --
mbass6: <<
mbass6: My vote: yes, especially initially. We may discover that they can converge. >>
mbass6: But I would imagine that the contextual MD required to effectively service a thesis would be different from that required to service a compound HTML doc.
--
peter_breton: <<
peter_breton: I suggested two poles yesterday: one is hard-coding -- simply listing all the relationships we want to support, or can practically support, etc >>
peter_breton: This is of course inflexible -- but it is simple, concrete and grounding >>
peter_breton: The other pole is to use RDF to characterize them >>
peter_breton: The advantages of doing that are we'd get some real world experience with it very quickly, and so would flush out issues with the technology and a loose-schema architecture --
robert_tansley: <<
robert_tansley: "Relationships" is a tricky issue >>
robert_tansley: In HTML, the relationships are in the syntactic structure. In the example of the TIFF thesis pages, the relationship is a more complex issue (e.g. "page2.tif" comes after "page1.tif")
robert_tansley: >>
robert_tansley: Details of some of the relationships are within the bit-streams themselves, in others it has to be defined externally somehow, in the representation information. --
peter_breton: <<
peter_breton: You mean, defined by the user? --
robert_tansley: <<
robert_tansley: The submitting user? no, I don't think so. Well, it has to be worked out as part of the submission process. In the case of TIFF'd thesis, if that's a supported type, there's a standard way of storing those TIFFs and the relationship between them is known (in the representation information.) Services can then determine how to render the pages (in page order etc.)
robert_tansley: --
mbass6: <<
mbass6: Even in the HTML Example, the info is a mixture of relationships specified in the bitstreams themselves, and relationships derived from external context. >>
mbass6: In Robert's example, to know that OBJ-004 embeds OBJ-005 image, we must know about _both_ >>
mbass6: the html link , and that >>
mbass6: OBJ-005 has a "fig/img1.jpg" relationship with OBJ-004 (which comes from the submitter's File system, not from the html text).
--
robert_tansley: <<
robert_tansley: That's the information captured in the HTML bundle schema. But yes, to know the relationship, you have to have the HTML file, know the HTML specs, and have the HTML bundle metadata. >>
robert_tansley: However, that is pretty trivial as it happens. --
robert_tansley: <<
mbass6: agree --
dstuve: <<
peter_breton: Dave -- Robert grabbed it first!
dstuve: sorry!
robert_tansley: The difficulties come if you start trying to represent arbitrary relationships that an individual submitting user comes up with - e.g. they submit a load of scanned PDFs, each a single page of a paper they've written. If DSpace doesn't support a "multi-PDF document" bundle, I really don't see how we can get the user to input that relationship in a machine understandable way. --
peter_breton: Dave?
dstuve: <<
dstuve: Bundling doesn't have to be used in all cases >>
dstuve: but it seems very useful in cases where you have a load of items that seem to be related, especially with a path <<
dstuve: such as index.html, images/gif.1, gif.2, etc, <<
dstuve: or with theses pages /page1.xxx, /page2.xxx, etc. <<
dstuve: then the disseminator can play tricks with incoming URL requests and figure out which part to deliver <<
mbass6: <<
dstuve: but with a multi-part document such as 10 related PDFs, you may just want to make them 10 pieces of content, no bundle. --
peter_breton: <<
peter_breton: Hey!
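A minimal sketch of the lookup discussed above, assuming a bundle is just a mapping from bit-stream ID to the filename recorded at submission. The IDs OBJ-004 and OBJ-005 and the path fig/img1.jpg come from the discussion; the filename index.html for OBJ-004, the function name resolve_link, and the dict layout are hypothetical, and none of this is DSpace code:

```python
import posixpath

# Hypothetical bundle metadata: bit-stream ID -> filename at submission time.
# "index.html" for OBJ-004 is an assumed name, used only for illustration.
BUNDLE_METADATA = {
    "OBJ-004": "index.html",
    "OBJ-005": "fig/img1.jpg",
}


def resolve_link(bundle, source_id, href):
    """Return the ID of the bit-stream that a relative link points to.

    bundle    -- mapping of bit-stream ID to submitted filename
    source_id -- ID of the bit-stream (e.g. an HTML page) containing the link
    href      -- the relative link as written in that bit-stream
    Returns None when the link does not resolve to anything in the bundle.
    """
    # Resolve the link relative to the directory of the referring file.
    base_dir = posixpath.dirname(bundle[source_id])
    target = posixpath.normpath(posixpath.join(base_dir, href))
    for bitstream_id, filename in bundle.items():
        if filename == target:
            return bitstream_id
    return None  # outside the bundle: external or missing content


print(resolve_link(BUNDLE_METADATA, "OBJ-004", "fig/img1.jpg"))
```

Neither piece of information is enough alone: the link text comes from the HTML bit-stream and the filename from the bundle metadata. A link that returns None is the kind of case the proposed "refers to external content" flag would cover.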
peter_breton: >>
peter_breton: Rob -- agree that allowing the user to input the relationships is a major difficulty >>
peter_breton: In my two paradigms approach, in the first world -- hard-code -- there would simply be a pre-determined list of relationships >>
peter_breton: In the more loose RDF-ish world, you could grab some items and indicate a relationship -- possibly even one of your own choosing -- between them >>
peter_breton: However, the system might not actually be able to do anything with your relationship except to throw it into a triple store >>
peter_breton: Or you might have well-known relationships plus the ability to add your own >>
peter_breton: --
mbass6: >>
robert_tansley: << [Allowing user to input relationships] I think we can punt on that one for a while - suffice if the object model is sufficient to allow that in the future. Perhaps a generic bundling schema with just the IDs/filenames and an English text description of the relationship. --
mbass6: >>
peter_breton: Go ahead Mick
mbass6: I think this description is quite related to 1. design objectives around the progression of content from "unknown" to "aware" to "supported" etc. >>
mbass6: and 2. to the OAIS concept of "representation information". >>
mbass6: Scenario: DSpace is unaware of high-level content-type "thesis" >>
mbass6: But someone decides to submit a fileset with the structure that Rob/Dave describe. >>
mbass6: Now, (and especially if this "first" is happening with help from DSpace staff) >>
mbass6: we can clearly capture some text, prose, human-parseable description along the lines of >>
mbass6: "This fileset represents a thesis as a sequence of page images, with subsequent pages indicated by a sequence of filenames" >>
mbass6: or something like that. >>
mbass6: Later, we may decide to strengthen support for this kind of fileset.
>>
mbass6: So we could define a "thesis" schema that indicated that this organization was expected, and made it easier for machine-driven services >>
mbass6: to capture and/or render the thesis. >>
mbass6: In the "hard-coded" world, code would just assume that >>
mbass6: objects of type "thesis" had the structure described in the prose representation information. No generic relationships required (except in the code). >>
mbass6: In the RDF world, we would have more generic relationships like a "sequence" class, and a "follows in sequence" relationship, or some such, >>
mbass6: and these relationships would allow services to reconstruct the thesis. Do I understand those two extremes, approaches? --
peter_breton: >>
peter_breton: Backing up a bit to Rob's last comment >>
peter_breton: I'm a bit confused as to why representing these things to the user can be punted on >>
peter_breton: Perhaps you just mean arbitrary, user-defined relationships? >>
peter_breton: Cuz it seems that if we use RDF as internal representation, we should be able to represent whatever we like >>
peter_breton: It's making sense of it that becomes difficult --
robert_tansley: <<
robert_tansley: Representing these things _to_ the user? I didn't comment about that. I commented about "getting arbitrary relationships _from_ the user". --
robert_tansley: <<
peter_breton: <<
peter_breton: Go ahead
robert_tansley: Even in the case of the RDF world, getting those relationships from the user is not easy. In a lot of cases, the users won't know. >>
robert_tansley: We also need to be able to distinguish between "supported" relationships (as in HTML bundles) and arbitrary user-specified relationships. >>
robert_tansley: As for punting on it: We can easily define a "user-defined relationship bundle metadata schema" (or something with a better name.) That could just be some English prose. At a later date we could have a similar schema that expresses the relationships as RDF.
Basically, this all fits into the model I've proposed - and it's that model we're (or at least I'm) trying to tie down now, rather than solve the whole DSpace problem in one go. --
peter_breton: >>
peter_breton: "to/from the user" -- I mean that we have to be able to represent the relationships TO the user in order to get information FROM the user >>
peter_breton: If the user indicates that certain parts of a document, or certain documents, etc etc are related, they do so because the system provides some means of expressing this relationship --
mbass6: >>
mbass6: I would assert that this task (represent relationships to user, capture relationships from) will not be performed directly by the users for quite some time in the future, >>
peter_breton: !!
mbass6: but rather by libraries consulting staff who take a more interview-oriented approach based upon weird new incoming submissions. >>
mbass6: These consultant-types would ask questions of the user, express what they can in existing relationships, >>
mbass6: and also think about what new types of relationships need to be defined/constructed to effectively manage the corpus. --
peter_breton: >>
peter_breton: Hang on -- are we talking about arbitrary user-defined relationships again? >>
peter_breton: Because this is true for ANY relationships at all that we capture -- even pre-defined ones. The users CAN describe it to the system because the system provides some means for them to do so >>
peter_breton: As a simple example, consider asserting that two documents differ only in format --
mbass6: >>
mbass6: I agree with you. You describe the "chicken and egg" problem of how system knows to present certain questions/choices to the user. >>
mbass6: I'm simply pointing out that the chicken/egg loop gets broken by libs consulting staff who decide which kinds of relationships warrant explicit support in the system.
>>
mbass6: Explicit typing is one way to specify a "canned" set of relationships, and librarians could define those types (e.g. this is a thesis w/ structure just like those previously submitted). --
robert_tansley: << peter: two documents differ only in format -> v. good point >>
robert_tansley: Do you think that relationships between bit-streams comprising a multi-file format document, and inter-document relationships are sufficiently distinct to represent that explicitly in the architecture? --
peter_breton: Let me read that slowly....
peter_breton: I'm going to try to turn your question around....
robert_tansley: and fire it back at me
peter_breton: Do you mean something like "Should we have a representation for the assertion that multiple bitstreams are related to the same file...."
peter_breton: "and the assertion that there are relationships between multiple documents?" --
robert_tansley: << file would be better termed as "document instance" maybe? --
robert_tansley: <<
peter_breton: You're right
robert_tansley: in essence, yes, that's what I'm asking - are there two "root classes" of relationship - intra-document instance and inter-document instance?
robert_tansley: >>
robert_tansley: The intra-document one is largely mechanical UNLESS you get to the point of representing the information within the document ("argument2 refutes statement1") etc.
robert_tansley: --
robert_tansley: <<
robert_tansley: Re: your two poles - there's middle ground, I think. --
peter_breton: << I agree -- they are strawmen to flush out the middle ground --
robert_tansley: <<
peter_breton: Actually I hope there is middle ground
robert_tansley: I think architecturally, we'll be tending towards the RDF side... >>
peter_breton: agree
robert_tansley: but as far as the user interface is concerned, that will have to be far more towards the other pole. --
peter_breton: agree also
mbass6: David?
dstuve: Yeah?
peter_breton: at least the main UI -- it's a foundation for experiments there, possibly
robert_tansley: laughs out loud
mbass6: Do you have a position? --
dstuve: I think that our architecture can allow for arbitrary relationships, but we won't use them for a long while, and the GUI for presenting to the user would be intractable.
peter_breton: I have to pick up the glove on that one.....
mbass6: ?? -
dstuve: --
peter_breton: I think you could express arbitrary relationships without too much trouble -- just pick some items from DSpace and input a relationship or make some assertion
robert_tansley: I need to go... coffee talk
peter_breton: But whether you could DO anything with that at all is a different question
dstuve: Exactly what I said, isn't it?
mbass6: Robert, do you have what you need to proceed? --
peter_breton: No no
robert_tansley: << Yes, I think so - prepare for a new round of diagrams tomorrow --
peter_breton: The GUI isn't intractable at all -- in fact it's easy precisely because it's so generic
robert_tansley: <<
dstuve: Generic = hard....
robert_tansley: Peter: making a GUI for inputting arbitrary relationships may be easy, but I wouldn't like to bet on your chances of getting anything usable out of it. It'll confuse the hell out of 99.9% of users. I doubt you'd end up with much worth archiving through doing that. --
peter_breton: Rob -- I agree
dstuve: << wait Peter, you're talking in circles.
peter_breton: No -- I'm saying that making "generic" GUIs isn't difficult or intractable
peter_breton: But making generic-but-useful GUIs is
dstuve: that users can use? --
robert_tansley: The coffee talk is tomorrow, I don't have to go now --
peter_breton: Psych!
mbass6: >>
mbass6: Shall we close for this session, or are there topics on which we can progress?
--
dstuve: << I think we've beat the bundling to death --
robert_tansley: << we beat something to death, not the bundling though --
robert_tansley: <<
peter_breton: agree with Rob
robert_tansley: OK, let me ask a question. What do people think of the direction the object model diagrams I'm making are going? The right direction? Don't know? The wrong direction? --
dstuve: <<
dstuve: I think it's the right direction - should we be working out diagrams for different scenarios?
dstuve: --
mbass6: <<
mbass6: Good Direction. And I'm about to ask Robert to summarize conclusions from this dialogue... --
robert_tansley: << [scenarios] Dave- exactly what I've been doing --
robert_tansley: << The "HTML + JPEG, MARC record" is one scenario
robert_tansley: >>
robert_tansley: I'll mail round a list of other scenarios
robert_tansley: which you can add to/ argue with/ comment on etc. --
peter_breton: that would be cool
dstuve: very
mbass6: >>
mbass6: Peter, thoughts on Robert's direction question? --
peter_breton: Is it basically trying to identify what goes in the higher-level types?
peter_breton: The kinds of relationships we need to express?
mbass6: [more coming?] --
robert_tansley: all levels of types, down to individual bitstream - they're data models
robert_tansley: and I suppose the relationships between them
peter_breton: Then I think it's good
mbass6: >>
mbass6: Robert, any conclusions that you can summarize, as you perceive them? --
robert_tansley: << 1. It's a hard problem! >>
peter_breton: yep
peter_breton: It's core, as you said
robert_tansley: 2. Although we said we weren't going to have RDF in the critical path, the data model is looking more and more RDF-like >>
robert_tansley: 3. That doesn't necessarily mean (and SHOULDN'T mean) we need to rely on an RDF tool stack, but it does make the data we have far more amenable to projecting RDF and allowing RDF-type tools to do stuff with it >>
robert_tansley: 4. I think we're making progress (agree?)
--
peter_breton: agree
peter_breton: >>
mbass6: agree
peter_breton: About RDF >>
dstuve: me too.
peter_breton: I think that using it as an internal representation is not necessarily risky >>
peter_breton: The hard questions are querying, scalability of storage and so forth --
peter_breton: >>
robert_tansley: << It's like we're using RDF purely as a modelling language, not as an explicit storage mechanism >>
robert_tansley: --
mbass6: >>
peter_breton: Yes, or an internal memory representation, or a way of linking components by passing (small!) RDF graphs between them
mbass6: This is excellent progress. Any other topics that we should cover? --
robert_tansley: <<
robert_tansley: Peter: I have some pretty clear ideas about querying in the "fulfilling user search request" sense. I think that should all happen using harvesters. But that's a discussion for another day. --
peter_breton: ok
peter_breton: >>
peter_breton: Another point on RDF internally is that the alternative is something like hard-coding.
peter_breton: --
peter_breton: Perhaps tomorrow on the search/harvesting Q?
robert_tansley: <<
peter_breton: Or if it looks like a ways away you may want to capture some thought snow
peter_breton: "thoughts now"
robert_tansley: Personally, I'd like to maintain momentum on this object model stuff. We could talk about the search stuff if you like, I suppose it does have some bearing --
mbass6: I'll defer to Robert's judgement/preference. --
peter_breton: I think we should stick with the object model -- maybe some email on search/harvesting?
peter_breton: If it's not distracting, that is
peter_breton: Dave?
robert_tansley: It should be OK - I have some clear ideas but they're on a back burner for now.
robert_tansley: Perhaps I could mail an agenda for tomorrow's scheduled YIM meeting?
dstuve: sure.
peter_breton: Sounds good
mbass6: that sounds good! --
mbass6: >>
dstuve: I'd like a little email background first.
mbass6: new topic (brief) >>
mbass6: I've created a picture of DSpace software development tasks to aid with staffing / allocation of resources, in an attempt to ensure that we can get this all done w/o burning out. >>
mbass6: I'll email it to you. Can you respond with comments and holes? --
peter_breton: certainly
dstuve: << sure --
robert_tansley: OK. Are we done?
mbass6: FInsihed? --
peter_breton: Definitely flnsihed
dstuve: sure.
mbass6: >>
mbass6: whoops one more topic.
mbass6: >>
mbass6: I'd like to suggest that we each subscribe to www-rdf-dspace@w3.org. I'll mail directions to you. This is the list that we set up in concert with the W3 Advanced development team. >>
mbass6: I think it prob. makes sense to post this log there, so Eric Miller, Art Barstow, Ralph Swick can see where we are in our thinking. Comments? --
peter_breton: I am already subscribed -- and some of you have be already subscribed too
peter_breton: "may be"
robert_tansley: I'm also subscribed already
mbass6: any objections to posting this log there? --
dstuve: gosh, I guess I'm the only one not subscribed. I'll do it.
peter_breton: No go ahead, post it. Erase all the naughty words first!
peter_breton: (see, they'll see this and figure that a bunch of expletives have been deleted)
peter_breton: (unless Mick removes these lines which he probably will)
mbass6: not
mbass6: OK, I'll post, and CC: to dspace-code.
mbass6: Thanks all. --
peter_breton: OK -- see ya soon
peter_breton: See you tomorrow Rob
robert_tansley: see you later
Yahoo! Messenger: peter_breton has left the conference.