Minutes from PI call 15 August 2003 from Butler, Mark on 2003-08-15 (www-rdf-dspace@w3.org from August 2003)

From: Butler, Mark <Mark_Butler@hplb.hpl.hp.com>
Date: Fri, 15 Aug 2003 18:13:23 +0100
To: "'SIMILE public list'" <www-rdf-dspace@w3.org>
Message-ID: <5E13A1874524D411A876006008CD059F066A1E83@0-mail-1.hpl.hp.com>
 You are now talking on #simile
<em> AndyS, running late - will join in 10
<AndyS> em: it's just starting - I'll pass on the message
--> mickBass (bass@192.6.19.124) has joined #simile
<mickBass> attending: Andy S, Mark B, John G, Mick B
--> kevins2 (chatzilla@192.6.19.104) has joined #simile
<mickBass> attending: Kevin S
<mickBass> MacKenzie has confirmed by email, awaiting her arrival
<mickBass> em to join in ~10 mins
<marbut> mickBass: when we say calendar 2003 year end, this is an internal
check point
<marbut> the checkpoint is to have that demo ready and executed within the
team, so it can be presented to the alliance steering committee after the
new year
<marbut> the target is to nail all these milestones except the last one
<marbut> (now joining: MacKenzie Smith)
<marbut> (Other attendees: Mick Bass, Kevin Smathers, Andy Seaborne, John
Gilbert, Mark Butler)
<marbut> the first document frames what we are after with this first year
demo
<marbut> the second frames the contributions of the different people
involved in the work
<marbut> so the agenda for the day has 3 items:
<marbut> look at gathering the corpus, as that's the item that is front and
center
<marbut> so we've had some contact with Eric, he's going to join in 10
minutes
<marbut> so I wanted to have some discussion with you (MS) and EM about
getting the content
<marbut> the second item is discussion and review of key milestones document
<marbut> the third item is discussion and review of the who / what / why
document
<marbut> comments to the list are also welcome
<marbut> ms: I understand you have some data from AMICO. Does that include
content?
<marbut> KS: Yes
<mickBass> KS: I think it is 1 tiff per record
<marbut> MS: we did hear back about CIDOC, I think there is some test data
available for that schema
<marbut> (Now joining: David Karger)
<mickBass> marbut: no RDFS for VRA-core, have had a go at creating one
today, have just sent to list.
<marbut> MS: The AMICO one is a version of VRAcore?
<mickBass> marbut: AMICO schema is similar, but should revisit some
decisions
<marbut> MS: I could put someone on extracting records, if we need that?
<mickBass> KS: AMICO provided 50 records
<marbut> (now joining: Eric Miller)
<marbut> MS: The message about CIDOC doesn't give any sense of numbers
<mickBass> MS: message from Martin Doerr gives no numbers.  "several test
sets are available"
<marbut> we could get some records in that schema, it's not the top priority
as it's not VRA
<marbut> mickBass: When we think about the test corpus, we need to generate
two islands. The questions are tactically what's
<marbut> the best way of doing that?
<marbut> and how do we get as large an island as possible?
<marbut> and there is a risk analysis piece?
<marbut> MS: There is a lot of VRA data, but not much IMS data
<marbut> mickBass: The demo will be more convincing if we have a big
database for both types of records
<mickBass> queue andy seaborne
<marbut> DK: Can we can the demo, e.g. just select the records in the second
set so that they will match the queries
<marbut> MS: Yes, maybe the thing to do is a simple fake demo that shows the
technology will work
<mickBass> EM: on IMS, have been in contact with ? re: access to collections
through edutella
<mickBass> MS: OK to use other peoples data for demo?
<mickBass> DK: yes
<mickBass> MS: yes, OK.
<mickBass> MS: eventually all OCW records will get exported to IMS.
<AndyS> [There are only 7 Amico tiffs - 206 records in the /data directory]
<marbut> em: Do we need more AMICO records?
<mickBass> EM: edutella has been running for several years...  hopefully
larger number of IMS records
<AndyS> [ 3270 RDF statements ]
<marbut> KS: There are some things missing from the AMICO data, reference to
vocabularies which are just implied
<marbut> em: I made them up
<marbut> KS: If validity is important, we need controlled vocabularies also
<marbut> em: AMICO has their own way of structuring this stuff, using three
letter element DTDs
<marbut> This was an attempt to say "lets do this using standards"
<marbut> so it takes a small set, shows how you can represent it in RDF, but
now you could use it in other ways, it doesn't take a year to do
<marbut> so they are using it in some test beds now, so if they've used it
more we should be able to get some more data
<marbut> fairly quickly
<marbut> mickBass: For these 2 schemas, the way we ought to be behaving is
to try to pull in schemas / instances from where ever they may be. Because
<marbut> then we create options.
<marbut> It sounds like there is a risk, particularly in IMS, that we won't
be able to get a large number of records?
<marbut> MS: What do you mean by large?
<marbut> mickBass: 20K records
<marbut> MS: As DK said, there are 2 phases here - get the demo to work,
then scale it up
<marbut> KS: Isn't scalability part of the demo?
<marbut> MS: If you want a large amount of quality data that you can demo,
it may take more than 3 months
<marbut> mickBass: As andy has pointed out, we can work with the amico data
now. But until we have a substantial corpus of records, it's hard to make
<marbut> statements about the type of mapping rules we need to support
<marbut> em: with the OCLC, it wasn't the mapping rules, it's just the amount
of data meant "so what"
<marbut> you need the collection to get people's interest piqued
<marbut> mickBass: I'm not saying we don't want to work on the technology,
but I don't want us to stop getting the metadata
<marbut> I'm hearing there's risk, I'm not hearing if it's possible
<marbut> MS: VRAcore should be okay, IMS is too new, hasn't borne fruit yet
<marbut> (how do I tell if rssagent is running?)
<marbut> DK: Is 20K records enough to demonstrate scalability - we would
need a few million?
<marbut> KS: At the plenary, we agreed 100K was okay
<marbut> MS: I think there are 20K VRAcore records, not sure if we can get
this for IMS
<marbut> AS: We want a total corpus of 100K records, i.e. it has to be on
disk, to check the memory to disk
<mickBass> KS: could target VRA to DC mapping
<em> do people know about http://www.mindswap.org/2003/CancerOntology/ ?
<marbut> (yep, put it in the bibliography)
<em> good, thanks - not sure if instance data corresponding to this would be
of use to the group or not
<marbut> DK: I'd love DC to be part of this as it is very generic
<marbut> MS: There are some pretty big collections of image data with DC
descriptors
<marbut> mickBass: would it be a better demo to use vra and DC, than vra and
IMS?
<marbut> DK: Scalability and interoperability don't have to be done at the
same time. We could demo interop on small
<marbut> collections, then scalability just on one record set
<marbut> mickBass: one thing that might be useful is to create a list of
options
<marbut> eric, mackenzie, I think you have this list in your head
<marbut> MS: It's more than just data, we need licenses as well e.g. what
will happen to the data
<marbut> So each of these will take a month to talk to. So if you are
talking significant numbers of records, AMICO has them
<marbut> I haven't seen a script for the demo yet, so I'm not sure what we
are trying to accomplish.
<marbut> They will want to know what it is we want to do. Until we know what
we want to do, I need to know what the demo is,
<marbut> who it is for, etc
<marbut> mickBass: I could take demo 1a and 1b and translate it to a script
and some sort of statement of intention
<marbut> Mark: we need to look back at the OCLC demo, see how it works,
before moving onto the script
<marbut> em: this means we need to have data that is complementary for what
we want to do
<marbut> mickBass: there's a cyclical dependency here, I write a
description e.g. the type of data, what we want to do with it,
<marbut> who we want to show it to, what the boundaries are, but in some
sense we can't lock down the demo script, the script
<marbut> is dependent on data and vice versa
<marbut> em: I'm not running into that problem yet,
<marbut> mickBass: I can write it in prose, 
<marbut> ms: I want to come back to the number of records, what makes a demo
compelling is that it does something that people care about
<marbut> we need to take a different tack this time, even if it means
handwriting a hundred records
<marbut> I fundamentally disagree that we need to get a lot of records
<marbut> mickBass: there are 2 stages, we need to decide where to set the
bar
<marbut> if we hand craft the data set, and it's possible to get useful
results
<marbut> then people might say "this works, but you had to hand craft the
data"
<marbut> I'm also hearing that if that's all you do it's not quite as
compelling as it could be
<marbut> with IMS provided by the community, you can actually do useful
recall across these mapping technologies.
<marbut> at some point you have to demonstrate it works, even if the records
aren't handcrafted
<marbut> em: I can make a case for scalability. The OCLC demo was slow
with a small number of records
<marbut> you want to make it compelling, but if you are not as fast as
Google it's tough to say "look at the flexibility"
<marbut> some of us need to be focussing on data collection, some on speed
and performance, we can bring them together later
<marbut> mickBass: one is corpus gathering, the other is tidying up the
specific demo script
<marbut> we need to run these in parallel for a little while
<marbut> can we agree on thinking about vra, dc and ims?
<marbut> and can we agree to start to think about the list?
<marbut> MS: the list is 4 places for VRA, 0 for IMS
<marbut> I can start asking around, but I'm not sure what it's achieving
<marbut> mickBass: what about DC?
<marbut> em: sure
<marbut> right now?
<marbut> I can get a collection from Adobe as XMP
<marbut> MS: But is it just Dublin Core, or image collections?
<marbut> em: you could broaden this to photo.net or rdf.pic or photo
metadata etc. We could try other vocabularies e.g. friend of a friend
<marbut> MS: we need to story board the demo to show what we want to show
<marbut> MS: For us to go and write down every type of data in the whole world
<marbut> DK: I've got to go
<marbut> mickBass: I want the team involved in it? I need to get it out your
head, so the team can work on it?
<marbut> MS: maybe EM has lists in his head, but I have to do research, so it's
not a matter of a quick brain dump
<marbut> what I bring is connections to people who may or may not have data
<marbut> So I am also feeling a bit stuck
<marbut> em: it could be museum things, or satellite things etc
<marbut> ms: perhaps mick, em and me have to have a call and decide where to
get the data from
<marbut> I'm willing to get us some data, but only if it's something we
actually need and will use
<marbut> we need to put some more thought into the demo
<marbut> mickBass: I think it is a good idea for EM / MS and me to do this
<marbut> MS: I heard about another big project recently that might be
relevant, which is based in Manchester
<marbut> Jorum
<mickBass> MS: jorum, big project to build learning object repositories
<marbut> this project looks complementary, they may / may not have test
data, it would be good to talk to them
<marbut> they are almost certain to be at the meeting, because they are also
backing the project
<marbut> mickBass: it sounds like Paul is going to go
<marbut> I'll have a conversation with Mark and Paul, check we have coverage
there
<marbut> if these folks have IMS image data that they can share, that might
be helpful
<marbut> MS: I'd love to see this demo be used by an academic who uses this
for courses
Received on Friday, 15 August 2003 13:13:45 UTC