RE: ungetable http URIs

Hi Stefano

> Please excuse my ignorance on the topics (I'm trying to get up to 
> speed, but I have still a good way to go), but it seems to me that 
> between URL gettability and flat huge xml files served straight from 
> disk by Apache HTTPd, there is a *long* range in between, 
> depending on 
> what granularity you want to access that data.

> Let me understand this first: what kind of data would you want to 
> access? what you would want to *get* from those URLs? what kind of 
> granularity are you envisioning?

Actually the email you quote was pretty confusing. I'll try to explain
it better.

We have an XML version of the Artstor corpus that contains 100,000 records
and is 272 megabytes in size. I split this into 34 files of approximately 8
megabytes each to make it more manageable, and then ran it through an XSLT
transform to convert it to RDF/XML, producing 34 files of about 29
megabytes each.

When you do the conversion to RDF, it becomes apparent that each Artstor
record corresponds to several "data objects". By data object, I mean the
piece of subgraph that is associated with a unique URI. See
"RDF Objects." Alex Barnell.
http://www1.bcs.org.uk/DocsRepository/03700/3772/barnell.htm

These data objects are:
- a descriptive metadata object
- five technical metadata objects, each describing an image (at a
different resolution) of the resource described by the previous object
- a collection object
- a creator object
- and possibly one or more objects from each of the following
categories: geographic, id, image_id, largerEntity, material, object_id,
series, site, source, subject, and topic

Now some of these objects overlap, e.g. there might be a number of
Artstor records with the same creator, topic or geography. So I'd
estimate that for each Artstor record there are about 10 data objects,
which means we have roughly 1,000,000 data objects in the corpus.

So the point I was making is that if we are going to make every data object
retrievable via Apache, then we need to take the 1 big file, or the 34
files, and break them up into 1,000,000 files, one per data object. So, at
least for the demo, it seemed to me that doing this via Apache isn't a good
plan, unless we have something that can generate those 1,000,000 files
automatically.
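
To give an idea of what "automatically" could look like: if we say a data
object is simply the set of statements sharing a subject URI, then a rough
Jena sketch along the following lines would generate the files (the input
filename and output naming are made up, the package names are the Jena 2
ones, and blank node closure is ignored for simplicity):

  import java.io.FileOutputStream;

  import com.hp.hpl.jena.rdf.model.Model;
  import com.hp.hpl.jena.rdf.model.ModelFactory;
  import com.hp.hpl.jena.rdf.model.RDFNode;
  import com.hp.hpl.jena.rdf.model.ResIterator;
  import com.hp.hpl.jena.rdf.model.Resource;

  public class SplitDataObjects {
      public static void main(String[] args) throws Exception {
          Model model = ModelFactory.createDefaultModel();
          model.read("file:artstor-part-01.rdf");     // one of the 34 files
          ResIterator subjects = model.listSubjects();
          int n = 0;
          while (subjects.hasNext()) {
              Resource subject = subjects.nextResource();
              if (subject.isAnon()) continue;         // skip blank nodes
              // the statements hanging off this URI form one data object
              Model dataObject = ModelFactory.createDefaultModel();
              dataObject.add(model.listStatements(subject, null, (RDFNode) null));
              FileOutputStream out =
                  new FileOutputStream("dataobject-" + (n++) + ".rdf");
              dataObject.write(out, "RDF/XML");
              out.close();
          }
      }
  }

Run over all 34 files, something like this would produce the million or so
files, but a directory tree of a million static files still doesn't feel
like a sensible thing to serve from Apache for the demo.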

Andy's Joseki server can make the data objects available, albeit via a
query, so this seems like a good candidate, although I note you prefer URLs
to queries - see the discussion later.

I think there are two reasons why getable / ungetable has been
confusing. First, currently all the Artstor data we have is in this
file, so it isn't necessary to look elsewhere for data, except perhaps
to retrieve the schema. Second, Haystack, when it sees a URL, tries to
retrieve it automatically. So when the Haystack team loaded the Artstor
data into Haystack, it used to produce a lot of 404 errors, because it
tried to retrieve all the URLs. 

Assuming we stick with URLs, there are two ways to overcome this problem:
a) put something behind the URL, or b) change the URL to a URN, so Haystack
doesn't try to retrieve anything. The revised data takes approach a).
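
To make the contrast concrete, approach b) would have meant turning an http
URL like

  http://whatever/collection/dataobjecttype#dataobject

into a URN along the lines of (the urn namespace here is just a
placeholder, not a registered one)

  urn:whatever:collection:dataobjecttype:dataobject

which Haystack would not try to dereference, whereas approach a) keeps the
http URLs and puts something retrievable behind them.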

> This means that if you want, say, the entire RDF schema, you access
> 
>   http://whatever/category/schema/
> 
> or if you want just a concept you can do
> 
>   http://whatever/category/schema/concept

People do seem to create RDF namespaces like this as well (I think DC, and
definitely LOM, spread their schemas over a number of URLs) but it can make
things ugly because

- depending on the processing model used by the processor looking at
your RDF, you may need to put a copy of the schema at every concept URL,
not just at the schema URL

- RDF tools and serializations vary in how much namespace abbreviation
they can do, and certain namespace conventions work better with certain
serializations than others (consider my email about the CIDOC vocabulary
in N3). Using a namespace convention like this means that most tools
will have to quote the entire namespace rather than being able to use a
prefix, and personally I think that reduces readability. As I've noted
before, I think abbreviating namespaces is important to make RDF
readable; for example, N3 is a bit better at this than RDF/XML, which
generally makes N3 a bit more readable.

However, the point you are making is about really big schemas, and then
these issues disappear because i) as you note, splitting the schema up
makes it easier to use and ii) if it's that big we won't be loading it
into a text editor to read it :)

> note that I personally dislike the # notation that RDF/XML keeps using 
> so much exactly because anchors are not supposed to be driven server 
> side and prevent this kind of dynamic operation on the server side. 
> But the # notation is ingrained into the very nature of RDF/XML and 
> this is, IMO, really sad because it will turn out to be a huge issue 
> later on, especially when RDF schemas get bigger and bigger.

Yes, I understand it has been a point of contention. 

> Anyway, as for implementation, RDF is highly relational (and that's, 
> IMO, the reason why the RDF/XML syntax feels so ugly) so it would be 
> straightforward to store it into a relational database (I 
> remember that 
> last year Eric Prud'hommeaux was working on defining those abstract 
> mappings).

One of the key problems SIMILE is facing is dealing with heterogeneous,
semi-structured data. So we could put this in a relational database, but
it's awkward. One of the things I learnt early on in the project is that
DSpace's use of a relational database is unusual in the library community
- most library systems are based on hierarchical databases for this very
reason.

However, we now have some other databases with similar properties to
hierarchical databases, e.g. object databases, semistructured databases,
XML databases and finally persistent RDF models (Andy tells me the SW
community prefer the term RDF knowledge bases - ugh). I think if we got
a firm of consultants in to solve the SIMILE problem, they would
probably use something like Tamino - in fact a number of other projects
similar to SIMILE, such as TED or Gort, are doing exactly this. However,
one of the goals of SIMILE is to demonstrate that the Semantic Web
works, so we have to take the persistent RDF approach.

For example, Jena uses JDBC to persist RDF models in databases like MySQL
or Postgres, and then Joseki makes those models queryable via the web.
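
To give a flavour of what that looks like, here is a rough sketch from
memory of the Jena 2 database API (so the class and method names should be
checked against the javadoc, and the connection details and filename are
placeholders):

  import java.io.FileReader;

  import com.hp.hpl.jena.db.DBConnection;
  import com.hp.hpl.jena.db.IDBConnection;
  import com.hp.hpl.jena.rdf.model.Model;
  import com.hp.hpl.jena.rdf.model.ModelFactory;
  import com.hp.hpl.jena.rdf.model.ModelMaker;

  public class PersistArtstor {
      public static void main(String[] args) throws Exception {
          Class.forName("com.mysql.jdbc.Driver");     // MySQL JDBC driver
          IDBConnection conn = new DBConnection(
              "jdbc:mysql://localhost/simile", "user", "password", "MySQL");
          // create (or reopen) a model persisted in the database
          ModelMaker maker = ModelFactory.createModelRDBMaker(conn);
          Model model = maker.createModel("artstor");
          model.read(new FileReader("artstor-part-01.rdf"), "", "RDF/XML");
          conn.close();
      }
  }

Joseki then sits in front of that persistent model and answers queries over
HTTP.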

> I highly recommend against this approach. If you want URIs to be long 
> lasting, you can't associate them with the semantics of retrieval or 
> you'll be stuck with it forever.
> 
>   http://whatever/category/schema/concept
> 
> is, IMHO, much more long-lasting than anything like
> 
>   http://whatever/lookup-service?get="schema/concept"
> 
> Concerns should be kept separate, even if this makes the job harder. 
> In my experience, keeping concerns separate *does* pay off later on, 
> resulting in a steeper curve in the beginning, but a nicer plateau 
> later.

Although Joseki uses the latter, it's a simple matter to write a servlet so
you can use the former, rewriting those URLs as queries and passing them on
to Joseki, so does the distinction really matter?
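
Just to show how thin such a servlet would be, here is a rough sketch (the
lookup-service URL and the "get" parameter are only the placeholders used
in this mail, not the actual Joseki interface, and a real version would
probably proxy the response rather than redirect):

  import java.io.IOException;
  import java.net.URLEncoder;

  import javax.servlet.ServletException;
  import javax.servlet.http.HttpServlet;
  import javax.servlet.http.HttpServletRequest;
  import javax.servlet.http.HttpServletResponse;

  public class LookupServlet extends HttpServlet {
      private static final String LOOKUP = "http://whatever/lookup-service";

      protected void doGet(HttpServletRequest req, HttpServletResponse resp)
              throws ServletException, IOException {
          String path = req.getPathInfo();  // e.g. "/collection/type/object"
          if (path == null || path.length() <= 1) {
              resp.sendError(HttpServletResponse.SC_NOT_FOUND);
              return;
          }
          // rewrite the path-style URL into the query-style form and hand it on
          String query = URLEncoder.encode(path.substring(1), "UTF-8");
          resp.sendRedirect(LOOKUP + "?get=" + query);
      }
  }

So the clean URL form costs very little on top of the query form.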

Also we are talking about instance data rather than schemas here, so to
help further discussion, here are the three possibilities:

i) http://whatever/collection/dataobjecttype#dataobject

This is the approach currently proposed. Note: the reason for including
dataobjecttype is to generate unique URLs, rather than to place metadata
in the URL, as that would be a bad thing.

ii) http://whatever/collection/dataobjecttype/dataobject

(Stefano's preference)

iii)
http://whatever/lookup-service?get=collection/dataobjecttype/dataobject

(how you would query Joseki)

hope this helps, best regards

Dr Mark H. Butler
Research Scientist                HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/

Received on Friday, 5 December 2003 11:51:56 UTC