
RE: ungetable http URIs

From: Smathers, Kevin <ks@exch.hpl.hp.com>
Date: Fri, 5 Dec 2003 10:59:01 -0800
Message-ID: <40700B4C02ABD5119F0000902787664408CF35B5@hplex1.hpl.hp.com>
To: "Butler, Mark" <Mark_Butler@hplb.hpl.hp.com>, SIMILE public list <www-rdf-dspace@w3.org>

Joseki queries do not always have to look like they do today.  My recommendation for an implicit fetch-type query from Joseki was to make URLs of subgraphs look more like what Stefano was suggesting (regardless of '#' versus '/' versus '?').

-----Original Message-----
From: www-rdf-dspace-request@w3.org [mailto:www-rdf-dspace-request@w3.org] On Behalf Of Butler, Mark
Sent: Friday, December 05, 2003 8:52 AM
To: SIMILE public list
Subject: RE: ungetable http URIs


Hi Stefano

> Please excuse my ignorance on the topics (I'm trying to get up to
> speed, but I have still a good way to go), but it seems to me that 
> between URL gettability and flat huge xml files served straight from 
> disk by Apache HTTPd, there is a *long* range in between, 
> depending on 
> what granularity you want to access that data.

> Let me understand this first: what kind of data would you want to
> access? what you would want to *get* from those URLs? what kind of 
> granularity are you envisioning?

Actually the email you quote was pretty confusing. I'll try to explain it better.

We have an XML version of the Artstor corpus that is 100,000 records and 272 megabytes in size. I split this into 34 files of approximately 8 megabytes each to make it more manageable. I then ran it through an XSLT transform to convert it into RDF/XML, creating 34 files of about 29 megabytes each.

When you do the conversion to RDF, it becomes apparent that each Artstor record corresponds to several "data objects". By a data object, I mean each bit of subgraph that is associated with a unique URI. See "RDF Objects", Alex Barnell, http://www1.bcs.org.uk/DocsRepository/03700/3772/barnell.htm

These data objects are
- a descriptive metadata object
- five technical metadata objects, each describing an image of the resource (described by the previous object) at a different resolution
- a collection object
- a creator object
- possibly one or more objects from each of the following categories: geographic, id, image_id, largerEntity, material, object_id, series, site, source, subject, and topic

Now some of these objects overlap, e.g. there might be a number of Artstor records with the same creator, topic or geography etc. So I'd estimate that for each Artstor record there are about 10 data objects, which means we have roughly 1,000,000 data objects in the corpus.

So the point I was making is that if we are going to make every data object retrievable via Apache, then we need to take the 1 big file, or the 34 files, and break them up into 1,000,000 files, one per data object. So, at least for the demo, it seemed to me that doing this via Apache isn't a good plan, unless we have something that can generate those 1,000,000 files automatically.
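For what it's worth, generating those files automatically is not hard in principle. Here is a minimal sketch, assuming the corpus has been serialized as N-Triples (one triple per line) and that a "data object" is simply all the triples sharing a subject URI; the file-naming scheme is my own invention, not anything we've agreed on:

```python
# Hypothetical sketch: split an N-Triples dump into one file per data
# object (subject URI), so each object becomes a static file Apache can
# serve. The input format and naming scheme here are assumptions.
import hashlib
import os

def split_by_subject(ntriples_lines, out_dir):
    """Group N-Triples lines by subject URI and write one file per subject.

    Returns the number of data objects (distinct subjects) written."""
    groups = {}
    for line in ntriples_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        subject = line.split(None, 1)[0]  # first token is the subject URI
        groups.setdefault(subject, []).append(line)
    os.makedirs(out_dir, exist_ok=True)
    for subject, lines in groups.items():
        # Hash the URI to get a filesystem-safe, unique file name
        name = hashlib.md5(subject.encode()).hexdigest() + ".nt"
        with open(os.path.join(out_dir, name), "w") as f:
            f.write("\n".join(lines) + "\n")
    return len(groups)
```

Running something like this over the 34 RDF files would give the 1,000,000 static files, at the cost of managing that many files on disk.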

Andy's Joseki server can make the data objects available, although via a query, so this seems like a good candidate, although I note you prefer URLs to queries - see the discussion later.

I think there are two reasons why getable / ungetable has been confusing. First, currently all the Artstor data we have is in this file, so it isn't necessary to look elsewhere for data, except perhaps to retrieve the schema. Second, Haystack, when it sees a URL, tries to retrieve it automatically. So when the Haystack team loaded the Artstor data into Haystack, it produced a lot of 404 errors, because it tried to retrieve all the URLs.

Assuming we stick with URLs, there are two ways to overcome this problem: a) put something behind the URL, or b) change the URL to a URN, so Haystack doesn't try to retrieve anything. The revised data takes approach a).

> This means that if you want, say, the entire RDF schema, you access
> 
>   http://whatever/category/schema/
> 
> or if you want just a concept you can do
> 
>   http://whatever/category/schema/concept

People do seem to create RDF namespaces like this as well (I think DC, and definitely LOM, spread their schemas over a number of URLs) but it can make things ugly because

- depending on the processing model used by the processor looking at your RDF, you may need to put a copy of the schema at every concept URL, not just at the schema URL

- RDF tools and serializations vary in how much namespace abbreviation they can do, and certain namespace conventions work better with certain serializations than others (consider my email about the CIDOC vocabulary in N3). Using a namespace convention like this means that most tools will have to quote the entire namespace rather than being able to use a prefix, and personally I think that reduces readability. As I've noted before, I think abbreviating namespaces is important to making RDF readable; for example, as N3 is a bit better at this than RDF/XML, this generally makes N3 a bit more readable.

However, the point you are making is about really big schemas, so there these issues disappear because a) as you note, splitting the schema up makes it easier to use and b) if it's that big we won't be loading it into a text editor to read it :)

> note that I personally dislike the # notation that RDF/XML
> keeps using 
> so much exactly because anchors are not supposed to be driven server 
> side and prevent this kind of dynamic operation on the server side. 
> But the # notation is ingrained into the very nature of RDF/XML and 
> this is, IMO, really sad because it will turn out to be a huge 
> issue later 
> on, especially when RDF schemas get bigger and bigger.

Yes, I understand it has been an issue of contention. 

> Anyway, as for implementation, RDF is highly relational (and that's,
> IMO, the reason why the RDF/XML syntax feels so ugly) so it would be 
> straightforward to store it into a relational database (I 
> remember that 
> last year Eric Prud'hommeaux was working on defining those abstract 
> mappings).

One of the key problems SIMILE is facing is dealing with heterogeneous, semi-structured data. So we could put this in a relational database, but it's awkward. One of the things I learnt early on in the project is that DSpace's use of a relational database is unusual in the library community
- most library systems are based on hierarchical databases for this very reason.

However we now have some other databases with similar properties to hierarchical databases, e.g. object databases, semistructured databases, XML databases and finally persistent RDF models (Andy tells me the SW community prefer the term RDF knowledge bases - ugh). I think if we got a firm of consultants in to solve the SIMILE problem, they would probably use something like Tamino - in fact a number of other projects similar to SIMILE, such as TED or Gort, are doing exactly this. However one of the goals of SIMILE is to demonstrate the Semantic Web works, so we have to take the persistent RDF approach.

For example, Jena uses JDBC to persist RDF models in databases like MySQL or Postgres, and Joseki then makes those models queryable via the web.

> I highly recommend against this approach. if you want URIs
> to be long 
> lasting, you can't associate them to the semantics of retrieval or 
> you'll be stuck with it forever.
> 
>   http://whatever/category/schema/concept
> 
> is, IMHO, much more long-lasting than anything like
> 
>   http://whatever/lookup-service?get="schema/concept"
> 
> Concerns should be kept separate, even if this makes the job
> harder. 
> In my experience, keeping concerns separate *does* pay off later on, 
> resulting in a steeper curve in the beginning, but a nicer plateau 
> later.

Although Joseki uses the latter, it's a simple matter to write a servlet that accepts the former, rewrites those URLs into queries, and passes them on to Joseki - so does the distinction really matter?
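To illustrate, the rewrite the servlet would do is essentially this (a sketch, assuming the hypothetical URL layouts from this thread - the host, path layout, and "get" parameter are placeholders, not Joseki's real interface):

```python
# Hypothetical sketch: map a clean, long-lasting URL onto the query form
# that the server actually answers. All names here are placeholders
# taken from the examples in this thread.
from urllib.parse import quote

def rewrite(clean_url):
    """Turn http://whatever/collection/type/obj into the lookup-service form."""
    prefix = "http://whatever/"
    if not clean_url.startswith(prefix):
        raise ValueError("unexpected host")
    path = clean_url[len(prefix):]
    return "http://whatever/lookup-service?get=" + quote(path, safe="/")
```

The clean URL is what gets published and persists; the query form stays an internal implementation detail that can change without breaking anyone's links.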

Also, we are talking about instance data rather than schemas here, so to help further discussion, here are the three possibilities:

i) http://whatever/collection/dataobjecttype#dataobject

This is the approach currently proposed. Note: the reason for including dataobjecttype is to generate unique URLs, rather than to place metadata in the URL, as that would be a bad thing.

ii) http://whatever/collection/dataobjecttype/dataobject

(Stefano's preference)

iii) http://whatever/lookup-service?get=collection/dataobjecttype/dataobject

(how you would query Joseki)

hope this helps, best regards

Dr Mark H. Butler
Research Scientist                HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/



Received on Friday, 5 December 2003 13:59:05 EST
