
Re: ungetable http URIs

From: Stefano Mazzocchi <stefano@apache.org>
Date: Fri, 5 Dec 2003 23:05:06 -0800
Message-Id: <85865B1E-27BA-11D8-AB98-000393D2CB02@apache.org>
Cc: SIMILE public list <www-rdf-dspace@w3.org>
To: Mark Butler <Mark_Butler@hplb.hpl.hp.com>

On 5 Dec 2003, at 08:50, Butler, Mark wrote:

> Hi Stefano

Mark,

thanks so much for taking the time to explain all this to me.

>> Please excuse my ignorance on the topics (I'm trying to get up to
>> speed, but I still have a good way to go), but it seems to me that
>> between URL gettability and flat huge xml files served straight from
>> disk by Apache HTTPd, there is a *long* range in between, depending
>> on what granularity you want to access that data.
>
>> Let me understand this first: what kind of data would you want to
>> access? What would you want to *get* from those URLs? What kind of
>> granularity are you envisioning?
>
> Actually the email you quote was pretty confusing. I'll try to explain
> it better.
>
> We have an XML version of the Artstor corpus that is 100,000 records
> and 272 megabytes in size. I split this into 34 files approximately 8
> megabytes in length to make it more manageable. I then ran it through
> an XSLT transform to convert it into RDF/XML, creating 34 files about
> 29 megabytes each.

Ok.
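(As an aside, the chunking step Mark describes can be sketched mechanically. This is only an illustration: the `<record>` element name and the size cap are hypothetical, and real Artstor markup would need a streaming XML splitter rather than pre-serialized records.)

```python
# Sketch of splitting a large record-oriented corpus into files of
# roughly equal size. Records are assumed already serialized as
# independent strings; the element name and size cap are hypothetical.

def chunk_records(records, max_bytes):
    """Group serialized records into chunks no larger than max_bytes
    (a chunk always holds at least one record)."""
    chunks, current, size = [], [], 0
    for rec in records:
        if current and size + len(rec) > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        chunks.append(current)
    return chunks

records = ["<record>%d</record>" % i for i in range(100)]
chunks = chunk_records(records, max_bytes=200)
```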

> When you do the conversion to RDF, it becomes apparent that each
> Artstor record corresponds to several "data objects". By data object,
> I mean each bit of subgraph that is associated with a unique URI. See
> "RDF Objects." Alex Barnell.
> http://www1.bcs.org.uk/DocsRepository/03700/3772/barnell.htm

I can't access that, but it's ok, I get the idea.

> These data objects are
> - a descriptive metadata object
> - five technical metadata objects each describing an image at a
> different resolution of the resource described by the previous object
> - a collection object
> - a creator object
> - in fact possibly one or more objects from each of the following
> categories: geographic, id, image_id, largerEntity, material,
> object_id, series, site, source, subject, and topic
>
> Now some of these objects overlap e.g. there might be a number of
> Artstor records with the same creator, topic or geography etc. So I'd
> estimate that for each Artstor record there are 10 data objects, so
> this means we have 1,000,000 data objects in the corpus.

ok

> So the point I was making is if we are going to make every data
> object retrievable via Apache, then we need to take the 1 big file,
> or the 34 files, and break them up into 1,000,000 files corresponding
> to every data object. So, at least for the demo, it seemed to me that
> doing this via Apache isn't a good plan, unless we have something
> that can generate those 1,000,000 files automatically.

Be careful, though. Apache is not able, by default, to use the anchor 
information to retrieve a file, so you would need to write, at the very 
least, a special module for that.
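Generating those files automatically is not unthinkable, by the way: the heart of it is just bucketing triples by subject URI. A rough sketch (the URIs are the hypothetical ones from this thread, and the naive line splitter below assumes simplified N-Triples with no spaces inside terms; a real tool would use a proper RDF parser):

```python
# Sketch of what "generate those 1,000,000 files automatically" could
# look like: bucket triples by subject URI so each data object gets its
# own statically servable file. Input is simplified N-Triples; the
# naive split assumes no spaces inside terms (real data would need a
# real RDF parser). All URIs here are hypothetical examples.
import hashlib

def bucket_by_subject(ntriples_lines):
    """Map each subject URI to the list of its triples."""
    objects = {}
    for line in ntriples_lines:
        line = line.strip()
        if not line:
            continue
        subject = line.split(None, 1)[0]
        objects.setdefault(subject, []).append(line)
    return objects

def filename_for(subject):
    """Derive a flat, filesystem-safe file name from a subject URI."""
    return hashlib.md5(subject.encode("utf-8")).hexdigest() + ".nt"

data = [
    '<http://whatever/artstor/record1> <http://purl.org/dc/elements/1.1/title> "Vase" .',
    '<http://whatever/artstor/record1> <http://purl.org/dc/elements/1.1/creator> <http://whatever/artstor/creator7> .',
    '<http://whatever/artstor/creator7> <http://purl.org/dc/elements/1.1/title> "Unknown" .',
]
objects = bucket_by_subject(data)
```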

> Andy's Joseki server can make the data objects available, although
> via a query, so this seems like a good candidate, although I note you
> prefer URLs to queries - see discussion later.

Yes, I far prefer URLs to queries for resources because that avoids 
exposing the lookup methodology.

> I think there are two reasons why getable / ungetable has been
> confusing. First, currently all the Artstor data we have is in this
> file, so it isn't necessary to look elsewhere for data, except perhaps
> to retrieve the schema. Second Haystack, when it sees a URL, tries to
> retrieve it automatically. So when the Haystack team loaded the Artstor
> data into Haystack, it used to produce a lot of 404 errors, because it
> tried to retrieve all the URLs.

Ok. Question: is that so bad? I mean, URIs might be designed to be 
gettable, but not be there yet... or contain stuff that Haystack could 
parse, but not understand. What is the default behaviour of Haystack 
when it encounters a 404? What is Haystack expecting? Or what would be 
better to serve when those resources are dereferenced?

> Assuming we stick with URLs then there are two ways to overcome this
> problem: a) put something behind the URL or b) change the URL to a
> URN, so Haystack doesn't try to retrieve anything. The revised data
> tries to take approach a).

I'm sorry, I can parse the above but I can't quite follow the logic 
behind it. Can you elaborate more on the alternatives you envision?

>> This means that if you want, say, the entire RDF schema, you access
>>
>>   http://whatever/category/schema/
>>
>> or if you want just a concept you can do
>>
>>   http://whatever/category/schema/concept
>
> People do seem to create RDF namespaces like this as well (I think DC,
> and definitely LOM seem to spread their schemas over a number of URLs)
> but it can make things ugly because
>
> - depending on the processing model used by the processor looking at
> your RDF, you may need to put a copy of the schema at every concept
> URL, not just at the schema URL

I'm not sure I follow here, either. If you are looking up the concept, 
you might just want the RDF Schema for that concept, maybe with all the 
RDF references that that concept builds upon or references. It doesn't 
have to be the entire infoset.

> - RDF tools and serializations vary in how much namespace abbreviation
> they can do, and certain namespace conventions work better with
> certain serializations than others (consider my email about the CIDOC
> vocabulary in N3). Using a namespace convention like this means that
> most tools will have to quote the entire namespace rather than being
> able to use a prefix, and personally I think that reduces readability.
> As I've noted before, I think abbreviating namespaces is important to
> make RDF readable, so for example as N3 is a bit better at this than
> RDF/XML this generally makes N3 a bit more readable.

Well, since the schema#concept idea is deeply ingrained into the RDF 
methodology and its main syntax, it's easy to expect such a thing from 
the other syntaxes. It is also true that an RDF syntax *needs* a 
separator between the schema and the concept... whether "/", "#" or 
"?" is better than the others is highly debatable, and this is 
probably not the right place for that discussion either.

keep in mind that an HTTP method like

  GET /schema#concept HTTP/1.1

is a valid HTTP request. I'm just not sure, for example, if the Servlet 
API exposes that at all... I've never tried, because no client sends 
the anchor with the request.
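You can see this client-side behaviour in any stock URL library: the fragment is split off before a request line is ever built, so the server never receives it. A quick stdlib illustration (nothing Servlet-specific here):

```python
# Fragments ("anchors") are a client-side concept: URL parsers split
# them off before a request is built, so the server never sees them.
from urllib.parse import urlsplit, urldefrag

uri = "http://whatever/category/schema#concept"

parts = urlsplit(uri)
path = parts.path          # the request line carries only this path
anchor = parts.fragment    # kept on the client, never sent

document, fragment = urldefrag(uri)  # what a GET actually dereferences
```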

> However, the point you are making is about really big schemas, so then
> these issues disappear because i) as you note, splitting the schema
> up makes it easier to use and ii) if it's that big we won't be
> loading it into a text editor to read it :)

Yep, the individual results would not be used for human consumption... 
but still, if you are concerned about the readability of the entire 
spec, you still have a problem.

>> note that I personally dislike the # notation that RDF/XML keeps
>> using so much exactly because anchors are not supposed to be driven
>> server side and prevent this kind of dynamic operation on the server
>> side. But the # notation is ingrained into the very nature of
>> RDF/XML and this is, IMO, really sad because it will turn out to be
>> a huge issue later on, especially when RDF schemas get bigger and
>> bigger.
>
> Yes, I understand it has been an issue of contention.
>
>> Anyway, as for implementation, RDF is highly relational (and that's,
>> IMO, the reason why the RDF/XML syntax feels so ugly) so it would be
>> straightforward to store it into a relational database (I
>> remember that
>> last year Eric Prud'hommeaux was working on defining those abstract
>> mappings).
>
> One of the key problems SIMILE is facing is dealing with
> heterogeneous, semi-structured data. So we could put this in a
> relational database, but it's awkward. One of the things I learnt
> early on in the project is that DSpace's use of a relational database
> is unusual in the library community - most library systems are based
> on hierarchical databases for this very reason.

Makes sense. Still, I thought you were in control of this RDF data, or 
at least, that you can manipulate it yourself.

Don't get me wrong, I'm a strong advocate of semi-structured 
repositories, and you know that, but I still think that RDF is a 
perfect candidate for relational technology... where general XML is 
definitely not.

> However we now have some other databases with similar properties to
> hierarchical databases e.g. object databases, semistructured
> databases, XML databases and finally persistent RDF models (Andy
> tells me the SW community prefer the term RDF knowledge bases - ugh).

well, I think that historically, this makes sense.

> I think if we got
> a firm of consultants in to solve the SIMILE problem, they would
> probably use something like Tamino - in fact a number of other
> projects similar to SIMILE such as TED or Gort are doing exactly this.

Can you explain the rationale behind this? RDF is XML, XML is a tree, 
so you need a tree-oriented database? Is that the syllogism in place?

> However
> one of the goals of SIMILE is to demonstrate the Semantic Web works,
> so we have to take the persistent RDF approach.

What do you mean by the "persistent RDF approach"?

> For example Jena uses JDBC to persist RDF models using databases like
> MySQL or Postgres, then Joseki makes those databases queryable via
> the web.

That's what I was thinking. I think it makes perfect sense to use 
relational technology for RDF, given its nature.
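To make the "RDF is highly relational" point concrete: a triple is just a three-column row, so a single table already supports the basic lookups a server would perform. This is only an illustration in SQLite, not Jena's actual JDBC schema (which has its own, more elaborate layout), and the URIs are hypothetical:

```python
# Minimal illustration of why RDF maps well onto relational storage:
# one subject/predicate/object table. A sketch only -- NOT Jena's real
# JDBC schema; URIs and prefixes are hypothetical examples.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("http://whatever/artstor/record1", "dc:title", "Vase"),
        ("http://whatever/artstor/record1", "dc:creator", "http://whatever/artstor/creator7"),
        ("http://whatever/artstor/creator7", "dc:title", "Unknown"),
    ],
)

def describe(subject):
    """All triples about one data object -- what a URL lookup might return."""
    rows = conn.execute(
        "SELECT predicate, object FROM triples WHERE subject = ?", (subject,)
    )
    return dict(rows.fetchall())

record = describe("http://whatever/artstor/record1")
```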

>> I highly recommend against this approach. If you want URIs to be
>> long lasting, you can't associate them with the semantics of
>> retrieval or you'll be stuck with it forever.
>>
>>   http://whatever/category/schema/concept
>>
>> is, IMHO, much more long-lasting than anything like
>>
>>   http://whatever/lookup-service?get="schema/concept"
>>
>> Concerns should be kept separate, even if this makes the job
>> harder. In my experience, keeping concerns separate *does* pay off
>> later on, resulting in a steeper curve in the beginning, but a nicer
>> plateau later.
>
> Although Joseki uses the latter, it's a simple matter to write a
> servlet so you can use the former, and then rewrite those queries and
> pass them on to Joseki, so does the distinction really matter?

As an implementation issue, no, obviously not. I also think that 
patching Joseki to do that would be so trivial as to be left as an 
exercise to the reader ;-)

But from a design point of view, since you are deciding to create a 
contract that, potentially, could last for a long time, I would suggest 
choosing something like

  http://web.mit.edu/simile/schema[#|/|?]concept

rather than

  http://hplb.hpl.hp.com/joseki/lookup?get="schema/concept"

In short, choose the URI scheme that is most likely to last the longest.
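The "servlet in front" translation between the two styles really is a one-liner. A sketch of it, with the caveat that the endpoint and the "get" parameter here simply echo the examples in this thread and are not Joseki's real API:

```python
# Sketch of the rewrite a front-end servlet could do: accept the
# long-lasting path-style URI and translate it to a query-style lookup
# URL internally. The endpoint and "get" parameter are hypothetical,
# echoing the examples in this thread, not Joseki's actual API.
from urllib.parse import quote, urlsplit

LOOKUP = "http://hplb.hpl.hp.com/joseki/lookup?get="

def rewrite(public_uri, prefix="/simile/"):
    """Map http://host/simile/schema/concept to an internal lookup URL."""
    path = urlsplit(public_uri).path
    if not path.startswith(prefix):
        raise ValueError("not a SIMILE URI: " + public_uri)
    return LOOKUP + quote(path[len(prefix):], safe="")

internal = rewrite("http://web.mit.edu/simile/schema/concept")
```

Done this way, the query-style URL never leaks into the published contract; only the path-style URI is what clients depend on.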

> Also we are talking about instance data rather than schemas here, so
> to help further discussion, here are the three possibilities:
>
> i) http://whatever/collection/dataobjecttype#dataobject
>
> This is the approach currently proposed. Note: the reason for
> including dataobjecttype is to generate unique URLs, rather than to
> place metadata in the URL, as this would be a bad thing.
>
> ii) http://whatever/collection/dataobjecttype/dataobject
>
> (Stefano's preference)
>
> iii)
> http://whatever/lookup-service?get=collection/dataobjecttype/dataobject
>
> (how you would query Joseki)
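
One mechanical difference between i) and ii) is worth spelling out: with the fragment style, every dataobject under one dataobjecttype dereferences to the same document, so a client fetches one (potentially huge) file per type; with the path style each object is a distinct retrievable resource. A small sketch, using the hypothetical URIs above:

```python
# Options i) vs ii): fragment URIs for one dataobjecttype all collapse
# to the same retrievable document, while path URIs stay distinct.
# URIs are the hypothetical examples from this thread.
from urllib.parse import urldefrag

fragment_style = [
    "http://whatever/collection/creator#creator7",
    "http://whatever/collection/creator#creator8",
]
path_style = [
    "http://whatever/collection/creator/creator7",
    "http://whatever/collection/creator/creator8",
]

fetched_i = {urldefrag(u)[0] for u in fragment_style}   # one document
fetched_ii = {urldefrag(u)[0] for u in path_style}      # two documents
```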

What do others think?

-- 
Stefano.


Received on Saturday, 6 December 2003 02:03:52 EST
