RE: Use of www-rdf-dspace for comments re: early draft note, DSpace History System

> this is okay as these issues seem to be
> more along the lines of "things that are
> broken" rather than "things that have changed,
> that we ought to be able to fix with the SW tools".

There are indeed several aspects of the current history system output that are "simply broken".  These are the primary motivation for the work described in [HSOW], hence the focus on approach (i) that you mention.  In some cases the underlying issue is missing or incomplete data, which no amount of inferencing would be able to address; these issues must be fixed in the machinery that generates the data.

The more general problem of using the SW stack to address changes that are introduced over time to schemas and/or the way that instances are produced is an interesting focus for SIMILE, but well beyond the scope of the work described in [HSOW].  Likewise the choices among processing models.

Rather than risk severe scope creep for the work underway, I suggest that we view these problems as a basis for Use Case(s) based upon the History System.  Perhaps we should update the History System Use Case in the SIMILE Research Drivers document [SRD] to reflect this?

- Mick

[HSOW] http://web.mit.edu/simile/www/resources/history-harmony/history-statement-of-work.htm
[SRD]  http://web.mit.edu/simile/www/resources/researchDrivers-0.27/



> -----Original Message-----
> From: Butler, Mark [mailto:Mark_Butler@hplb.hpl.hp.com] 
> Sent: Wednesday, May 07, 2003 12:22 PM
> To: 'SIMILE public list'
> Subject: Use of www-rdf-dspace for comments re: early draft 
> note, DSpace History System
> 
> 
> 
> These comments are much more general than the other comments, 
> so apologies for this in advance. I'm sure some of the 
> following points are controversial but hopefully they will 
> create further discussion. 
> 
> One of the promises of the semantic web is that "if person A 
> writes his data in one way, and person B writes her data in 
> another way, as long as they have both used semantic web 
> tools, then we can leverage those tools to merge data from A 
> and data from B declaratively i.e. without having to rewrite 
> the software used by A or by B, and without necessitating 
> them to change their individual data sets". Before the 
> semantic web, we could have used data A with data B, but it 
> would have necessitated some changes to the data and software 
> of one or both of the parties.
> 
> However in this proposal, rather than exploring the first 
> path, i.e. "we have a load of data in the history system 
> format. This was similar to Harmony and Dublin Core, but 
> since then those technologies have moved on. Let's see if 
> we can map between these different data formats by using 
> schema and ontology languages without changing any code", it 
> seems like we are taking the second by default, i.e. "we have 
> a load of data in the history system format but it's 
> incompatible with the latest versions of ABC and Dublin Core. 
> Let's rewrite the software that generates it so it complies 
> with their latest specifications". 
> 
> Now the problem with adopting this second approach is that we 
> aren't really demonstrating the utility of the semantic web. 
> Of course, the history system may be sufficiently broken that 
> it's just not possible to use the first approach. Alternatively 
> the SW tools available may not yet be sufficiently advanced 
> to support the first approach. However ideally I think we 
> ought to at least explore the alternatives that try to follow 
> approach one, assuming this is possible within the time constraints. 
> 
> So ideally I would like to see the descriptive note discuss 
> more alternative solutions and then evaluate those solutions. 
> At the moment it just describes a single solution. The 
> outcome of the document may still be the same, i.e. the 
> approach we use to solve the problem, but I think a bit more 
> of the thinking about how we arrived at this point could be 
> made explicit. 
> 
> I would like to illustrate this by concentrating on the use 
> of namespaces in the current DSpace history system. 
> If I understand the document correctly, one of the criticisms 
> made about the current DSpace History system is that it uses 
> eight different namespaces to refer to what are effectively 
> different classes. There are a number of reasons why this is
> undesirable:
> - the classes all belong to the same conceptualization, or to 
> use the jargon "maintain the same ontological commitment". 
> Therefore common practice is to use a common namespace to 
> indicate this. 
> - the document notes that if the history system was to use 
> certain well known schemas, e.g. Dublin Core and ABC, then it 
> is possible that processors might know something about those 
> schemas and be able to process this information.
> 
> However, a lot of articles that discuss why we need the 
> semantic web describe how the SW will allow things to work 
> together automagically. My guess is the enabling technology 
> for this is automated discovery, by which I mean some 
> mechanism that a processor can use to automatically configure 
> itself so it can process a document or model. So next I will 
> outline several different approaches for automated discovery 
> (or "processing models") that can be applied to RDF, and then 
> consider how they might be used to solve some of the issues 
> outlined in the document. 
> 
> (Schema Discovery via namespace processing model)
> 
> "The processor gets a piece of RDF, inspects the namespaces 
> and it tries to retrieve the schema from the namespace. If it 
> can retrieve the schema, it processes it and is able to map 
> the DSpace information into another schema that it is 
> familiar with e.g. Dublin Core."
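> 
> As a very rough sketch of what such a processor might do (rdflib 
> is just one possible RDF toolkit here, and the control flow is 
> illustrative rather than any standard algorithm):
> 
>     from rdflib import Graph
> 
>     def load_with_schemas(rdf_data):
>         """Parse incoming RDF/XML, then optimistically try to fetch
>         a schema from each HTTP namespace it declares."""
>         g = Graph()
>         g.parse(data=rdf_data, format="xml")
>         for _prefix, ns in g.namespaces():
>             if str(ns).startswith("http"):
>                 try:
>                     g.parse(str(ns))  # may 404, time out, or return HTML
>                 except Exception:
>                     pass              # the failure modes discussed below
>         # g now holds the instance data plus whatever schemas were
>         # retrievable, which a schema-aware processor could use to map
>         # terms to e.g. Dublin Core.
>         return g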
> 
> Now I have a lot of sympathy with the processing model (PM) 
> above, but in fact it turns out this PM is quite 
> controversial because namespaces do not indicate schemas. 
> Just because a piece of RDF defines a namespace with a URI 
> that uses HTTP, this doesn't mean HTTP can be used to 
> retrieve an RDF schema that gives you more information about 
> that RDF. This is because 
> 
> - RDF does not formally require this. We could overcome this 
> by formally requiring it for our RDF application (by 
> application I mean a usage of RDF, rather than a piece of 
> software) but how does the generalised processor know we've done this?
> - if there is nothing at that URI, the only way the processor 
> will determine this is via a timeout, which will cause any 
> requests that are invoking the processor to fail as well
> - it is not clear which resource you should have at the HTTP 
> address e.g. an XML Schema, an RDF Schema, an OWL ontology etc. 
> 
> See [2] and [3] for related discussions. 
> 
> I observe there is some disagreement amongst the SW community 
> on this e.g. in conversation with Tim Berners-Lee it seems 
> his implicit assumption is that a namespace should point to a 
> schema whereas I remember Dan Connolly expressing the opinion 
> that RDF must be processable as is, without the schema. 
> Furthermore it seems to me the recent RDFcore datatyping 
> decision that datatypes must be declared explicitly in the 
> RDF instance data, rather than defined in the associated RDF 
> schema, was arrived at from the latter viewpoint. There have 
> also been proposals about how to overcome this e.g. 
> 
> - use URNs if a namespace just indicates identity, whereas 
> use HTTP if it indicates identity and points to additional resources. 
> - the RDF graph could define "processing instructions" that 
> indicate how to process it. CC/PP does this for some things 
> but not all as it provides a method for subgraph inclusion 
> called defaults. Of course, applications defining their own 
> "processing instructions" would not be sufficient, as these 
> processing instructions would need to be standardised in 
> order to support automated discovery. 
> 
> Let's try to concretise this with some other processing models:
> 
> (Manifest discovery via namespace processing model)
> 
> "The processor receives a piece of RDF and it inspects the 
> namespace used. The processor also knows what data languages* 
> it supports e.g. RDF Schema, OWL, XForms, XML Schema, XSLT 
> etc. It tries to retrieve information from the HTTP address 
> indicated by the namespace, performing a content negotiation 
> so that it retrieves all the resources that it can process. 
> This solves the problem of needing to know what type of resource 
> should be at the namespace URI as the processor retrieves any 
> that are useful to it. The processor then uses these 
> resources to try to help process the RDF."
> 
> (* data languages probably isn't the best term here. This is 
> similar to Rick Jelliffe's proposal in [3])
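> 
> To make the content negotiation step concrete, a processor 
> following this PM might issue something like the request below (a 
> sketch only; the media types listed are illustrative choices, not 
> a standard set):
> 
>     import urllib.request
> 
>     # Media types this particular processor claims to understand.
>     SUPPORTED = [
>         "application/rdf+xml",   # RDF Schema / OWL
>         "application/xslt+xml",  # XSLT
>         "application/xml",       # XML Schema, XForms, ...
>     ]
> 
>     def fetch_from_namespace(namespace_uri):
>         """Dereference a namespace URI, advertising what we can process."""
>         req = urllib.request.Request(
>             namespace_uri,
>             headers={"Accept": ", ".join(SUPPORTED)},
>         )
>         with urllib.request.urlopen(req, timeout=10) as resp:
>             # Dispatch on the returned Content-Type: schema, stylesheet, etc.
>             return resp.headers.get("Content-Type"), resp.read()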
> 
> (An aside: This processing model is probably a bit 
> controversial as it admits XML based languages to the SW 
> stack, and the SW folks often argue that we need to replace 
> all the XML in the world with RDF. I disagree with this, 
> especially as currently we have a bunch of tools that are 
> useful that use XML and a bunch of tools that use RDF. 
> Re-engineering all the XML tools to be written in RDF will 
> take years, so let's see if we can tweak them so they work together.) 
> 
> (Schema discovery via processing instruction processing model)
> 
> "the processor receives a piece of RDF, and it inspects the 
> RDF model for RDF statements using the swpm (semantic web 
> processing model) namespace. These statements give processing 
> instructions about how to process the model. The 
> processor follows these instructions e.g. retrieves the 
> relevant schemas. It then uses this information to process 
> the RDF as outlined in the previous PMs".
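> 
> A sketch of what following such instructions could look like; the 
> swpm namespace URI and the swpm:useSchema term below are made up 
> purely for illustration:
> 
>     from rdflib import Graph, Namespace
> 
>     SWPM = Namespace("http://example.org/swpm#")  # hypothetical namespace
> 
>     def apply_processing_instructions(g):
>         """Follow hypothetical swpm:useSchema statements in the graph."""
>         for _subj, schema_uri in g.subject_objects(SWPM.useSchema):
>             try:
>                 g.parse(str(schema_uri))  # load the schema the data points at
>             except Exception:
>                 pass
>         return g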
> 
> (Two asides: first as is probably becoming obvious now, there 
> is great potential for these processing models to be inter-mixed. 
> 
> Second we can use the processing instruction processing model 
> and manifests to leverage XML for the SW, as if we just add 
> processing instructions to XML we can keep our data in XML, 
> but the processing instruction points at a manifest that 
> includes an XSLT stylesheet that converts the XML to RDF/XML, 
> so the data is now SW compatible. Via manifests, it can also 
> retrieve a large bunch of other resources.)
> 
> (Schema discovery via namespace with transport dependence 
> processing model)
> 
> "when the processor receives a piece of RDF, it inspects the 
> namespace used. If the namespace starts with HTTP, this 
> indicates a resource is retrievable from that address. If it 
> starts with another transport, e.g. URN, then it regards the 
> namespace as simply defining identity. In the event of a 
> retrievable resource it retrieves it and uses it to process 
> as necessary"
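> 
> Or, as a minimal sketch of the transport-dependent policy (the 
> scheme test is the whole point; everything else is illustrative):
> 
>     from urllib.parse import urlparse
> 
>     def maybe_fetch(g, namespace_uri):
>         """Fetch a schema into the rdflib Graph g only when the URI
>         scheme says it is retrievable."""
>         scheme = urlparse(namespace_uri).scheme
>         if scheme in ("http", "https"):
>             g.parse(namespace_uri)  # retrievable: merge the schema into g
>         elif scheme == "urn":
>             pass                    # identity only: nothing to dereference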
> 
> Other reasons why we need automated discovery
> 
> One of Tim Berners-Lee's dictums for good design on the web 
> is "good URIs don't change". This causes problems. Let's say 
> I create a schema but it's wrong, e.g. it's not compliant with a 
> "clarification" in RDF. However I've published it, so I can't 
> fix it because I can't change the contents of the URI. So 
> what are my options? I can republish all my data and schema 
> so they are correct, using a new namespace. Alternatively I can 
> just say to the people with the RDF processors "it's your 
> problem, you deal with it". Consider another problem: let's 
> say a new format comes along which ends up dominating 
> the user base. I may want to add information to my schema 
> that explains how to map my data to that format. However I 
> can't get at the schema of the new format (because I don't 
> own it), and I can't change the contents of the URI to change 
> my schema. In light of these issues, perhaps this advice is 
> right? However there is a whole host of issues here that are 
> probably beyond the scope of this document. The point is, say 
> we fix the system as we outline, but then a new version of 
> Dublin Core or ABC is released. Do we have to recode the 
> history system again? At the moment, yes because we can't add 
> additional data to the schema once we've created it. Due to 
> the "good URIs don't change" advice, it's now cast in stone. 
> This is why we need to consider the other processing models. 
> Another alternative is to use 
> dereferencing URIs as PURL does:
> 
> (Schema discovery via dereferenced namespace)
> 
> The processor receives a piece of RDF, and inspects the 
> namespace used. It resolves this namespace via an 
> intermediate server that stores the dereferences. The server 
> could be identified via the namespace, e.g. as in PURL, or some 
> other approach could be used. The dereference points to a 
> particular schema, optionally on another server. This server 
> could contain several dated versions of the schema, but the 
> dereference just points to the most up to date one. 
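> 
> Sketched in code (the PURL-style indirection and the redirect 
> behaviour here are assumptions for illustration, not a description 
> of any deployed service):
> 
>     import urllib.request
>     from rdflib import Graph
> 
>     def load_via_dereference(namespace_uri):
>         """Resolve a PURL-like indirection, then load whatever schema
>         it currently points at."""
>         # urlopen follows the redirect issued by the intermediate
>         # server, so the published namespace URI never changes even
>         # though the schema it resolves to can be updated.
>         with urllib.request.urlopen(namespace_uri, timeout=10) as resp:
>             final_url = resp.geturl()  # the current (possibly dated) schema
>         g = Graph()
>         g.parse(final_url)
>         return g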
> 
> Then if we want to update the schema so it has additional 
> information that maps it onto a newly released version of 
> Dublin Core, we can do so because the contents of URIs never 
> change, but the contents of the dereferenced URIs do. Or to 
> put it another way, I think TBL's dictum is too draconian: we 
> may have URIs on the web that change and those that don't, we 
> just need an explicit way of distinguishing between those two 
> types of URIs. 
> 
> OWL's processing model
> 
> In OWL, the processor loads an OWL ontology that can use 
> includes to load other OWL ontologies and it then has data 
> about those ontologies. But that's it, there's no way to 
> automatically load ontologies on demand, it has to be 
> explicitly configured. Now I may be wrong here as I'm not an 
> expert on OWL, but my guess is this design decision is 
> deliberate because you can't just combine ontologies 
> arbitrarily, you need to do consistency checks first. 
> Typically this is done at ontology creation time (see OilEd) 
> as there is a large processing overhead associated with this. 
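> 
> For comparison, that explicit, author-controlled inclusion looks 
> roughly like this (the ontology URIs in the fragment are 
> placeholders):
> 
>     from rdflib import Graph
> 
>     ONTOLOGY = """
>     @prefix owl: <http://www.w3.org/2002/07/owl#> .
>     <http://example.org/history-ontology>
>         a owl:Ontology ;
>         owl:imports <http://example.org/abc-ontology> .
>     """
> 
>     g = Graph()
>     g.parse(data=ONTOLOGY, format="turtle")
>     # Nothing is fetched automatically here: an OWL tool decides when
>     # (and whether) to load the imported ontology, typically only after
>     # consistency checking.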
> 
> Now of course in RDF you don't need to do these consistency 
> checks prior to combination because the model theory avoids 
> inconsistencies. Of course OWL may change in the future, but 
> this is another processing model. In fact, it's the model I 
> use in DELI, because we found that most people publishing RDF 
> schemas just got them totally wrong, and the people producing 
> instance data just seemed to make up namespaces as they went 
> along, so instead we loaded all the information we needed up 
> front, and also defined some equivalences so we could deal 
> with the most commonly encountered mistakes in the instance data e.g.
> 
> (start-up schema load processing model)
> 
> "The processor loads a set of schemas at start-up time. When 
> it receives RDF, it makes a best attempt to process it. If it 
> recognises it via the startup schemas, it processes it. If 
> not, it tries to process it, but at the end of the day, if the 
> schema is not recognised, responsibility passes to the 
> application sitting on the processor. However it is fairly 
> easy to reconfigure the processor to deal with new schemas, 
> it's just a matter of changing some kind of configuration 
> script. This allows whoever is configuring the processor to 
> do some kind of "quality control" on the schemas."
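> 
> A sketch of such a set-up (the schema file names stand in for 
> whatever the configuration script points at):
> 
>     from rdflib import Graph
> 
>     # Edited by whoever configures the processor; this is where the
>     # "quality control" on schemas happens.
>     STARTUP_SCHEMAS = [
>         "schemas/dublin-core.rdf",
>         "schemas/abc-harmony.rdf",
>         "schemas/history-update.rdf",
>     ]
> 
>     def build_processor_graph():
>         """Load the trusted schemas once, at start-up."""
>         g = Graph()
>         for path in STARTUP_SCHEMAS:
>             g.parse(path)
>         return g
> 
>     def handle_incoming(g, rdf_data):
>         """Best-effort processing of incoming RDF against the start-up schemas."""
>         g.parse(data=rdf_data, format="xml")
>         # Anything not covered by the start-up schemas is left to the
>         # application sitting on top of the processor.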
> 
> Okay, so I've proposed a lot of ideas here. So how does this 
> map back onto the history document? Well we can solve the 
> "usage of external schemas", "duplicate properties", "usage 
> of outdated harmony properties" issues in a number of ways:
> 
> i) we modify the code to change the namespace to the official 
> DC and ABC namespaces and to use the updated harmony 
> properties i.e. the approach proposed in the document. 
> 
> ii) add a processing instruction to the RDF generated by the 
> history system. Of course the processing instructions need to 
> be standardised, but that's a side-issue. This processing 
> instruction points at a piece of RDFS or OWL that resolves 
> the three issues above. Let's call this the "update schema". 
> 
> iii) the processor could look up any of the namespaces used 
> in a "schema namespace server". This server would know that 
> these namespaces are defined in the "update schema", so it 
> returns that to the processor. 
> 
> iv) the processor uses start-up schema loading, so we just 
> make the "update schema" available and it is then the 
> responsibility of the person configuring the processor to add 
> that schema to the start-up configuration.
> 
> So the history system document has decided to go with 
> approach i). I think with approaches ii), iii), and iv) there 
> are two questions we can ask:
> 
> a) is RDFS or OWL sufficiently rich so that we can solve the 
> "usage of external schemas", "duplicate properties" and 
> "usage of outdated harmony properties" issues? 
> 
> My guess is OWL can probably do the first two, although with 
> RDFS it is harder as RDFS cannot define equivalences, only 
> subclasses and subproperties. Arguably these are not the 
> same, as they are not symmetric. I'm not so sure about what 
> the outdated harmony properties involve though, so I can't 
> make a call on whether this can be solved with OWL or not. 
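> 
> (To show the flavour of what such an "update schema" might contain, 
> here is a small sketch using owl:equivalentProperty and 
> owl:equivalentClass; the old-namespace and ABC URIs are 
> placeholders, not the real history-system ones.)
> 
>     from rdflib import Graph, Namespace, URIRef
>     from rdflib.namespace import DC, OWL
> 
>     OLD = Namespace("http://example.org/old-history-schema#")  # placeholder
> 
>     update_schema = Graph()
>     # Map a home-grown property/class onto the "official" vocabularies.
>     update_schema.add((OLD.title, OWL.equivalentProperty, DC.title))
>     update_schema.add((OLD.Event, OWL.equivalentClass,
>                        URIRef("http://example.org/abc#Event")))  # placeholder
> 
>     print(update_schema.serialize(format="turtle"))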
> 
> b) assuming we can map between the data formats 
> declaratively, what are the pros and cons of approaches i), 
> ii), iii) and iv)? As a result of this, which is the best approach?
> 
> (I guess this is a general question for the RDF community). 
> 
> However, this leaves us with seven other issues (lack of type 
> information, empty or missing properties, expressions of 
> qualified properties, relationships expressed using local 
> identifiers, usage of local URIs, formatted text in property 
> values, and references to non-existent states) that it is not 
> possible to solve this way, but this is okay as these issues 
> seem to be more along the lines of "things that are broken" 
> rather than "things that have changed, that we ought to be 
> able to fix with the SW tools". 
> 
> [2] 
> http://www.intertwingly.net/stories/2002/09/09/gentleIntroduct
ionToNamespace
s.html
[3] http://www.xml.com/pub/a/2001/01/10/rddl.html 

> Dr Mark H. Butler
> Research Scientist                HP Labs Bristol
> mark-h_butler@hp.com
> Internet: http://www-uk.hpl.hp.com/people/marbut/

Received on Wednesday, 7 May 2003 13:56:35 UTC