- From: Seaborne, Andy <Andy_Seaborne@HPLB.HPL.HP.COM>
- Date: Fri, 9 May 2003 15:30:13 +0100
- To: "Butler, Mark" <Mark_Butler@HPLB.HPL.HP.COM>, "'SIMILE public list'" <www-rdf-dspace@w3.org>
Mark,

One of the things that any web system has to cope with is the fact that the web (semantic or otherwise) is sufficiently large that it doesn't all work at once. Any system that depends on others also has to take into account that there will be temporary problems. So the ground rules are: ontologies may not be perfect; connectivity is not guaranteed; systems do the best they can in the circumstances. There is value in working with information from other systems, but it has implications. Or, to put it into N3:

<> rdfs:seeAlso <http://lists.w3.org/Archives/Public/www-rdf-interest/2003May/0026.html> .

In RDF, lack of a schema for a namespace does not stop the system doing anything with it. Example:

@prefix x: <http://example.org/WebLibraryTerms#> .

<http://dspace.org/>
    rdf:type x:LibrarySite ;
    x:administrator "Mick" .

and the WebLibraryTerms namespace isn't one that a system knows about. What does it do? It can decide to read from the namespace URL; it can choose not to. While good style says that the schema should be available at the place the namespace indicates, it may not be.

So when asked the query 'what are the websites run by "Mick"?', the system will return nothing. It does not know that

x:LibrarySite rdfs:subClassOf knownSchema:WebSite .

unless it reads in the schema. The answers to queries on the semantic web aren't "yes" and "no", they are "yes" and "don't know". There is no global processing model or global consistency. There are local decisions on what to do about things.

Maybe some community using SIMILE does know something about WebLibraryTerms: it can ask for everything known about http://dspace.org/ and the server can ship it useful stuff. The fact that the server does not fully understand all the implications of the data isn't important. Later, the community can ask the SIMILE system to install WebLibraryTerms so they can do their searches on the server side - if the system has not automatically read it already, or logged the fact that an unknown namespace was used a lot so that the admins have already decided to get it.

[[If I visit a website today, and get a webpage, and the image logo for "graded good by SomeOrg-I-trust" does not display properly, then I don't know that fact. The page is still useful even though I don't know it is graded good.]]

So a key question: how often do schemas change? How often do unknown schemas turn up? If a new schema arises and is important to some community of SIMILE users, they ask the system to use that schema. It may not happen immediately and it may involve a person doing some configuration, but it does deliver useful value to people in the presence of a less than perfect global semantic web.

Specifically for the history store: I would expect it to have a site cache of schemas/ontologies, indexed by namespace. Boringly practical - if some schemas are deemed "bad" (unhelpful) at a site, they just don't use them. A cache is prudent because even if the schema does reside at its namespace URL, using HTTP, it may be unreachable just at the moment it is needed. Schemas are slow-changing things, so using a cached copy - even one fixed up if the master copy is trivially broken (a local decision again) - is reasonable. The use of the cache is a local choice. I would like it to read in any namespaces it encounters, but it isn't necessary that it do so.

On the query side, there will be a few (2? 3?) key schemas that the history store has to deal with and is tested against. On the data input side, the validation process can fetch new schemas as encountered.
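[[A minimal sketch, in Python, of the behaviour described above. Every name in it (fetch_schema, knownSchema:WebSite, the shape of the cache) is made up for illustration; it is not SIMILE or history-store code. It shows the answer to the query going from "don't know" to "yes" once a cached or fetched schema has been read in.]]

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Instance data from the example above, as (subject, predicate, object) tuples.
data = {
    ("http://dspace.org/", RDF_TYPE, "x:LibrarySite"),
    ("http://dspace.org/", "x:administrator", "Mick"),
}

# Site cache of schemas, indexed by namespace (a local, boringly practical
# decision: entries can be pre-installed, fetched on demand, or fixed up).
schema_cache = {}

def fetch_schema(namespace):
    # Stand-in for an HTTP GET with a short timeout; the real thing must
    # cope with the schema being absent or unreachable.
    if namespace == "http://example.org/WebLibraryTerms#":
        return {("x:LibrarySite", SUBCLASS, "knownSchema:WebSite")}
    return set()   # nothing known - "don't know", not an error

def websites_run_by(admin, triples):
    # Which classes are (transitively) subclasses of knownSchema:WebSite,
    # given only the triples we have actually read in?
    website_classes = {"knownSchema:WebSite"}
    changed = True
    while changed:
        changed = False
        for s, p, o in triples:
            if p == SUBCLASS and o in website_classes and s not in website_classes:
                website_classes.add(s)
                changed = True
    return {s for s, p, o in triples
            if p == RDF_TYPE and o in website_classes
            and (s, "x:administrator", admin) in triples}

# Without the schema the answer is empty: "don't know", not "no".
print(websites_run_by("Mick", data))                     # set()

# Local decision: read the unknown namespace and cache it.
ns = "http://example.org/WebLibraryTerms#"
schema_cache.setdefault(ns, fetch_schema(ns))
print(websites_run_by("Mick", data | schema_cache[ns]))  # {'http://dspace.org/'}

Whether to call fetch_schema at all, and whether to keep or repair what it returns, are exactly the local decisions described above.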
We are not in a fully magic world. If a new schema is encountered, say it has a property foo:articleTitle that is equivalent to dc:title, then until the system uses a rule that these are equivalent it treats them as different. Is foo:articleTitle really, truly, exactly equivalent to dc:title? It depends. It depends on who is asking, and it depends on what they want the information for. A good system admits these alternatives and does the best it can in the circumstances.

In many ways, I don't see this as specific to the semantic web. It is a consequence of being a large federation, not a centrally managed system.

Because SIMILE has an ingest and validation process, I hope that it is seen by other systems as a high quality source of information. If it is perceived as such it will get used; if it is not seen as such, it will not get used.

Andy

-----Original Message-----
From: Butler, Mark [mailto:Mark_Butler@hplb.hpl.hp.com]
Sent: 7 May 2003 17:22
To: 'SIMILE public list'
Subject: Use of www-rdf-dspace for comments re: early draft note, DSpace History System

These comments are much more general than the other comments, so apologies for this in advance. I'm sure some of the following points are controversial but hopefully they will create further discussion.

One of the promises of the semantic web is that "if person A writes his data in one way, and person B writes her data in another way, as long as they have both used semantic web tools, then we can leverage those tools to merge data from A and data from B declaratively, i.e. without having to rewrite the software used by A or by B, and without necessitating them to change their individual data sets". Before the semantic web, we could have used data A with data B, but it would have necessitated some changes to the data and software of one or both of the parties.

However in this proposal, it seems that rather than exploring the first path, i.e. "we have a load of data in the history system format. This was similar to Harmony and Dublin Core, but since then those technologies have moved on. Let's see if we can map between these different data formats by using schema and ontology languages, without changing any code", we are taking the second by default, i.e. "we have a load of data in the history system format but it's incompatible with the latest versions of ABC and Dublin Core. Let's rewrite the software that generates it so it complies with their latest specifications".

The problem with adopting this second approach is that we aren't really demonstrating the utility of the semantic web. Now the history system may be sufficiently broken that it's just not possible to use the first approach. Alternatively the SW tools available may not yet be sufficiently advanced to support the first approach. However ideally I think we ought to at least explore the alternatives that try to follow approach one, assuming this is possible within the time constraints.

So ideally I would like to see the descriptive note discuss more alternative solutions and then evaluate those solutions. At the moment it just describes a single solution. The outcome of the document may still be the same, i.e. the approach we use to solve the problem, but I think a bit more of the thinking about how we arrived at this point could be made explicit. I would like to illustrate this by concentrating on the use of namespaces in the current DSpace history system.
If I understand the document correctly, one of the criticisms made of the current DSpace history system is that it uses eight different namespaces to refer to what are effectively different classes. There are a number of reasons why this is undesirable:

- the classes all belong to the same conceptualization, or to use the jargon "maintain the same ontological commitment". Therefore common practice is to use a common namespace to indicate this.

- the document notes that if the history system was to use certain well known schemas, e.g. Dublin Core and ABC, then it is possible that processors might know something about those schemas and be able to process this information.

However, a lot of articles that discuss why we need the semantic web describe how the SW will allow things to work together automagically. My guess is the enabling technology for this is automated discovery, by which I mean some mechanism that a processor can use to automatically configure itself so it can process a document or model. So next I will outline several different approaches to automated discovery (or "processing models") that can be applied to RDF, and then consider how they might be used to solve some of the issues outlined in the document.

(Schema discovery via namespace processing model)

"The processor gets a piece of RDF, inspects the namespaces and tries to retrieve the schema from the namespace. If it can retrieve the schema, it processes it and is able to map the DSpace information into another schema that it is familiar with, e.g. Dublin Core."

Now I have a lot of sympathy with the processing model (PM) above (a small sketch of it appears below), but in fact it turns out to be quite controversial, because namespaces do not indicate schemas. Just because a piece of RDF defines a namespace with a URI that uses HTTP, this doesn't mean HTTP can be used to retrieve an RDF schema that gives you more information about that RDF. This is because:

- RDF does not formally require this. We could overcome this by formally requiring it for our RDF application (by application I mean a usage of RDF, rather than a piece of software), but how does the generalised processor know we've done this?

- if there is nothing at that URI, the only way the processor will determine this is via a timeout, which will cause any requests that are invoking the processor to fail also.

- it is not clear which resource you should have at the HTTP address, e.g. an XML Schema, an RDF Schema, an OWL ontology etc. See [2] and [3] for related discussions.

I observe there is some disagreement amongst the SW community on this: e.g. in conversation with Tim Berners-Lee it seems his implicit assumption is that a namespace should point to a schema, whereas I remember Dan Connolly expressing the opinion that RDF must be processable as is, without the schema. Furthermore it seems to me the recent RDF Core datatyping decision - that datatypes must be declared explicitly in the RDF instance data, rather than defined in the associated RDF schema - was arrived at from the latter viewpoint.

There have also been proposals about how to overcome this, e.g.:

- use URNs if a namespace just indicates identity, whereas use HTTP if it indicates identity and points to additional resources.

- the RDF graph could define "processing instructions" that indicate how to process it. CC/PP does this for some things but not all, as it provides a method for subgraph inclusion called defaults.
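(An aside: to make the failure modes concrete, the sketch below shows this first processing model in Python. The list of namespaces is assumed to have been extracted from the RDF already, and none of the names come from any real processor.)

import urllib.error
import urllib.request

def discover_schema(namespace, timeout=2):
    # Naive "schema discovery via namespace": try to dereference the
    # namespace URI over HTTP and hope something useful is there.
    if not namespace.startswith("http"):
        return None                      # identity only, nothing retrievable
    try:
        with urllib.request.urlopen(namespace, timeout=timeout) as response:
            content_type = response.headers.get("Content-Type", "unknown")
            body = response.read()
    except (urllib.error.URLError, OSError):
        # Nothing at the URI, or it is unreachable: the only way we find
        # out is by waiting, which stalls whoever invoked the processor.
        return None
    # Even on success we do not know what we have retrieved - an XML Schema,
    # an RDF Schema, an OWL ontology, or just an HTML page about the
    # vocabulary - so the caller still has to sniff content_type and body.
    return content_type, body

# e.g. for the namespaces found in a piece of history-system RDF:
for ns in ["http://purl.org/dc/elements/1.1/",
           "http://example.org/WebLibraryTerms#"]:
    result = discover_schema(ns)
    print(ns, "->", "nothing usable" if result is None else "retrieved " + result[0])

Everything that can go wrong here - the timeout, the empty URI, the wrong kind of resource coming back - is one of the objections listed above.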
Of course, applications defining their own "processing instructions" would not be sufficient, as these processing instructions would need to be standardised in order to support automated discovery. Let's try to concretise this with some other processing models:

(Manifest discovery via namespace processing model)

"The processor receives a piece of RDF and inspects the namespaces used. The processor also knows what data languages* it supports, e.g. RDF Schema, OWL, XForms, XML Schema, XSLT etc. It tries to retrieve information from the HTTP address indicated by the namespace, performing content negotiation so that it retrieves all the resources that it can process. This solves the problem of needing to know what type of resource should be at the namespace URI, as the processor retrieves any that are useful to it. The processor then uses these resources to try to help process the RDF."

(* "data languages" probably isn't the best term here. This is similar to Rick Jelliffe's proposal in [3].)

(An aside: this processing model is probably a bit controversial as it admits XML based languages to the SW stack, and the SW folks often argue that we need to replace all the XML in the world with RDF. I disagree with this, especially as currently we have a bunch of useful tools that use XML and a bunch of tools that use RDF. Re-engineering all the XML tools to be written in RDF will take years, so let's see if we can tweak them so they work together.)

(Schema discovery via processing instruction processing model)

"The processor receives a piece of RDF, and inspects the RDF model for statements using the swpm (semantic web processing model) namespace. These statements give processing instructions about how to process the model. The processor follows these instructions, e.g. retrieves the relevant schemas. It then uses this information to process the RDF as outlined in the previous PMs."

(Two asides: first, as is probably becoming obvious now, there is great potential for these processing models to be inter-mixed. Second, we can use the processing instruction processing model and manifests to leverage XML for the SW: if we just add processing instructions to XML we can keep our data in XML, but the processing instruction points at a manifest that includes an XSLT stylesheet that converts the XML to RDF/XML, so the data is now SW compatible. Via manifests, it can also retrieve a large bunch of other resources.)

(Schema discovery via namespace with transport dependence processing model)

"When the processor receives a piece of RDF, it inspects the namespaces used. If a namespace starts with HTTP, this indicates a resource is retrievable from that address. If it starts with another transport, e.g. URN, then the processor regards the namespace as simply defining identity. In the event of a retrievable resource it retrieves it and uses it to process as necessary."

Other reasons why we need automated discovery

One of Tim Berners-Lee's dictums for good design on the web is "good URIs don't change". This causes problems. Let's say I create a schema but it's wrong, e.g. it's not compliant with a "clarification" in RDF. However I've published it, so I can't fix it because I can't change the contents of the URI. So what are my options? I can republish all my data and schema so it is correct using a new namespace. Alternatively I can just say to the people with the RDF processors "it's your problem, you deal with it".

Consider another problem.
Let's say a new format comes along which ends up dominating the user base. I may want to add information to my schema that explains how to map my data to that format. However I can't get at the schema of the new format (because I don't own it), and I can't change the contents of my URI to change my schema. In light of these issues, is this advice right? There is a whole host of issues here that is probably beyond the scope of this document.

The point is, say we fix the system as we outline, but then a new version of Dublin Core or ABC is released. Do we have to recode the history system again? At the moment, yes, because we can't add additional data to the schema once we've created it. Due to the "good URIs don't change" advice, it's now cast in stone. This is why we need to consider the other processing models. Another alternative is to use dereferencing URIs as PURL does:

(Schema discovery via dereferenced namespace)

"The processor receives a piece of RDF, and inspects the namespaces used. It queries each namespace against an intermediate server that stores the dereferences. The server could be identified via the namespace, e.g. as in PURL, or some other approach could be used. The dereference points to a particular schema, optionally on another server. This server could contain several dated versions of the schema, but the dereference just points to the most up to date one."

Then if we want to update the schema so it has additional information that maps it onto a newly released version of Dublin Core, we can do so, because the contents of the URIs never change, but the contents of the dereferenced URIs do. Or to put it another way, I think TBL's dictum is too draconian: we may have URIs on the web that change and those that don't, we just need an explicit way of distinguishing between those two types of URIs.

OWL's processing model

In OWL, the processor loads an OWL ontology that can use includes to load other OWL ontologies, and it then has data about those ontologies. But that's it: there's no way to automatically load ontologies on demand, it has to be explicitly configured. Now I may be wrong here as I'm not an expert on OWL, but my guess is this design decision is deliberate, because you can't just combine ontologies arbitrarily, you need to do consistency checks first. Typically this is done at ontology creation time (see OilEd) as there is a large processing overhead associated with it. Of course in RDF you don't need to do these consistency checks prior to combination because the model theory avoids inconsistencies. OWL may change in the future, but this is another processing model.

In fact, it's the model I use in DELI, because we found that most people publishing RDF schemas just got them totally wrong, and the people producing instance data just seemed to make up namespaces as they went along. So instead we loaded all the information we needed up front, and also defined some equivalences so we could deal with the most commonly encountered mistakes in the instance data, e.g.:

(Start-up schema load processing model)

"The processor loads a set of schemas at start-up time. When it receives RDF, it makes a best attempt to process it. If it recognises it via the start-up schemas, it processes it. If not, it tries to process it, but at the end of the day, if the schema is not recognised, responsibility passes to the application sitting on the processor. However it is fairly easy to reconfigure the processor to deal with new schemas, it's just a matter of changing some kind of configuration script. This allows whoever is configuring the processor to do some kind of "quality control" on the schemas."
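(An aside: a minimal sketch of this start-up model in Python. The configuration layout, the equivalence table and the class name are invented for illustration; this is not DELI's actual code.)

# Hypothetical start-up configuration: which schemas to pre-load, plus
# equivalences for the mistakes most often seen in instance data.
CONFIG = {
    "schemas": {
        "http://purl.org/dc/elements/1.1/": "cache/dublin-core.rdfs",
        "http://example.org/history-update#": "cache/update-schema.rdfs",
    },
    "equivalences": {
        # commonly seen wrong/legacy property -> property we actually use
        "http://example.org/foo#articleTitle": "http://purl.org/dc/elements/1.1/title",
    },
}

class StartupProcessor:
    def __init__(self, config):
        # Record the configured schemas once, at start-up (a real processor
        # would parse the cached files); adding a schema is a config edit,
        # not a code change.
        self.known_namespaces = set(config["schemas"])
        self.equivalences = config["equivalences"]

    def process(self, triples):
        recognised, unrecognised = [], []
        for s, p, o in triples:
            p = self.equivalences.get(p, p)          # apply local fix-ups
            if any(p.startswith(ns) for ns in self.known_namespaces):
                recognised.append((s, p, o))
            else:
                unrecognised.append((s, p, o))       # best effort only
        # Responsibility for the unrecognised part passes to the application.
        return recognised, unrecognised

proc = StartupProcessor(CONFIG)
ok, unknown = proc.process([
    ("http://dspace.org/item/1", "http://example.org/foo#articleTitle", "Some title"),
    ("http://dspace.org/item/1", "http://made-up.example/ns#weird", "???"),
])
print(len(ok), "recognised,", len(unknown), "handed back to the application")

The "quality control" is then whatever review happens before a schema or fix-up is added to the configuration.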
Okay, so I've proposed a lot of ideas here. So how does this map back onto the history document? Well, we can solve the "usage of external schemas", "duplicate properties" and "usage of outdated Harmony properties" issues in a number of ways:

i) we modify the code to change the namespaces to the official DC and ABC namespaces and to use the updated Harmony properties, i.e. the approach proposed in the document.

ii) we add a processing instruction to the RDF generated by the history system. Of course the processing instructions need to be standardised, but that's a side-issue. This processing instruction points at a piece of RDFS or OWL that resolves the three issues above. Let's call this the "update schema".

iii) the processor could look up any of the namespaces used in a "schema namespace server". This server would know that these namespaces are defined in the "update schema", so it returns that to the processor.

iv) the processor uses start-up schema loading, so we just make the "update schema" available and it is then the responsibility of the person configuring the processor to add that schema to the start-up configuration.

So the history system document has decided to go with approach i). I think with approaches ii), iii) and iv) there are two questions we can ask:

a) is RDFS or OWL sufficiently rich that we can solve the "usage of external schemas", "duplicate properties" and "usage of outdated Harmony properties" issues? My guess is OWL can probably do the first two, although with RDFS it is harder, as RDFS cannot define equivalences, only subclasses and subproperties. Arguably these are not the same, as they are not symmetric. I'm not so sure about what the outdated Harmony properties involve though, so I can't make a call on whether this can be solved with OWL or not.

b) assuming we can map between the data formats declaratively, what are the pros and cons of approaches i), ii), iii) and iv)? As a result of this, which is the best approach? (I guess this is a general question for the RDF community.)

However, this leaves us with seven other issues (lack of type information, empty or missing properties, expressions of qualified properties, relationships expressed using local identifiers, usage of local URIs, formatted text in property values, and references to non-existent states) that it is not possible to solve this way, but this is okay, as these issues seem to be more along the lines of "things that are broken" rather than "things that have changed, that we ought to be able to fix with the SW tools".

[2] http://www.intertwingly.net/stories/2002/09/09/gentleIntroductionToNamespaces.html
[3] http://www.xml.com/pub/a/2001/01/10/rddl.html

Dr Mark H. Butler
Research Scientist
HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/
Received on Friday, 9 May 2003 10:38:24 UTC