Some comments on the RDF Spec now that Protege 1.4 is out

The context: Yesterday, I posted a response to Klaus
talking a little bit about Sergey's API. And I said:

> We decided to incorporate these into Protege because (1) The
> API is pretty okay (a little low-level but that's better than
> the opposite problem) and (2) We wanted to use a standard API
> to help us track the (we hope) evolving spec. We found it 
> quite useful. 

Dan Brickley e-mailed me asking me to elaborate.

I started to, but somehow wound up writing the enclosed document 
instead. It discusses some of the problems we encountered 
when trying to build an RDF back-end for Protege. It's not 
complete (my memory isn't perfect) but it does outline some of the 
places where we felt the spec was weak or confusing. 


William Grosso




A random collection of 11 things we noticed while improving 
the RDF support in Protege (in no particular order)



1. It's very hard to support the various versions of 
   RDF / RDFS 

   Problem:

   Protege is a knowledge modelling tool. That is, we allow
   users to create classes-and-instances ("frame") style
   knowledge bases (along with various other AI'ish things
   like a constraint language). 

   It's perfectly reasonable for people to want to import
   previous RDF. But there are a gazillion namespaces out 
   there. For example:

	http://www.w3.org/TR/1999/PR-rdf-schema-19990303
	http://www.w3.org/2000/01/rdf-schema

   And the RDF API, which uses

	find (triple of resources)

   requires us to get the URIs right for the core RDF constructs.

	
   Solution:

   We wound up writing a method object that iterates through an 
   RDF/RDFS file and tries to figure out (from the URIs) which 
   version of the RDF/RDFS namespaces is being used (if you 
   download the Protege source, the method object is 
   ComputeSchemaNamespaceFromModel in the package
   edu.stanford.db.protegex.storage.rdf.load).

   This is rather clumsy. Even worse, it breaks if someone
   references two distinct versions of RDFS in the same file
   (as if, for example, someone concatenated two RDF/RDFS documents
   via cut and paste). 
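
   To make the idea concrete, here is a minimal sketch of that kind
   of detection. It is written in Python for brevity and is not the
   Protege code; the function names and the representation of
   triples as plain tuples of strings are assumptions made for
   illustration. It also shows the failure case mentioned above:
   two versions in one file.

```python
from collections import Counter

# The two RDFS namespace URIs quoted above; a real list would be longer.
KNOWN_RDFS_NAMESPACES = [
    "http://www.w3.org/TR/1999/PR-rdf-schema-19990303",
    "http://www.w3.org/2000/01/rdf-schema",
]


def detect_rdfs_namespaces(triples):
    """Count, per known namespace, the URIs in the triples that start with it."""
    counts = Counter()
    for triple in triples:
        for term in triple:
            for namespace in KNOWN_RDFS_NAMESPACES:
                # Note: a literal that happens to start with the URI is also counted.
                if term.startswith(namespace):
                    counts[namespace] += 1
    return counts


def compute_schema_namespace(triples):
    """Return the single RDFS namespace in use, or None if none is found.

    Raises ValueError when more than one version appears, which is the
    concatenated-documents case that a single-answer approach cannot handle.
    """
    counts = detect_rdfs_namespaces(triples)
    if len(counts) > 1:
        raise ValueError("multiple RDFS versions found: %s" % sorted(counts))
    if not counts:
        return None
    return next(iter(counts))
```

   A more forgiving design would return all detected namespaces and
   treat them as equivalent on import, at the cost of ignoring any
   real differences between the versions.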

2. Protege Projects don't fit very well with namespaces.

   Problem:

   Protege has the notion of "projects." Basically, a project is 
   a knowledge-base. But you can include projects in 
   other projects (so a Diabetes-specific knowledge base can 
   include a medical-terminology knowledge base). Inclusion 
   is a unidirectional ("no cycles in the inclusion graph") and 
   highly-structured way of building knowledge-bases. 

   RDF, on the other hand, has the notion of namespaces. 
   Each resource belongs to a namespace. But there are no
   structural restrictions on namespace referencing.  That
   is, "links" can go both ways between namespaces.

   Solution:

   We adopted the notion that "A project is a namespace" and
   that, therefore, all resources defined in a project belong 
   to a single namespace. 

3. More generally, the semantics of namespaces are rather 
   confusing. 
 
   Problem: 

   For example, while resources belong to namespaces, it's often 
   not clear what namespace an assertion belongs to. For 
   example, 

	<s:Class rdf:about="&a;Business">
		<s:subClassOf rdf:resource="&a;Section"/>
	</s:Class>

    refers to rdfs:Class, defines a resource called Business in 
    the namespace A, and asserts a subclass relationship.

    Where does this statement live ? Is it in a namespace somewhere ?

    What makes this more confusing are the following two facts:

 	(1) I can create a reified statement object, which does 
 	    belong in a namespace. But I have no way (AFAIK) of 
	    asserting that the reified statement actually holds.

	(2) rdfs:isDefinedBy seems to indicate that namespaces exist
	    as resources. But there's no real way to indicate 
	    membership (except by URI matching ?) and it's not clear
	    what role reified namespaces play in the general scheme 
	    of things. 

   Solution: 

   I took some advil :-)

4. People can alter arbitrary objects in RDF. 

   Problem:

   I find the extent to which this can be done a tiny bit 
   disconcerting. To wit: Suppose you define a class C. 
   Someone can import your definition, bind properties to 
   it (using rdfs:domain) and set values for those properties
   on instances. 

   This seems reasonable to me. 

   But they can also add super-classes and meta-classes to 
   the definition of C. And this is a little disconcerting.
   Adding another property is a minor alteration, a slight 
   extension of the original definition. Altering the taxonomy 
   seems a bit more drastic. 

   But the truly disconcerting part comes when people do this
   to core RDFS constructs. There are RDF files out there that
   alter the definition of rdfs:Class. And that seems like a 
   very bad idea indeed. If we're going to provide any notion of
   semantics at all in future versions of the spec, we need to 
   say "this is what we mean by class and you can't change that."

   Solution:

   In Protege, you can only edit frames that are local to the 
   project (e.g. in the above example, someone working on the 
   Diabetes knowledge base cannot alter included medical terminology).
   Since the RDFS definitions are also "included", we simply don't
   allow any editing of them at all. 

   This is somewhat against the spirit of RDF. And certainly, 
   extending Protege in the direction of adding a property to an 
   included class seems reasonable. But the generality permitted
   by RDF just seems excessive. 


5. It's hard to know what rdfs:domain actually means. 

   Problem: 

   RDF knowledge-bases are incomplete. You can always "add"
   another domain statement to a property. Which means that, 
   in practice, section 3.1.4 of the spec is pretty vapid.

   In Protege, which attempts to provide forms for entering 
   instances of classes, this causes problems. We need to 
   know what properties are bound to which classes. And we 
   need to have a fair amount of stability there. 

   Solution:

   This is partially handled by the notion of project inclusion.
   Part of not being able to modify the definitions in an included 
   project is not being able to attach properties to classes 
   (The domain of a property is a property of the property. But 
   Protege also takes the view that adding a domain is also an 
   assertion about the class. We call this attaching the property
   to the class and it is a modification of the class definition). 
   Which means that, while editing in Protege, we simply forbid 
   certain legal RDF operations. 

   When importing RDF that was generated by some other mechanism,
   we attempt to guess the domain from the set of instances that
   take values. If you then write out the KB, we'll assert a set
   of rdfs:domain statements (and take them seriously). 
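
   The import-time guess can be sketched roughly as follows. This is
   an illustration in Python, not the Protege implementation; the
   function name and the dictionary/tuple representation of the
   knowledge base are assumptions.

```python
from collections import defaultdict


def guess_domains(instance_types, property_values):
    """Guess each property's domain from the instances that carry values.

    instance_types:  dict mapping an instance name to its class name
    property_values: iterable of (instance, property, value) tuples
    Returns a dict mapping each property to a sorted list of classes.
    """
    domains = defaultdict(set)
    for instance, prop, _value in property_values:
        cls = instance_types.get(instance)
        if cls is not None:  # skip subjects whose class is unknown
            domains[prop].add(cls)
    return {prop: sorted(classes) for prop, classes in domains.items()}
```

   The obvious weakness is that a class with no instances taking a
   value never shows up in the guessed domain, so the result depends
   on the data that happens to be in the file.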


6. Range feels broken. 

   Problem:

   The idea that there can only be a single class which defines
   the range of a property, over all domains that the property is 
   bound to, is very restrictive. When you look at Protege 
   projects saved out in RDF, a significant percentage of
   the "facet" information (a non-computed guess ? Over half)
   turns out to be related to correcting the range. That is,
   statements of the form 

	"When this property takes a value for instances of this 
	class, the range is further restricted to ...."

	"This property can take, as values, instances from the 
	following list of classes."

   Solution:

   As hinted at in the problem statement, Protege uses "facets."
   We've changed the mapping since 1.3, however. What we now do
   can best be illustrated by a bit of RDFS

	<s:Class rdf:about="&a;Editor">
		<s:comment>Editors are responsible for the content of sections.</s:comment>
		<s:subClassOf rdf:resource="&a;Author"/>
		<s:subClassOf rdf:resource="&a;Employee"/>
		<a:FacetInformationProperty rdf:resource="&a;FacetInformation_Instance4"/>
		<a:FacetInformationProperty rdf:resource="&a;FacetInformation_Instance5"/>
	</s:Class>

   This is a class definition. And, as part of it, the property
   "FacetInformationProperty" takes on multiple values. Each of which 
   looks something like:

	<a:FacetInformation rdf:about="&a;FacetInformation_Instance5">
		<a:FacetNameProperty>:VALUE-TYPE</a:FacetNameProperty>
		<a:FacetValueProperty rdf:resource="&a;Section"/>
		<a:FacetSlotProperty rdf:resource="&a;sections"/>
		<a:FacetValueTypeProperty rdf:resource="&s;Class"/>
	</a:FacetInformation>

   That is, we have a property, whose domain is classes, whose values 
   are instances of "FacetInformation" which can be interpreted as 
   talking about some other property which is attached to the class. 
   In this case, it's restricting the value of the property "sections"
   to instances of the class "Section."
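
   One way to read this structure is as a per-class override of the
   global range. The sketch below (Python, with hypothetical names;
   it is not how Protege stores facets internally) looks up the
   allowed value classes for a property on a given class and falls
   back to the property's single rdfs:range when no facet applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FacetInformation:
    owner_class: str  # class that carries the FacetInformationProperty
    facet_name: str   # FacetNameProperty, e.g. ":VALUE-TYPE"
    slot: str         # FacetSlotProperty: the property being restricted
    value: str        # FacetValueProperty: the restricting class


def allowed_value_classes(cls, prop, facets, global_range):
    """Return the classes whose instances `prop` may take on `cls`.

    Class-local :VALUE-TYPE facets win; otherwise use the property's
    global rdfs:range (a dict from property to class), if it has one.
    """
    local = [
        facet.value
        for facet in facets
        if facet.owner_class == cls
        and facet.slot == prop
        and facet.facet_name == ":VALUE-TYPE"
    ]
    if local:
        return local
    fallback = global_range.get(prop)
    return [fallback] if fallback is not None else []
```

   Several facets on the same class and property give the "list of
   classes" case described in the problem statement.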


7. Some support for some sort of facet-like property would be
   very nice.

   Problem: 

   The above solution, of using instances of "FacetInformation" is 
   fairly ad-hoc.

   Solution:

   Right now, Protege stores out as much information in RDFS format
   as possible. Including some, in the case of range, that's slightly 
   incorrect.

   Suppose, for example, that a property can take instances from two distinct
   classes. Protege stores this information as follows:

	Facets are used to store the multiple-range information with 
	perfect precision

	rdfs:range stores a minimal common superclass of the two classes
	(note that there may be more than one such superclass; we pick
	one)

   And then, when reading the RDF back in, Protege looks for the 
   first type of information first (and, if it is found, the second 
   type is ignored). This lets us store out as much information as 
   possible using the basic RDF constructs, while not losing information
   within Protege. 
 
   But this is somewhat ugly too-- we're doing our best to preserve
   semantics within the Protege realm while simultaneously exporting 
   as much information as possible to non-Protege RDF editors. But 
   the result is that the knowledge-base that a non-Protege RDF
   editor gets is slightly different from the one that the Protege
   user is creating. 
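
   The two-level storage for a property with two range classes can
   be sketched as below. This is an illustrative Python sketch under
   assumptions: the class hierarchy is a plain dict from class to
   its direct superclasses, the stored form is a dict, and all the
   function names are made up. It is not the Protege code.

```python
def ancestors(cls, parents):
    """Return cls together with every superclass reachable via `parents`."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def minimal_common_superclasses(a, b, parents):
    """Common superclasses of a and b that have no common superclass below them."""
    common = ancestors(a, parents) & ancestors(b, parents)
    return sorted(
        c for c in common
        if not any(d != c and c in ancestors(d, parents) for d in common)
    )


def export_range(classes, parents):
    """Store the exact classes in a facet and one approximation in rdfs:range."""
    a, b = classes
    candidates = minimal_common_superclasses(a, b, parents)
    # There may be several minimal candidates; pick one (here, the first by name).
    return {
        "facet": sorted(classes),
        "rdfs_range": candidates[0] if candidates else None,
    }


def import_range(stored):
    """Prefer the facet; fall back to rdfs:range only when no facet is present."""
    return stored.get("facet") or [stored["rdfs_range"]]
```

   A reader that only understands rdfs:range sees the broader
   superclass, which is exactly the mismatch described above.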

8. Subproperties are confusing.

   Problem: 

   The idea itself is reasonable. Property "husband" is a subproperty
   of property "spouse." And section 2.3.3 then tells us that if
   "Bob" is the "husband" of "Alice", then "Bob" is also the "spouse"
   of "Alice."

   What about domain and range ? There's an obvious answer for domain
   (though it's not in the spec)-- anything in the domain of a 
   subproperty must also be in the domain of a super-property.

   But range ? 3.1.3 says "A property can have at most one range property"
   Suppose we try to say:

  	The range of spouse is Person

   What is the range of husband ? Is it Person ? Or Resource ? What if
   we then assert 

	The range of husband is MalePerson

    Have we done something legal here ? Offhand, I'd say it should be
    allowed (subproperties should be allowed to narrow the range), but
    the spec doesn't say.

    In general, it's not clear what problems sub-properties solve.
    If we assume that subproperties narrow ranges and domains, then
    they do give us more precision. But it's a very small gain (as 
    far as I can see) and experience has shown that facets are often
    the level of precision that's required.

    There's also the (somewhat orthogonal) issue of "replacement." 
    If I have 

		Classes: Person, MalePerson, FemalePerson
		Properties: spouse, husband, wife

    Then I want to be able to say "if a value for husband is asserted, 
    then a different value cannot be asserted for spouse." 


    Solution:

    Protege has a flat property space. That is, Properties have the 
    rdfs:subPropertyOf property (and it can be set), but there is 
    no attempt at enforcement, or inheritance, of property values. 
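
    For comparison, the inheritance of values that section 2.3.3
    describes (and that Protege does not attempt) would amount to
    something like the following. This is a hedged Python sketch;
    the names and the tuple representation of statements are
    assumptions, and it says nothing about domain or range.

```python
def super_properties(prop, sub_property_of):
    """Return every property reachable from `prop` via rdfs:subPropertyOf."""
    seen = set()
    stack = [prop]
    while stack:
        current = stack.pop()
        for parent in sub_property_of.get(current, ()):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                stack.append(parent)
    return seen


def entail(statements, sub_property_of):
    """Add, for each (subject, property, value), the same statement
    restated with every super-property of `property`."""
    inferred = set(statements)
    for subject, prop, value in statements:
        for parent in super_properties(prop, sub_property_of):
            inferred.add((subject, parent, value))
    return inferred
```

    With husband declared a subproperty of spouse, a husband
    statement yields the matching spouse statement. The
    "replacement" constraint discussed above is a separate check
    that this sketch does not cover.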



9. It'd be really nice to have finer-grained Primitive types, not 
   just literals.

   Problem: 

   This has been discussed to death. We really really need a way 
   to say "the value of this range is an integer" in a canonical 
   way. 

   Solution: 

   Protege uses a combination of rdfs:range and facets to help handle
   this. Namely, if a property, for example, is integer-valued, we 
   set the range to rdfs:Literal and then store the precise type of the 
   literal in a facet. As in the following snippet.

	<s:Property rdf:about="&a;Date">
		<a:SLOT-MAXIMUM-CARDINALITY>1</a:SLOT-MAXIMUM-CARDINALITY>
		<s:comment>When the paper was published</s:comment>
		<s:domain rdf:resource="&a;Newspaper"/>
		<s:range rdf:resource="&s;Literal"/>
	</s:Property>


10. rdfs:Literal is weird.

   Problem:

   It's a class. Which corresponds to the idea of a "literal" in the 
   RDF sense. But it's not really a class in the sense of having instances,
   is it ? In what sense is this:

	<s:Literal rdf:about="&a;Tylenol">
	</s:Literal>

   a literal ? Is the string "7" really equivalent to 
   
	<s:Literal rdf:about="&s;7">
	</s:Literal>

   When we import RDF, we translate ranges of type Literal to the 
   Protege primitive type string.  
 

   Solution:

   There really isn't one. Instances of Literal are almost certainly 
   pilot-error, but they're possible. In which case, we don't 
   handle them very cleanly. 
   

11. Clearer semantics, in general, would be good

   That should be obvious by now. While I don't think we need to 
   provide model-theoretic semantics, it'd be nice to have a 
   clearer picture of just what the spec is saying.

Received on Tuesday, 18 July 2000 17:07:24 UTC