- From: William Grosso <grosso@SMI.Stanford.EDU>
- Date: Tue, 18 Jul 2000 14:07:20 -0700
- To: www-rdf-interest@w3.org
- CC: www-rdf-comments@w3.org
The context: Yesterday, I posted a response to Klaus talking a little bit about Sergey's API. And I said

> We decided to incorporate these into Protege because (1) The
> API is pretty okay (a little low-level but that's better than
> the opposite problem) and (2) We wanted to use a standard API
> to help us track the (we hope) evolving spec. We found it
> quite useful.

Dan Brickley e-mailed me asking me to elaborate. I started to, but somehow wound up writing the enclosed document instead, which discusses some of the problems we encountered when trying to build an RDF back-end for Protege. It's not complete (my memory isn't perfect) but does outline some of the places where we felt the spec was weak or confusing.

William Grosso

A random collection of 11 things we noticed while improving the RDF support in Protege (in no particular order)

1. It's very hard to support the various versions of RDF / RDFS

Problem: Protege is a knowledge modelling tool. That is, we allow users to create classes-and-instances ("frame") style knowledge bases (along with various other AI'ish things like a constraint language). It's perfectly reasonable for people to want to import previous RDF. But there are a gazillion namespaces out there. For example:

    http://www.w3.org/TR/1999/PR-rdf-schema-19990303
    http://www.w3.org/2000/01/rdf-schema

And the RDF API, which uses find (triple of resources), requires us to get the URIs right for the core RDF constructs.

Solution: We wound up writing a method object that iterates through an RDF/RDFS file and tries to figure out (from the URIs) which versions of the RDF/RDFS namespaces are being used (if you download the Protege source, the method object is ComputeSchemaNamespaceFromModel in the package edu.stanford.db.protegex.storage.rdf.load).

This is rather clumsy. Even worse, it breaks if someone references two distinct versions of RDFS in the same file (as if, for example, someone concatenated two RDF/RDFS documents via cut and paste).

2.
Protege Projects don't fit very well with namespaces.

Problem: Protege has the notion of "projects." Basically, a project is a knowledge-base. But you can include projects in other projects (so a Diabetes-specific knowledge base can include a medical-terminology knowledge base). Inclusion is a unidirectional ("no cycles in the inclusion graph") and highly-structured way of building knowledge-bases.

RDF, on the other hand, has the notion of namespaces. Each resource belongs to a namespace. But there are no structural restrictions on namespace referencing. That is, "links" can go both ways between namespaces.

Solution: We adopted the notion that "A project is a namespace" and that, therefore, all resources defined in a project belong to a single namespace.

3. More generally, the semantics of namespaces are rather confusing.

Problem: While resources belong to namespaces, it's often not clear what namespace an assertion belongs to. For example,

    <s:Class rdf:about="&a;Business">
        <s:subClassOf rdf:resource="&a;Section"/>
    </s:Class>

refers to rdfs:Class, defines a resource called Business in the namespace A, and asserts a subclass relationship. Where does this statement live? Is it in a namespace somewhere?

What makes this more confusing are the following two facts:

(1) I can create a reified statement object, which does belong in a namespace. But I have no way (AFAIK) of asserting that the reified statement actually holds.

(2) rdfs:isDefinedBy seems to indicate that namespaces exist as resources. But there's no real way to indicate membership (except by URI matching?) and it's not clear what role reified namespaces play in the general scheme of things.

Solution: I took some Advil :-)

4. People can alter arbitrary objects in RDF.

Problem: I find the extent to which this can be done a tiny bit disconcerting. To wit: Suppose you define a class C.
Someone can import your definition, bind properties to it (using rdfs:domain) and set values for those properties on instances. This seems reasonable to me.

But they can also add super-classes and meta-classes to the definition of C. And this is a little disconcerting. Adding another property is a minor alteration, a slight extension of the original definition. Altering the taxonomy seems a bit more drastic.

But the truly disconcerting part comes when people do this to core RDFS constructs. There are RDF files out there that alter the definition of rdfs:Class. And that seems like a very bad idea indeed. If we're going to provide any notion of semantics at all in future versions of the spec, we need to say "this is what we mean by class and you can't change that."

Solution: In Protege, you can only edit frames that are local to the project (e.g. in the Diabetes example from item 2, someone working on the Diabetes knowledge base cannot alter the included medical terminology). Since the RDFS definitions are also "included", we simply don't allow any editing of them at all.

This is somewhat against the spirit of RDF. And certainly, extending Protege in the direction of adding a property to an included class seems reasonable. But the generality permitted by RDF just seems excessive.

5. It's hard to know what rdfs:domain actually means.

Problem: RDF knowledge-bases are incomplete. You can always "add" another domain statement to a property. Which means that, in practice, section 3.1.4 of the spec is pretty vapid.

In Protege, which attempts to provide forms for entering instances of classes, this causes problems. We need to know what properties are bound to which classes. And we need to have a fair amount of stability there.

Solution: This is partially handled by the notion of project inclusion. Part of not being able to modify the definitions in an included project is not being able to attach properties to classes (The domain of a property is a property of the property.
But Protege also takes the view that adding a domain is also an assertion about the class. We call this attaching the property to the class and it is a modification of the class definition). Which means that, while editing in Protege, we simply forbid certain legal RDF operations.

When importing RDF that was generated by some other mechanism, we attempt to guess the domain from the set of instances that take values. If you then write out the KB, we'll assert a set of rdfs:domain statements (and take them seriously).

6. Range feels broken.

Problem: The idea that there can only be a single class which defines the range of a property, over all domains that the property is bound to, is very restrictive. When you look at Protege projects saved out in RDF, a significant percentage of the "facet" information (a non-computed guess? Over half) turns out to be related to correcting the range. That is, statements of the form

    "When this property takes a value for instances of this class, the range is further restricted to ...."

    "This property can take, as values, instances from the following list of classes."

Solution: As hinted at in the problem statement, Protege uses "facets." We've changed the mapping since 1.3, however. What we now do can best be illustrated by a bit of RDFS

    <s:Class rdf:about="&a;Editor">
        <s:comment>Editors are responsible for the content of sections.</s:comment>
        <s:subClassOf rdf:resource="&a;Author"/>
        <s:subClassOf rdf:resource="&a;Employee"/>
        <a:FacetInformationProperty rdf:resource="&a;FacetInformation_Instance4"/>
        <a:FacetInformationProperty rdf:resource="&a;FacetInformation_Instance5"/>
    </s:Class>

This is a class definition. And, as part of it, the property "FacetInformationProperty" takes on multiple values.
Each of which looks something like:

    <a:FacetInformation rdf:about="&a;FacetInformation_Instance5">
        <a:FacetNameProperty>:VALUE-TYPE</a:FacetNameProperty>
        <a:FacetValueProperty rdf:resource="&a;Section"/>
        <a:FacetSlotProperty rdf:resource="&a;sections"/>
        <a:FacetValueTypeProperty rdf:resource="&s;Class"/>
    </a:FacetInformation>

That is, we have a property, whose domain is classes, whose values are instances of "FacetInformation" which can be interpreted as talking about some other property which is attached to the class. In this case, it's restricting the value of the property "sections" to instances of the class "Section."

7. Some support for some sort of facet-like property would be very nice.

Problem: The above solution, of using instances of "FacetInformation", is fairly ad-hoc.

Solution: Right now, Protege stores out as much information in RDFS format as possible. Including some, in the case of range, that's slightly incorrect. Suppose, for example, that a property can take instances from two distinct classes. Protege stores this information as follows:

  - Facets are used to store the multiple-range information with perfect precision
  - rdfs:range stores a minimal common superclass of the two classes (note that there may be more than one such superclass; we pick one)

And then, when reading the RDF back in, Protege looks for the first type of information first (and, if it is found, the second type is ignored).

This lets us store out as much information as possible using the basic RDF constructs, while not losing information within Protege. But this is somewhat ugly too-- we're doing our best to preserve semantics within the Protege realm while simultaneously exporting as much information as possible to non-Protege RDF editors. But the result is that the knowledge-base that the NPRDFE gets is slightly different from the one that the Protege user is creating.

8. Subproperties are confusing.

Problem: The idea itself is reasonable.
Property "husband" is a subproperty of property "spouse." And section 2.3.3 then tells us that if "Bob" is the "husband" of "Alice", then "Bob" is also the "spouse" of "Alice."

What about domain and range? There's an obvious answer for domain (though it's not in the spec)-- anything in the domain of a subproperty must also be in the domain of a super-property.

But range? 3.1.3 says "A property can have at most one range property"

Suppose we try to say:

    The range of spouse is Person

What is the range of husband? Is it Person? Or Resource? What if we then assert

    The range of husband is MalePerson

Have we done something legal here? Offhand, I'd say it should be allowed (subproperties should be allowed to narrow the range), but the spec doesn't say.

In general, it's not clear what problems sub-properties solve. If we assume that subproperties narrow ranges and domains, then they do give us more precision. But it's a very small gain (as far as I can see) and experience has shown that facets are often the level of precision that's required.

There's also the (somewhat orthogonal) issue of "replacement." If I have

    Classes: Person, MalePerson, FemalePerson
    Properties: spouse, husband, wife

Then I want to be able to say "if a value for husband is asserted, then a different value cannot be asserted for spouse."

Solution: Protege has a flat property space. That is, Properties have the rdfs:subPropertyOf property (and it can be set), but there is no attempt at enforcement, or inheritance, of property values.

9. It'd be really nice to have finer-grained Primitive types, not just literals.

Problem: This has been discussed to death. We really really need a way to say "the value of this range is an integer" in a canonical way.

Solution: Protege uses a combination of rdfs:range and facets to help handle this. Namely, if a property, for example, is integer-valued, we set the range to rdfs:Literal and then store the precise type of the literal in a facet.
As in the following snippet.

    <s:Property rdf:about="&a;Date">
        <a:SLOT-MAXIMUM-CARDINALITY>1</a:SLOT-MAXIMUM-CARDINALITY>
        <s:comment>When the paper was published</s:comment>
        <s:domain rdf:resource="&a;Newspaper"/>
        <s:range rdf:resource="&s;Literal"/>

10. rdfs:Literal is weird.

Problem: It's a class. Which corresponds to the idea of a "literal" in the RDF sense. But it's not really a class in the sense of having instances, is it? In what sense is this:

    <s:Literal rdf:about="&a;Tylenol">
    </s:Literal>

a literal? Is the string "7" really equivalent to

    <s:Literal rdf:about="&s;7">
    </s:Literal>

When we import RDF, we translate ranges of type Literal to the Protege primitive type string.

Solution: There really isn't one. Instances of Literal are almost certainly pilot-error, but they're possible. In which case, we don't handle them very cleanly.

11. Clearer semantics, in general, would be good

That should be obvious by now. While I don't think we need to provide model-theoretic semantics, it'd be nice to have a clearer picture of just what the spec is saying.
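
Appendix to item 1: the namespace guessing described there can be sketched roughly as below. This is only an illustration in Python, not the actual ComputeSchemaNamespaceFromModel code. The function names, the list of RDFS local names, the "rdf-schema" substring heuristic, the trailing "#" on the namespace URIs and the example.org project namespace are all assumptions made for the sketch. Triples are modelled as plain tuples of URI strings.

```python
# Hypothetical sketch; not the Protege implementation.
# A triple is a (subject, predicate, object) tuple of URI strings.

# Local names assumed to be typical of an RDFS vocabulary.
RDFS_LOCAL_NAMES = {"Class", "subClassOf", "subPropertyOf", "domain",
                    "range", "comment", "label", "Literal", "Resource"}


def split_uri(uri):
    """Split a URI into (namespace, local name) at the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[:cut + 1], uri[cut + 1:]


def guess_rdfs_namespace(triples):
    """Return the single RDFS namespace used in the triples, or None.

    Raises ValueError when more than one candidate namespace shows up,
    which is the concatenated-documents case mentioned in item 1.
    """
    candidates = set()
    for triple in triples:
        for uri in triple:
            namespace, local = split_uri(uri)
            # Assumed heuristic: every RDFS namespace URI contains "rdf-schema".
            if "rdf-schema" in namespace and local in RDFS_LOCAL_NAMES:
                candidates.add(namespace)
    if len(candidates) > 1:
        raise ValueError("mixed RDFS namespaces: %s" % sorted(candidates))
    return candidates.pop() if candidates else None


OLD = "http://www.w3.org/TR/1999/PR-rdf-schema-19990303#"
NEW = "http://www.w3.org/2000/01/rdf-schema#"
A = "http://example.org/a#"  # hypothetical project namespace

triples = [(A + "Business", NEW + "subClassOf", A + "Section")]
print(guess_rdfs_namespace(triples))  # http://www.w3.org/2000/01/rdf-schema#
```

Raising an error on mixed namespaces at least makes the failure visible. A sketch like this does nothing to resolve which version should win.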
Received on Tuesday, 18 July 2000 17:07:24 UTC