Re: Comments on the Stanford RDF API

-- This is a call for opinions. Please contribute. --

Brian,

thanks a lot for your detailed experience report and improvement
suggestions. I'd like to discuss some of the extensions you propose.
Below I'm listing some possible action plans. Please comment on them
(anyone interested is welcome to contribute!). I'm also touching on
some of the general RDF issues.

"McBride, Brian" wrote:
> 
> In general, I found the API pretty easy to work with and a good basis
> for development.
> I've been working mainly with a database back end with some work on a low
> level editor and a schema validator. I've split comments into three groups,
> general issues, those motivated by database specific issues and stylistic.
> Its pretty encouraging that the database specific section is so small.

Sounds encouraging! Are you planning to open source some of your work?

> o Issue: Vector.indexAt() does not work for a vector of RDFNode.
> 
>   Reason:  RDFNode.equals(RDFNode n) should be RDFNode.equals(Object o).

Hmm, I tried to be careful about that, the definition of
RDFNodeImpl.equals is:

  public boolean equals (Object that) {

    if(that instanceof Digestable) {
      return DigestUtil.equal(getDigest(), ((Digestable)that).getDigest());
    }

    return label.equals( ((RDFNode)that).getLabel() );
  }

I'm not sure what causes the trouble for you...
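
One possible cause, sketched under assumptions: if a node class only
overloads equals (taking the node type) instead of overriding
equals(Object), java.util.Vector.indexOf (presumably the method meant
above) never calls the overload. The two node classes below are
hypothetical stand-ins, not the actual RDFNodeImpl.

```java
import java.util.Vector;

// Hypothetical node that only OVERLOADS equals; collections ignore this method.
class OverloadedNode {
    final String label;
    OverloadedNode(String label) { this.label = label; }

    public boolean equals(OverloadedNode that) {
        return that != null && label.equals(that.label);
    }
}

// Hypothetical node that OVERRIDES Object.equals(Object).
class OverriddenNode {
    final String label;
    OverriddenNode(String label) { this.label = label; }

    @Override
    public boolean equals(Object that) {
        return that instanceof OverriddenNode
            && label.equals(((OverriddenNode) that).label);
    }

    @Override
    public int hashCode() { return label.hashCode(); }
}

class EqualsDemo {
    // Vector.indexOf dispatches to equals(Object), so the overload is skipped
    // and the lookup falls back to identity comparison.
    static int indexOfOverloaded() {
        Vector<Object> v = new Vector<>();
        v.add(new OverloadedNode("a"));
        return v.indexOf(new OverloadedNode("a"));
    }

    // With a proper override, an equal-by-label node is found.
    static int indexOfOverridden() {
        Vector<Object> v = new Vector<>();
        v.add(new OverriddenNode("a"));
        return v.indexOf(new OverriddenNode("a"));
    }
}
```

If a custom node implementation behaves like the first class, the lookup
fails even though the labels match.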

> o Issue: No declared exceptions
> 
>   Discussion:  I've got a host of exception conditions that can arise
>          there needs to be a way to report them.  Currently I'm using
>          runtime exceptions so they don't have to be declared, but I'd
>          prefer to be able to declare them.

Good point. Plan: introduce ModelException. ImmutableModelException
(a subclass of ModelException) can be thrown if the model is not
modifiable. Are there more exception subtypes that would be generally
useful?
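
A minimal sketch of how such a hierarchy could look. Only the two
exception names come from the plan above; the constructors, the
ReadOnlyModelSketch class and its add(String) method are assumptions made
for illustration.

```java
// Checked base exception for all model-related failures.
class ModelException extends Exception {
    ModelException(String message) { super(message); }
}

// Thrown when a modification is attempted on a non-modifiable model.
class ImmutableModelException extends ModelException {
    ImmutableModelException(String message) { super(message); }
}

// Hypothetical read-only model; a statement is just a string here.
class ReadOnlyModelSketch {
    void add(String statement) throws ModelException {
        throw new ImmutableModelException("model is not modifiable: " + statement);
    }
}

class ExceptionDemo {
    // Callers can catch the specific subtype first, then the general one.
    static String tryAdd() {
        try {
            new ReadOnlyModelSketch().add("s p o");
            return "added";
        } catch (ImmutableModelException e) {
            return "immutable";
        } catch (ModelException e) {
            return "other";
        }
    }
}
```

Declaring only ModelException in method signatures would keep the
interface stable if more subtypes are added later.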

> o Issue:  Namespace names lost on import
> 
>   Reason: When I'm importing an RDF serialization into the database, I'm
>         passed the full URI.  It is not possible in general, to parse that
>       URI and pick out the namespace component.  It is better to retain
>       the namespace component which can then be used for better user
>       presentation in an editor and for better serialization of the model.
> 
>   Remedy: Add methods Model.createResource(String nsName, String roName),
>           String RDFResource.nsName() and String RDFResource.roName().

Unlike in UML, namespaces are not explicitly present in the RDF model.
They are merely used to make resource identifiers unique. Namespace
shortcuts in XML are "syntactic sugar". As long as the parsed models are
equivalent, it does not matter which namespaces are used. For example,
the following two serializations are interchangeable:

(1)

<rdf:RDF
xmlns:s="http://www.omg.org/uml/1.3/Behavioral_Elements.State_Machines.">

  <s:StateMachine>
    <s:transition> ... </s:transition>
  </s:StateMachine>

</rdf:RDF>

(2)

<rdf:RDF xmlns:s="http://www.omg.org/uml/1.3/">

  <s:Behavioral_Elements.State_Machines.StateMachine>
    <s:Behavioral_Elements.State_Machines.transition> ... </s:Behavioral_Elements.State_Machines.transition>
  </s:Behavioral_Elements.State_Machines.StateMachine>

</rdf:RDF>

Sometimes it is nice to be able to "extract" namespaces from URIs for
more compact/legible serialization. For that, any prefix can be used as
long as it occurs reasonably often and the suffixes do not contain
illegal characters.
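
A rough sketch of such an extraction, assuming a simplified notion of
"legal" characters (letters, digits, '_', '-', '.'); the real XML name
rules are broader. The class and method names are hypothetical.

```java
class NamespaceSplitter {
    // Simplified approximation of XML name characters.
    static boolean isNameChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_' || c == '-' || c == '.';
    }

    // Index where the local name starts; uri.length() if no legal name exists.
    static int splitPoint(String uri) {
        int i = uri.length();
        // Walk back over the trailing run of name characters.
        while (i > 0 && isNameChar(uri.charAt(i - 1))) {
            i--;
        }
        // A local name has to start with a letter or underscore.
        while (i < uri.length()
                && !(Character.isLetter(uri.charAt(i)) || uri.charAt(i) == '_')) {
            i++;
        }
        return i;
    }

    static String namespace(String uri) {
        return uri.substring(0, splitPoint(uri));
    }

    static String localName(String uri) {
        return uri.substring(splitPoint(uri));
    }
}
```

Applied to the StateMachine URI above, this heuristic yields the split
used in serialization (2); a serializer is free to pick a longer prefix
such as the one in (1).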

> o Issue:  Current API has no way to set a namespace prefix
> 
>   Reason: When displaying URI's, and when serialising it would be good
>         to display a namespace prefix that is meaningful to a human.
> 
>   Remedy: Add method Model.setNsPrefix(String nsName, String prefix)

I think this is not needed...
 
> o Issue: New query methods.
> 
>   Reason: When generating an RDF serialization, it is convenient to be
>         able to list all the namespaces used in a model so they can be
>         output at the head of the serialization.  For my RDF editor, I
>         want to be able to list all the unique subjects in the model,
>         and I'd like to use a database query rather than troll through
>         all the statements and pick them out myself.  See also the
>         stylistic note below.
> 
>   Remedy:  Add methods:
>                 
>                 RDFEnum Model.namespaces();
>                 RDFEnum Model.subjects();
>                 RDFEnum Model.predicates();
>                 RDFEnum Model.objects();

Ad RDFEnum Model.namespaces():

For very large datasets, even this approach may not be appropriate. If
one has to serialize a billion statements from a database, the namespace
information may not fit into main memory. I'd suggest reading subsets of
statements and generating partial serializations. That way, each time
you can collect the namespace prefixes of the given subset in memory
before dumping it. One small problem that I see with this approach is
that XML requires a single "top" element, so you cannot have a list of
rdf:RDF tags. According to the RDF M&S spec, one may not nest rdf:RDF
either. You can still have

<dummyTag>
  <rdf:RDF xmlns:s="<ns1>"
   ...
  </rdf:RDF>
  <rdf:RDF xmlns:s="<ns2>"
   ...
  </rdf:RDF>
</dummyTag>

SiRPAC contained in the API can parse this.
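
A sketch of the per-subset bookkeeping, under loud assumptions: each
batch is given as a list of predicate URIs, the namespace is cut at the
last '#' or '/', prefixes are generated as ns0, ns1, ..., and no XML
escaping is done. None of these names are part of the API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BatchNamespaces {
    // Simplistic split: everything up to and including the last '#' or '/'.
    static String namespaceOf(String uri) {
        int cut = Math.max(uri.lastIndexOf('#'), uri.lastIndexOf('/'));
        return uri.substring(0, cut + 1);
    }

    // Assigns prefixes to the namespaces seen in ONE batch only, so memory
    // use is bounded by the batch, not by the whole database.
    static Map<String, String> prefixesFor(List<String> predicateUris) {
        Map<String, String> prefixes = new LinkedHashMap<>();
        for (String uri : predicateUris) {
            String ns = namespaceOf(uri);
            if (!prefixes.containsKey(ns)) {
                prefixes.put(ns, "ns" + prefixes.size());
            }
        }
        return prefixes;
    }

    // Opening rdf:RDF tag of one partial serialization (no escaping applied).
    static String openingTag(Map<String, String> prefixes) {
        StringBuilder sb = new StringBuilder(
            "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"");
        for (Map.Entry<String, String> e : prefixes.entrySet()) {
            sb.append(" xmlns:").append(e.getValue())
              .append("=\"").append(e.getKey()).append('"');
        }
        return sb.append('>').toString();
    }
}
```

Each batch then gets its own rdf:RDF element inside the dummy wrapper
element, with declarations that are valid only for that batch.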

As to subjects(), predicates(), objects():

If adding something, I'd rather provide:

	Enumeration Model.getResources();
	Enumeration Model.getLiterals();

Why do you need to distinguish between subjects, predicates and objects?
Can you explain why any of the above methods might be needed in some
more detail?

 
> o Issue: Id for RDFNodes
> 
>   Reason: My reading of the spec is that a model can contain anonymous
>         resources.  I'm aware that you disagree with this.  In my
>         implementation an anonymous resource has an empty string as its
>         URI, so I need some other way to distinguish them.  For now,
>         I've added the following function to RDFNode.  This is really
>         just a placeholder for now, because I'd really like to get this
>         issue of anon resources cleared up.
> 
>   Remedy: long Resource.getId() returns an integer unique within this
>         database.

Using integers to manipulate resources/literals is a very valid
approach. It may be the only feasible one if you have billions of
statements stored persistently. I'm thinking of adding the following
interface to support this:

interface IntegerIdentifiable {

   long getIntegerID();
}

If your database uses integers internally, you don't even need to load
string URIs into memory until you have to serialize the model. Your
custom Resource and Literal implementations could implement the above
interface. Does that make sense? Can you think of a better name than
the one above?
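
To make the idea concrete, here is a sketch of a database-backed
resource that carries only its integer ID and fetches the URI on demand.
LazyResource, its constructor and the lookup function (standing in for a
database query) are hypothetical.

```java
import java.util.function.LongFunction;

interface IntegerIdentifiable {
    long getIntegerID();
}

// Hypothetical resource that keeps only the database ID in memory.
class LazyResource implements IntegerIdentifiable {
    private final long id;
    private final LongFunction<String> uriLookup; // stands in for a DB query
    private String uri;                           // loaded on first use

    LazyResource(long id, LongFunction<String> uriLookup) {
        this.id = id;
        this.uriLookup = uriLookup;
    }

    @Override
    public long getIntegerID() { return id; }

    // The string URI is only materialized when somebody asks for it,
    // for example during serialization.
    String getURI() {
        if (uri == null) {
            uri = uriLookup.apply(id);
        }
        return uri;
    }

    boolean isUriLoaded() { return uri != null; }

    // Comparison works on IDs alone, without touching the URI.
    @Override
    public boolean equals(Object that) {
        return that instanceof LazyResource && ((LazyResource) that).id == id;
    }

    @Override
    public int hashCode() { return Long.hashCode(id); }
}
```

Comparing two such resources never loads a URI, which is the point of
the integer-based approach for very large stores. Note that ID equality
is only meaningful within a single database.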

> 
> o Issue:  What to do with model.setSourceURI() and getSourceURI().
> 
>   Reason: I have some test files lying around.  When I import one of these,
>         SiRPAC is calling this method with an argument like
>         "c:\temp\rdfschema.rdf".  I'm just not sure how useful this is or
>         what to do with it.  Is this intended to be the URI for the model?
>         Is it a property to be attached to the model URI?

This is a hint to the application; it is not part of the RDF model. I
guess the only place I used getSourceURI is in the serializer
(org.w3c.rdf.implementation.syntax.sirpac.SiRS). Without this knowledge,
serialization of "genid"s looked very ugly. However, I modified the
serializer so that it can handle such cases gracefully (by creating an
XML entity for shortcuts) even without having getSourceURI. From the
current perspective, these methods seem obsolete.


> o Issue: Model.create() does not specify URI.
> 
>   Reason:  See my separate note, but I don't think we are far apart on this.
>         A model may have a URI, so I'd expect Model.create() to take a URI
>         parameter, which may be null or empty if the model is anonymous.
> 
>   Remedy:  modify Model.create() to be Model.create(String URI).

Currently, getURI on a model returns a digest-based URI of the model.
That is the model's identity. This URI cannot be set or changed, just
like the URIs of Resources. Why do you need this? Maybe this is a use
case for setSourceURI()?

> DATABASE SPECIFIC
> =================
> 
> o Issue: I need to free up resources when a model or an enumerator is no
>         longer in use.
> 
>   Reason:  My database implementation allocates database resources such
>         as connections, cursors and views.  I need to be able to release
>         these, preferably as soon as the application is finished with
>         them.  If I wait till the garbage collector runs, I tend to run
>         out of cursors even on simple applications, and I can't rely on
>         finalizers being run when an application terminates.
> 
>   Remedy: Add close() method to Model and the enumeration returned by
>         Model.elements.

Right, currently, there is no provision for persistence in the API. I'm
planning to add the following interface:

interface PersistentModel {

  /** Returns true if the in-memory model is not in sync with the persistent store. */
  boolean isDirty();

  /** Synchronizes the persistent store with the in-memory model. */
  void checkpoint() throws PersistentModelException;

  /** Drops the changes to the model that are not yet in the persistent store. */
  void rollback() throws PersistentModelException;
}

checkpoint() makes it possible to bring the DB content into a
transaction-consistent state. For example, if you write through to the
database on every add(), you may run into the following problem.
Consider adding two statements to the model:

  (X, rdf:type, PersonWithSocialSecurityNumber)
  (X, SSN, "123-45-6789")

If your application crashes after the first add(), your database becomes
inconsistent (from the viewpoint of the application).

find() invoked on a "dirty" model may throw a PersistentModelException.

Let me know whether such an interface fits well into your application
architecture.
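
A toy sketch of the intended semantics, assuming statements are plain
strings and the "persistent store" is just a second in-memory list.
BufferedModel and its helper methods are hypothetical; a real
implementation would talk to the database inside checkpoint().

```java
import java.util.ArrayList;
import java.util.List;

class PersistentModelException extends Exception {
    PersistentModelException(String message) { super(message); }
}

interface PersistentModel {
    boolean isDirty();
    void checkpoint() throws PersistentModelException;
    void rollback() throws PersistentModelException;
}

// Hypothetical model that buffers changes until checkpoint() is called.
class BufferedModel implements PersistentModel {
    private final List<String> store = new ArrayList<>();  // stands in for the DB
    private final List<String> memory = new ArrayList<>(); // in-memory view
    private boolean dirty;

    void add(String statement) {
        memory.add(statement); // not written through to the store
        dirty = true;
    }

    @Override
    public boolean isDirty() { return dirty; }

    // Makes the store match the in-memory model in one step.
    @Override
    public void checkpoint() {
        store.clear();
        store.addAll(memory);
        dirty = false;
    }

    // Discards everything added since the last checkpoint.
    @Override
    public void rollback() {
        memory.clear();
        memory.addAll(store);
        dirty = false;
    }

    int size() { return memory.size(); }
    int storedCount() { return store.size(); }
}
```

With this shape, the two SSN statements above would reach the store
together at checkpoint() or not at all, provided the checkpoint itself
runs as one database transaction.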


> STYLISTIC
> =========
> 
> o Suggestion:  I've added some public well known constants to the interfaces
>         with things like the RDF and RDFS name spaces.

What about interfaces (constant lists) in
org.w3c.rdf.vocabulary.rdf_schema_19990303.RDFS and
org.w3c.rdf.vocabulary.rdf_syntax_19990222.RDF? 

Since different implementations of Resources and Literals may be
necessary, I'm thinking of making them classes instead of interfaces,
which have a static method

static void setNodeFactory(NodeFactory n) {}

Setting the node factory recreates all static variables for Resources
and Literals using the new factory. This may be useful if you have
DB-based implementations of them.
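
A sketch of what such a vocabulary class could look like. The Resource
and NodeFactory interfaces are reduced to one method each for
illustration, and RDFSketch is a hypothetical stand-in for the real
vocabulary classes.

```java
// Minimal stand-ins for the API types (assumptions, not the real interfaces).
interface Resource {
    String getURI();
}

interface NodeFactory {
    Resource createResource(String uri);
}

// Hypothetical vocabulary class with replaceable node implementations.
class RDFSketch {
    static final String NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Not final: they are recreated whenever the factory changes.
    static Resource type;
    static Resource Property;

    static {
        // Default factory producing simple in-memory resources.
        setNodeFactory(uri -> () -> uri);
    }

    // Recreates all vocabulary constants using the given factory.
    static void setNodeFactory(NodeFactory f) {
        type = f.createResource(NS + "type");
        Property = f.createResource(NS + "Property");
    }
}
```

One caveat of this design: code that cached RDFSketch.type before the
factory was switched still holds a node from the old factory, so the
factory is best set once at startup.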

BTW, the next release will include an executable that generates
"vocabulary" classes from a list of URLs of  RDF schemas.

> o Suggestion:  Not all models will be mutable.  Move those methods that
>         modify the model into another interface, MutableModel.  These
>         methods would include addStatement, createStatement,
>         createResource, createLiteral.  Or need a not implemented
>         exception.

That's a good idea. One consideration: sometimes the same model is
mutable, sometimes it is not. If it is not, it can throw
ImmutableModelException. To find out, one might need a method like

	boolean isMutable()   (better name?)

PersistentModel should extend MutableModel, otherwise its methods do not
make any sense.
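
A sketch of the split, with heavily simplified interfaces (statements are
strings, and Model has only size() and isMutable()). FreezableModel and
freeze() are hypothetical and only illustrate a model whose mutability
changes at runtime.

```java
import java.util.ArrayList;
import java.util.List;

class ImmutableModelException extends Exception {
    ImmutableModelException(String message) { super(message); }
}

// Read-only operations that every model supports (simplified).
interface Model {
    int size();
    boolean isMutable();
}

// Modifying operations live in a separate interface.
interface MutableModel extends Model {
    void add(String statement) throws ImmutableModelException;
}

// Hypothetical model that starts out mutable and can be frozen later.
class FreezableModel implements MutableModel {
    private final List<String> statements = new ArrayList<>();
    private boolean mutable = true;

    @Override
    public int size() { return statements.size(); }

    @Override
    public boolean isMutable() { return mutable; }

    void freeze() { mutable = false; }

    @Override
    public void add(String statement) throws ImmutableModelException {
        if (!mutable) {
            throw new ImmutableModelException("model is frozen");
        }
        statements.add(statement);
    }
}
```

Callers that want to avoid the exception can check isMutable() first;
the exception remains as a safety net.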


> o Suggestion:  Not all models should have to support a query interface, so
>          move the query methods into a separate SimpleQuery interface.

Hmm, I'm not sure about that. find() is such a fundamental method that
I'd prefer to keep it in the Model interface and throw some
NotImplemented exception instead if you really hate implementing it.

However, sooner or later, we'll have multiple query languages for RDF.
For that, interfaces like RQLQueryableModel with appropriate query
methods are ok.

> o Suggestion: Rename Model.size() to Model.numStatements().  There might be
>         many ways to measure the size of a model.  This naming is clearer.

I agree. The current naming was chosen to mirror that of the java.util.*
classes.

> o Suggestion: Rename Model.elements() to Model.statements().  This
>         terminology is more consistent with the language used in the spec.

Well, an even "better" naming would be

	Model.getNumStatements()
	Model.getStatements()

Don't you think so? That's a crucial change; it affects almost all
classes in the API, so if we do it, let's do it right. Are there other
users of the API around who would be very unhappy about it?

The next release also includes

	boolean Model.isEmpty()

since sometimes getNumStatements() is not available. Is this a good
name?
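
For illustration, one way emptiness can be answered without a count,
assuming the statements are available as a java.util.Enumeration and
that probing its first element is cheap. EmptinessCheck is a
hypothetical helper, not part of the API.

```java
import java.util.Enumeration;

class EmptinessCheck {
    // Looks only at whether a first statement exists; never counts them all.
    static boolean isEmpty(Enumeration<?> statements) {
        return !statements.hasMoreElements();
    }
}
```

A cursor-based implementation could answer this after fetching at most
one row, whereas a full count might require scanning the table.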

Thanks again for your comments.

Best,
Sergey


[1] http://www-db.stanford.edu/~melnik/rdf/api.html

Received on Saturday, 6 May 2000 17:28:39 UTC