RE: Streamability II from Seaborne, Andy on 2004-06-16 (public-rdf-dawg@w3.org from April to June 2004)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Wed, 16 Jun 2004 22:42:14 +0100
To: "'Janne Saarela'" <janne.saarela@profium.com>
Cc: "'RDF Data Access Working Group'" <public-rdf-dawg@w3.org>
Message-ID: <000c01c453ea$cbe435c0$0a01a8c0@atlas>
-------- Original Message --------
> From: public-rdf-dawg-request@w3.org <>
> Date: 16 June 2004 13:45
> 
> > With a streamable protocol the loop above can be executed the first
> > time as soon as the first result binding (row in the result table)
> > arrives.  This is Jim's example of do something with first 100 results
> > while the server is still dealing with the next 100 done in the style
> > of iterators.  The above code does not require all the results to be
> > in memory at the same time.
> 
> Agreed, this would be familiar for developers who've been
> dealing with the SAX programming model.
> 
> > In Jena the classes are: QueryResults [1] for the iterator and
> > ResultBinding [2] for each row of the conceptual result table.
> > 
> > [1] http://jena.sourceforge.net/javadoc/com/hp/hpl/jena/rdql/QueryResults.html
> > [2] http://jena.sourceforge.net/javadoc/com/hp/hpl/jena/rdql/ResultBinding.html
> 
> See also java.sql.ResultSet.  The API style is to deliver one row of
> the result set to the application at a time with
> java.sql.ResultSet#next() and java.sql.ResultSet#isLast().  The result set
> can be TYPE_FORWARD_ONLY.
> 
> This may be too concrete an example but pls bear with me as I need to
> understand how this would work in practice: 
> 
> I can imagine applications blocking with a call to next() even if
> isLast() says 'false'. This is not what I want query clients to
> experience (e.g. network problem would hang a program until TCP level
> says 'timeout').

A blocking .next() isn't a necessary design, but it does make the application
easier to write for the common design pattern of looping over results.  Select
(the system call) style does make for complicated programming but allows a
single-threaded app to multitask.  We could have had an .isMoreReady() call to
guard a blocking .next(), or .next() could be non-blocking, returning a
"not ready" indicator.  This is select(2) style.
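
To make the guarded/non-blocking variant concrete, here is a minimal sketch.
PollingResults, deliver(), finish() and isDone() are names made up for
illustration; they are not Jena or JDBC API, and the "network" is simulated
in a single thread, so this only shows the shape of the application loop.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical non-blocking result source; not part of Jena or JDBC. */
class PollingResults {
    private final Queue<String> arrived = new ArrayDeque<>();
    private boolean finished = false;

    // Called by whatever receives rows from the network (simulated below).
    void deliver(String row) { arrived.add(row); }
    void finish() { finished = true; }

    /** True if next() would return a row right now. */
    boolean isMoreReady() { return !arrived.isEmpty(); }

    /** True once the end of results was seen and every row was consumed. */
    boolean isDone() { return finished && arrived.isEmpty(); }

    /** Non-blocking: null is the "not ready" indicator. */
    String next() { return arrived.poll(); }
}

class PollingDemo {
    static List<String> run() {
        PollingResults results = new PollingResults();
        List<String> log = new ArrayList<>();
        int tick = 0;
        while (!results.isDone()) {
            // Simulate rows trickling in on odd ticks, then end of results.
            if (tick % 2 == 1 && tick < 6) results.deliver("row" + (tick / 2));
            if (tick == 6) results.finish();
            tick++;

            if (results.isMoreReady()) {
                log.add("got " + results.next());
            } else if (!results.isDone()) {
                log.add("idle");  // the app is free to do other work here
            }
        }
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

The cost is visible in the loop: the application has to decide what "idle"
means, which is exactly the extra decision a blocking .next() hides.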

Jena's query system is multithreaded because it is easier that way: the
application thread loops, pulling result rows out of a bounded buffer.
Blocking occurs if the buffer is exhausted; there is an explicit
end-of-results token.  Being Java, threading comes for little work.  The
query engine puts results into the bounded buffer as it generates them; the
application pulls them out.  It could provide a peek() for results (test for
blocking) but doesn't.
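
A rough sketch of that arrangement, not Jena's actual code: it assumes a
modern JDK and uses java.util.concurrent.ArrayBlockingQueue as the bounded
buffer, with rows reduced to strings.  BoundedBufferDemo, END and the buffer
size of 20 are illustrative choices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BoundedBufferDemo {
    // Explicit end-of-results token, compared by identity so that it can
    // never be confused with a real row.
    private static final String END = new String("<end-of-results>");

    static List<String> run(int rowCount, int bufferSize) {
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(bufferSize);

        // "Query engine" thread: puts rows in as it generates them.
        Thread engine = new Thread(() -> {
            try {
                for (int i = 0; i < rowCount; i++) {
                    buffer.put("row" + i);  // blocks while the buffer is full
                }
                buffer.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        engine.start();

        // Application thread: an ordinary loop pulling rows out.
        List<String> seen = new ArrayList<>();
        try {
            while (true) {
                String row = buffer.take();  // blocks while the buffer is empty
                if (row == END) break;
                seen.add(row);
            }
            engine.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reading", e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run(50, 20).size() + " rows received");
    }
}
```

A peek-style test for blocking would map onto BlockingQueue.peek() or
poll(), which return null instead of waiting.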

[Aside: I chose not to implement a JDBC interface because I couldn't see how
to provide the full interface, so an RDF interface would not be plug
compatible.]

> This is not what I want query clients to
> experience (e.g. network problem would hang a program until TCP level
> says 'timeout').

At some level, if the data isn't available, there are choices to be made.
Either wait until it all arrives first or provide some kind of per-row
interface.

> 
> If the call to next() is asynchronous, I guess we would then need a
> good old select() type of call familiar to C programmers who've dealt
> with file descriptors.  
> 
> Bottom line question: will the streaming protocol effectively require
> more work from developers?

No - it's not more work.  An API design has choices but there are several
well-known design patterns here.

My experience with a multithreaded implementation was that it was easier:
there were no complex, application-level select() calls and no deciding what
to do if select() says "nothing there".  However, my background includes
multithreaded systems and languages.

> I cannot see how to map it to current practice
> of using ResultSets and the like without additional low-level IO management?

The TCP stack already has receive-side byte buffering, so if that buffer is
about the size of a result row or larger, the client is already buffering the
next result or several anyway.  The bytes still need parsing, and either of
two styles works.  One is parse-on-demand (select syscall, or a blocking
.next() that synchronously calls the socket to get data and deserializes it).
The other is a results parser that blocks on the TCP socket and runs
asynchronously with the application, placing results in (e.g.) a bounded
buffer.
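
As a sketch of the parse-on-demand style: the iterator below reads and
"deserializes" one row only when the application asks for it.  The wire
format (one row per text line) and the StreamedRows name are assumptions
made for the example, and a StringReader stands in for the socket stream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical row iterator that parses on demand from a character stream. */
class StreamedRows implements Iterator<String> {
    private final BufferedReader in;
    private String lookahead;          // next row, if already read
    private boolean exhausted = false;

    StreamedRows(Reader source) { this.in = new BufferedReader(source); }

    @Override
    public boolean hasNext() {
        if (lookahead == null && !exhausted) {
            try {
                // On a real socket stream this read may block until a full
                // line has arrived or the peer closes the connection.
                lookahead = in.readLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            if (lookahead == null) exhausted = true;
        }
        return lookahead != null;
    }

    @Override
    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        String row = lookahead;
        lookahead = null;
        return row;
    }
}

class StreamedRowsDemo {
    static List<String> run() {
        // Stand-in for new InputStreamReader(socket.getInputStream()).
        Reader wire = new StringReader("x=1\nx=2\nx=3\n");
        StreamedRows rows = new StreamedRows(wire);
        List<String> seen = new ArrayList<>();
        while (rows.hasNext()) {
            seen.add(rows.next());
        }
        return seen;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

With a real java.net.Socket, setSoTimeout() bounds how long such a read may
block, so the application loop need not hang until the TCP level gives up.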

In Jena, results are streamed with iterators.  If it were not for the JDBC
limitations we have encountered (in the general case - we need per JDBC
driver code to compensate), we would stream from DB server through a network
JDBC connection to application, with buffering to smooth burst behaviour.
As JDBC is primarily for a closer-than-web coupling of client and server,
this is good.  When querying an in-memory RDF graph, the memory overhead is
around 20 statements (buffering) and it goes faster on multi-CPU systems.

One other design style is callbacks: instead of a result loop, the
application hands in a callback function which is called on each result row.
This is more natural on a limited, single threaded system but is more
application-complex than a loop (unless your programming language has
continuations).
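
For comparison, a small sketch of the callback style.  CallbackQuery and its
execute() signature are invented for illustration and assume a JDK with
java.util.function; the rows are a fixed list rather than a real query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical callback-driven query; the names are illustrative only. */
class CallbackQuery {
    private final List<String> rows;

    CallbackQuery(List<String> rows) { this.rows = rows; }

    /** Calls onRow once per result row, then onEnd exactly once. */
    void execute(Consumer<String> onRow, Runnable onEnd) {
        for (String row : rows) {
            onRow.accept(row);
        }
        onEnd.run();
    }
}

class CallbackDemo {
    static List<String> run() {
        List<String> log = new ArrayList<>();
        // State that a result loop would keep in a plain local variable has
        // to live somewhere the callbacks can reach and mutate.
        int[] count = {0};

        CallbackQuery query =
            new CallbackQuery(Arrays.asList("x=1", "x=2", "x=3"));
        query.execute(
            row -> {
                count[0]++;
                log.add("row " + count[0] + ": " + row);
            },
            () -> log.add("done after " + count[0] + " rows"));
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

The inversion of control is the application-side cost: the library decides
when the application code runs, and loop state moves out of the loop.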

I'm sure there are other patterns, and significant variations on the three
(select/.next block, multithreaded, callback) I have touched on here.

	Andy

> 
> I am sure if someone with ODBC experience could tell how this gray area
> (for me) was solved I would feel more comfortable. 
> 
> Janne
Received on Wednesday, 16 June 2004 17:42:47 UTC