Re: Streaming SPARQL result to disk

It's not entirely clear whether the system that is hitting resource limits
 is the server or the client.

If the result set to which "ORDER BY" will be applied is too big, the
resource constraints will be hit before the "OFFSET" can be applied. The
server may already be using external storage for the temporary result
set.

If OFFSET and LIMIT are used for paging, it *may* be possible to use
SPARQL UPDATE to create a temporary named graph that can then be paged
across multiple calls without an ORDER BY; however, whether this gives
useful results is implementation-dependent
<http://www.w3.org/TR/2013/REC-sparql11-query-20130321/#modOffset>.
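
A minimal sketch of that approach with Sesame (the endpoint URL and the
graph name urn:example:export are placeholders, and it assumes the
endpoint accepts SPARQL 1.1 Update through a SPARQLRepository):

import org.openrdf.query.QueryLanguage;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.sparql.SPARQLRepository;

public class TempGraphPaging {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; updates must be accepted at this address.
        SPARQLRepository repo = new SPARQLRepository("http://localhost:8080/sparql");
        repo.initialize();
        RepositoryConnection con = repo.getConnection();
        try {
            // Materialise the selection once into a scratch graph so that
            // later paging does not re-evaluate the expensive pattern.
            // Replace the WHERE clause with the real selection.
            con.prepareUpdate(QueryLanguage.SPARQL,
                    "INSERT { GRAPH <urn:example:export> { ?s ?p ?o } } "
                  + "WHERE  { ?s ?p ?o }").execute();

            // ... page through or dump <urn:example:export> here,
            // then remove it (see the next step below) ...
        } finally {
            con.close();
            repo.shutDown();
        }
    }
}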

Since most engines provide an implementation-specific way to dump a named
graph to disk, there may be no need to use a SPARQL query to fetch the
results at all; however, it is important to remove the graph once it has
been processed (just as with non-"TEMPORARY" temporary tables in SQL).
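
For instance, a Sesame connection can stream a named graph straight to a
file and then clear it; again only a sketch, with placeholder endpoint,
file name, and graph URI:

import java.io.FileOutputStream;
import java.io.OutputStream;

import org.openrdf.model.URI;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.sparql.SPARQLRepository;
import org.openrdf.rio.ntriples.NTriplesWriter;

public class DumpNamedGraph {
    public static void main(String[] args) throws Exception {
        SPARQLRepository repo = new SPARQLRepository("http://localhost:8080/sparql");
        repo.initialize();
        RepositoryConnection con = repo.getConnection();
        OutputStream out = new FileOutputStream("export.nt");
        try {
            URI graph = repo.getValueFactory().createURI("urn:example:export");

            // Each statement in the graph is handed to the writer as it is
            // retrieved, so nothing is accumulated in memory.
            con.export(new NTriplesWriter(out), graph);

            // Remove the graph once it has been processed.
            con.clear(graph);
        } finally {
            out.close();
            con.close();
            repo.shutDown();
        }
    }
}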

If the resource limits are being exceeded on the client side, it is easy
to use an API that processes results as they are produced; for example,
Sesame queries generally provide a pair of evaluate methods: one builds a
complete result object, while the other takes a handler object that is
called for each set of bindings or each statement of the resulting graph.

For example:

SPARQLTupleQuery::evaluate
<http://openrdf.callimachus.net/sesame/2.7/apidocs/org/openrdf/repository/sparql/query/SPARQLTupleQuery.html#evaluate(org.openrdf.query.TupleQueryResultHandler)>
can be passed an instance of SPARQLResultsJSONWriter
<http://openrdf.callimachus.net/sesame/2.7/apidocs/org/openrdf/query/resultio/sparqljson/SPARQLResultsJSONWriter.html>.

and

SPARQLGraphQuery::evaluate
<http://openrdf.callimachus.net/sesame/2.7/apidocs/org/openrdf/repository/sparql/query/SPARQLGraphQuery.html#evaluate(org.openrdf.rio.RDFHandler)>
can be passed an instance of NTriplesWriter
<http://openrdf.callimachus.net/sesame/2.7/apidocs/org/openrdf/rio/ntriples/NTriplesWriter.html>.
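
Putting those together, a rough sketch (endpoint URL, queries, and file
names are placeholders):

import java.io.FileOutputStream;
import java.io.OutputStream;

import org.openrdf.query.QueryLanguage;
import org.openrdf.query.resultio.sparqljson.SPARQLResultsJSONWriter;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.sparql.SPARQLRepository;
import org.openrdf.rio.ntriples.NTriplesWriter;

public class StreamResultsToDisk {
    public static void main(String[] args) throws Exception {
        SPARQLRepository repo = new SPARQLRepository("http://localhost:8080/sparql");
        repo.initialize();
        RepositoryConnection con = repo.getConnection();
        try {
            // SELECT: each solution is passed to the JSON writer as it
            // arrives, so the full result set is never held in memory.
            OutputStream bindings = new FileOutputStream("bindings.srj");
            con.prepareTupleQuery(QueryLanguage.SPARQL,
                    "SELECT ?s ?p ?o WHERE { ?s ?p ?o }")
               .evaluate(new SPARQLResultsJSONWriter(bindings));
            bindings.close();

            // CONSTRUCT: each statement is passed to the N-Triples writer
            // in turn.
            OutputStream triples = new FileOutputStream("triples.nt");
            con.prepareGraphQuery(QueryLanguage.SPARQL,
                    "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }")
               .evaluate(new NTriplesWriter(triples));
            triples.close();
        } finally {
            con.close();
            repo.shutDown();
        }
    }
}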

Similarly, Jena provides QueryExecution::execConstructTriples
<http://jena.apache.org/documentation/javadoc/arq/com/hp/hpl/jena/query/QueryExecution.html#execConstructTriples()>,
which returns an Iterator<Triple> and so need not build a complete result
set in memory. The QueryExecution object can be constructed by calling one
of the various QueryExecutionFactory::sparqlService
<http://jena.apache.org/documentation/javadoc/arq/com/hp/hpl/jena/query/QueryExecutionFactory.html#sparqlService(java.lang.String, com.hp.hpl.jena.query.Query)>
methods.
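
A corresponding sketch for Jena (placeholder endpoint and query; the
RDFDataMgr class from Jena's RIOT module is used here to serialise the
iterator as N-Triples without ever building a Model):

import java.io.FileOutputStream;
import java.io.OutputStream;

import org.apache.jena.riot.RDFDataMgr;

import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;

public class JenaConstructToDisk {
    public static void main(String[] args) throws Exception {
        QueryExecution qe = QueryExecutionFactory.sparqlService(
                "http://localhost:8080/sparql",
                "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }");
        OutputStream out = new FileOutputStream("triples.nt");
        try {
            // execConstructTriples() yields triples one at a time;
            // writeTriples drains the iterator straight to the file.
            RDFDataMgr.writeTriples(out, qe.execConstructTriples());
        } finally {
            out.close();
            qe.close();
        }
    }
}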

By avoiding repeated queries using ORDER BY, OFFSET, and LIMIT, the load
on the server can be greatly reduced.

Simon


On Sat, Jul 19, 2014 at 1:48 PM, Souza, Renan F. S. <
renan123@missouristate.edu> wrote:

> Not sure if triple store implementations allow you to do that directly.
> One thing you could try is to use LIMIT and OFFSET (with ORDER BY)
> modifiers so that the result would fit in memory, then you write the result
> in a file. Do that as many times as needed until you have no more results
> left. That would work if each query that uses LIMIT, OFFSET and ORDER BY
> does not take too long to run.
>
> You can use the COUNT modifier to check how many times you would need to
> do that.
> Of course, if the results are really that big, I would write a simple
> program to do the job.
>
>
>
>
>
> On Fri, Jul 18, 2014 at 6:57 PM, Luca Matteis <lmatteis@gmail.com> wrote:
>
>> Hello,
>>
>> I'm executing a SPARQL query against a large endpoint I've setup
>> locally. The problem is that the result of this query is too large to
>> be held in memory. Are there endpoints that allow me to stream the
>> results to disk? For example, if it's a CONSTRUCT query it could
>> stream the N-Triples line by line to disk.
>>
>> Thank you,
>> Luca
>>
>>
>
>
> --
> Thank you!
> Regards,
>
> Souza, Renan F. S.
> Bachelor of Computer Science
> Missouri State University, Springfield, MO
> Masters in Computer Systems Engineering
> Federal University of Rio de Janeiro, Brazil
>
>
> +55-21-99257-3934
> Personal email: renan-francisco@hotmail.com
>

Received on Saturday, 19 July 2014 19:52:47 UTC