SPARQL Results Format and Unbound Variables

I hate to do this after last call, but we only started discussing it on
August 2.

The XML Results format specifies that unbound variables are represented
as <binding><unbound/></binding>.  This is relatively concise as long as
it's not repeated too often in each row.  However, that ends up not being
the case in a common (at least for me) use case for UNION.

There are two cases that I'm running into that cause problems:

UNION can be used to smash two queries into one request.  Although this
cuts down on some setup and tear down time, it basically doubles the
number of <binding> elements that are returned.

OPTIONAL in the wrong place can lead to a large fan out of results[1].
If one uses UNION instead, it reduces the possible explosion of rows[2].
Unfortunately, this spreads the results out over many rows, and leaves
the majority of the variables in each row unbound.
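
To make the trade-off concrete, here is a rough Python sketch of the row
counts for a single person.  It assumes a simplified model: each
property is fetched either in its own OPTIONAL block or in its own UNION
branch, and the per-property value counts are made-up numbers, not
taken from the data sets.

```python
from math import prod

def optional_rows(value_counts):
    """Rows when every property sits in its own OPTIONAL block.

    Independent multi-valued properties multiply; a property with no
    values still leaves one row (with that variable unbound).
    """
    return prod(max(1, n) for n in value_counts)

def union_rows(value_counts):
    """Rows when every property sits in its own UNION branch: one per value."""
    return sum(value_counts)

def union_unbound_cells(value_counts):
    """Each UNION row binds one property variable; the others are unbound."""
    return union_rows(value_counts) * (len(value_counts) - 1)

# Hypothetical counts for name, mbox, homepage, mbox_sha1sum, nick, seeAlso.
counts = [2, 3, 1, 0, 2, 4]
print("OPTIONAL rows:", optional_rows(counts))
print("UNION rows:", union_rows(counts))
print("unbound cells in UNION rows:", union_unbound_cells(counts))
```

The UNION form has far fewer rows, but most cells in each row are
unbound, which is where the <unbound/> overhead comes from.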

There are at least two ways to trim the results back down with just
syntax changes.  The least intrusive change would be to drop the
<unbound/> tag and let an empty <binding name=".."/> imply it.  More
drastic is to drop the entire <binding> tag when the variable is
unbound, since the set of variables can be recovered from the head.
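
As an illustration only, the two variants can be sketched in Python
with the standard library's ElementTree.  The sample document is
invented and deliberately omits the results namespace, and the function
names are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

# Tiny made-up result set in the current style (no namespace, for brevity).
SAMPLE = """<sparql>
  <head>
    <variable name="person"/>
    <variable name="name"/>
    <variable name="mbox"/>
  </head>
  <results>
    <result>
      <binding name="person"><uri>http://example.org/alice</uri></binding>
      <binding name="name"><literal>Alice</literal></binding>
      <binding name="mbox"><unbound/></binding>
    </result>
    <result>
      <binding name="person"><uri>http://example.org/alice</uri></binding>
      <binding name="name"><unbound/></binding>
      <binding name="mbox"><uri>mailto:alice@example.org</uri></binding>
    </result>
  </results>
</sparql>"""

def is_unbound(binding):
    return binding.find("unbound") is not None

def strip_unbound(root):
    """Variant 1: keep <binding name=".."/> but drop the <unbound/> child."""
    out = copy.deepcopy(root)
    for binding in list(out.iter("binding")):
        if is_unbound(binding):
            for child in list(binding):
                binding.remove(child)
    return out

def collapse_unbound(root):
    """Variant 2: drop the whole <binding> when the variable is unbound."""
    out = copy.deepcopy(root)
    for result in list(out.iter("result")):
        for binding in list(result):
            if is_unbound(binding):
                result.remove(binding)
    return out

def size(root):
    """Serialized size in bytes."""
    return len(ET.tostring(root, encoding="utf-8"))

original = ET.fromstring(SAMPLE)
stripped = strip_unbound(original)
collapsed = collapse_unbound(original)
print(size(original), size(stripped), size(collapsed))
```

A consumer of the collapsed form would treat any variable listed in the
head but absent from a row as unbound.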

To study the effects of these changes, I've picked up two large foaf
data sources. One is a scutter dump from mattb[3], and the other is an
old Julie dump from Christopher Schmidt.  I've placed copies at [4].
I used a query[5] to pick out every person, and optionally their name,
mailbox, homepage, mbox_sha1sum, nick, and seeAlso links.  The people in
the data may have many more properties (knows, surname, aim addresses,
depictions, made, etc.) that the query does not ask for.

Using ARQ to generate the xml results, I made two result files[6].  The
julie-dump xml results were 121 MB with 234K rows, and the scutter dump
was 25 MB with 46K rows.

Using some simple xslt[7], I was able to create sample result sets with
the unbounds stripped[8] and the bindings collapsed[9].  The stripped
files were about 68% of the size of the originals, while the collapsed
files were about 45% of the originals' size.

Parse time followed a similar pattern.  I used a dumb script[10] that
timed how long it took for expat's xmlwf to complete.  The stripped
files took about 61% of the time needed to parse the complete files, and
the collapsed files took about 42% of that time.  The raw results are
at [11].
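
For anyone who wants to try a comparable measurement, here is a minimal
Python sketch.  It is not the script at [10]: it uses Python's expat
binding rather than xmlwf, and it times synthetic documents (one bound
and six unbound variables per row, an arbitrary choice), so its numbers
will not match the ones above.

```python
import time
import xml.parsers.expat

# How an unbound variable is written in each variant.
FORMS = {
    "original": '<binding name="{name}"><unbound/></binding>',
    "stripped": '<binding name="{name}"/>',
    "collapsed": "",
}

def make_doc(rows, unbound_form):
    """Build a synthetic result document as UTF-8 bytes."""
    bound = '<binding name="person"><uri>http://example.org/p</uri></binding>'
    names = ["name", "mbox", "homepage", "sha1", "nick", "seeAlso"]
    unbound = "".join(unbound_form.format(name=n) for n in names)
    row = "<result>" + bound + unbound + "</result>"
    doc = "<sparql><results>" + row * rows + "</results></sparql>"
    return doc.encode("utf-8")

def parse_seconds(data, repeats=5):
    """Best-of-N wall-clock time for a full expat parse with no handlers."""
    best = float("inf")
    for _ in range(repeats):
        parser = xml.parsers.expat.ParserCreate()
        start = time.perf_counter()
        parser.Parse(data, True)  # True marks the final chunk
        best = min(best, time.perf_counter() - start)
    return best

docs = {label: make_doc(20000, form) for label, form in FORMS.items()}
baseline = parse_seconds(docs["original"])
for label, data in docs.items():
    ratio = parse_seconds(data) / baseline
    print(f"{label}: {len(data)} bytes, {ratio:.0%} of original parse time")
```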

There is a third possibility, much more remote, which would work
independently of the previously suggested changes.  That would be to
have an operator like UNION, but one that allows matching graph patterns
to be presented on the same row.  This would effectively fall somewhere
between UNION and OPTIONAL, and facilitate querying for multiple arity
predicates.  Query cascading might have similar benefits.  I wouldn't
expect anything like this to be done in last call.

However, I would like to see some discussion on the size of the results


[1] See "Multiple Arity Predicates" on
[2] Andy Seaborne's UNION comment on
[3] Originally from

Received on Saturday, 6 August 2005 14:53:37 UTC