binary query results format, draft

Attached you will find a description of a binary query results format,
based on the binary format currently implemented in Sesame.

If you care to play with the result format, you can go to
http://www.openrdf.org/sesame/ , choose the Museum demonstrator
database, and select the 'SeRQL-s' option. The interface allows you to
specify a SeRQL query (examples are available) and you can set the
desired result format (HTML, XML, RDF, or binary).

Comments and feedback most welcome.

Cheers,

Jeen
-- 
Jeen Broekstra          Aduna BV
Knowledge Engineer      Julianaplein 14b, 3817 CS Amersfoort
http://aduna.biz        The Netherlands
tel. +31 33 46599877

This document is based on the binary query results format as available in
the Sesame RDF Framework (http://www.openrdf.org/), release 1.2 and later.
The relevant Javadoc for Sesame's implementation can be found at
http://www.openrdf.org/doc/sesame/api/org/openrdf/sesame/query/BinaryTableResultConstants.html

A Binary Query Results Format
-----------------------------

Motivation
==========

This binary query results format is designed with these main goals in
mind:

 - minimizing the number of bytes sent over the network.
 - allowing streaming processing of a query result.

Historically, the format was conceived as an answer to use cases in the
Sesame community where large query results, containing duplicate column
values in multiple rows or many null (unbound) values, have to be
communicated between the server and a client application. The binary
result format typically encodes a query result in 5-25% of the space
needed when using the SPARQL XML query result format.

Header
======

Results encoded in the binary query results format consist of a header
followed by zero or more records. Values are stored in network order
(Big-Endian).

The header is 12 bytes long:

 - Bytes 0-3 contain the ASCII codes for the string "BRTR", which stands for
   Binary RDF Table Result.
 - Bytes 4-7 specify the format version (a 32-bit signed integer).
 - Bytes 8-11 specify the number of columns of the query result that will
   follow (a 32-bit signed integer).

For example, a header for a result in format version 1, containing 5
columns, will look like this:

  byte: 0  1  2  3 |  4  5  6  7 |  8  9  10  11
-------------------+-------------+--------------
 value: B  R  T  R |  0  0  0  1 |  0  0   0   5

Following this are the column headers, each encoded as a UTF-8 string
(see the section on UTF-8 encoding of strings, below, for details). The
number of column headers equals the column count specified in the
header.
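
As an illustration, the header and column headers could be read and written
in Java roughly as follows. This is a minimal sketch, not Sesame's actual
code: the class BrtrHeaderSketch and its method names are made up for this
example, and rejecting format versions other than 1 is an assumption. It
relies on DataInputStream.readInt (big-endian) and DataInputStream.readUTF,
which reads a 2-byte length prefix followed by modified UTF-8, matching the
string encoding described in the UTF-8 section below.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration; not part of Sesame's API.
class BrtrHeaderSketch {

    static final byte[] MAGIC = {'B', 'R', 'T', 'R'};

    /** Reads the 12-byte header plus the column headers; returns the column names. */
    static List<String> readColumnNames(DataInputStream in) throws IOException {
        byte[] magic = new byte[4];
        in.readFully(magic);                    // bytes 0-3: "BRTR"
        if (!Arrays.equals(magic, MAGIC)) {
            throw new IOException("Not a binary RDF table result");
        }
        int version = in.readInt();             // bytes 4-7, big-endian
        if (version != 1) {                     // assumption: only version 1 is understood
            throw new IOException("Unsupported format version: " + version);
        }
        int columnCount = in.readInt();         // bytes 8-11, big-endian
        if (columnCount < 0) {
            throw new IOException("Negative column count: " + columnCount);
        }
        List<String> names = new ArrayList<>();
        for (int i = 0; i < columnCount; i++) {
            names.add(in.readUTF());            // 2-byte length prefix + modified UTF-8
        }
        return names;
    }

    /** Writes a version 1 header followed by the given column headers. */
    static byte[] writeHeader(List<String> names) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.write(MAGIC);
        out.writeInt(1);
        out.writeInt(names.size());
        for (String name : names) {
            out.writeUTF(name);
        }
        out.flush();
        return buffer.toByteArray();
    }
}
```

After readColumnNames returns, the stream is positioned at the first record.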

Results
=======

Zero or more records follow the column headers. These can be a mixture of
result records, which describe values in the results table, and supporting
records. The result records are written row by row: from left to right
within a row, and rows from top to bottom. Each record starts with a record
type marker (a single byte). The following records are defined in the
current format:

- NULL (byte value: 0):
    This indicates a NULL value in the table and consists of nothing more
    than the record type marker. This marker is used to indicate that
    there is no binding for the current variable.

- REPEAT (byte value: 1):
    This indicates that the next value is identical to the value in the same
    column in the previous row. The REPEAT record consists of nothing more than
    the record type marker.

- NAMESPACE (byte value: 2):
    This is a supporting record that assigns an ID (non-negative integer)
    to a namespace. This ID can later be used in a QNAME record to combine
    it with a local name to form a full URI. The record type marker is
    followed by a non-negative 32-bit signed integer for the ID and a
    UTF-8 encoded string for the namespace.

- QNAME (byte value: 3):
    This indicates a URI value, the value of which is encoded as a
    namespace ID and a local name. The namespace ID is required to be
    mapped to a namespace in a previous NAMESPACE record. The record type
    marker is followed by a non-negative 32-bit signed integer (the
    namespace ID) and a UTF-8 encoded string for the local name.

- URI (byte value: 4):
    This also indicates a URI value, but one that does not use a namespace ID.
    The record type marker is simply followed by a UTF-8 encoded string for the
    full URI.

- BNODE (byte value: 5):
    This indicates a blank node. The record type marker is followed by a UTF-8
    encoded string for the bnode ID.

- PLAIN_LITERAL (byte value: 6):
    This indicates a plain literal value. The record type marker is followed by
    a UTF-8 encoded string for the literal's label.

- LANG_LITERAL (byte value: 7):
    This indicates a literal value with a language attribute. The record type
    marker is followed by a UTF-8 encoded string for the literal's label,
    followed by a UTF-8 encoded string for the language attribute.

- DATATYPE_LITERAL (byte value: 8):
    This indicates a datatyped literal. The record type marker is followed by a
    UTF-8 encoded string for the literal's label. Following this label is either
    a QNAME or URI record for the literal's datatype.

- ERROR (byte value: 126):
    This record indicates an error. The type of error is indicated by the byte
    directly following the record type marker: 1 for a malformed query
    error, 2 for a query evaluation error. The error type byte is
    followed by a UTF-8 encoded string for the error message.

- TABLE_END (byte value: 127):
    This is a special record that indicates the end of the results table and
    consists of nothing more than the record type marker. Any data following this
    record should be ignored.
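
To make the record types above concrete, here is a rough Java sketch of a
reader loop that turns the record stream into rows. It is not Sesame's
implementation: the class BrtrRecordSketch, its methods, and the choice to
render values as N-Triples-like strings (without escaping) are assumptions
made for this example. It assumes the stream is positioned just after the
column headers and that the column count is at least 1.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class for illustration; not part of Sesame's API.
class BrtrRecordSketch {

    // Record type markers as listed above.
    static final int NULL = 0, REPEAT = 1, NAMESPACE = 2, QNAME = 3, URI = 4,
            BNODE = 5, PLAIN_LITERAL = 6, LANG_LITERAL = 7, DATATYPE_LITERAL = 8,
            ERROR = 126, TABLE_END = 127;

    /** Reads records until TABLE_END; returns one String[] per row (null = unbound). */
    static List<String[]> readRows(DataInputStream in, int columnCount) throws IOException {
        Map<Integer, String> namespaces = new HashMap<>();
        List<String[]> rows = new ArrayList<>();
        String[] previous = new String[columnCount]; // row above, consulted by REPEAT
        String[] current = new String[columnCount];
        int col = 0;
        while (true) {
            int type = in.readUnsignedByte();
            String value;
            switch (type) {
                case TABLE_END:
                    return rows;                     // data after this marker is ignored
                case NAMESPACE: {                    // supporting record, fills no cell
                    int id = in.readInt();
                    namespaces.put(id, in.readUTF());
                    continue;
                }
                case ERROR: {
                    int errorType = in.readUnsignedByte(); // 1 = malformed query, 2 = evaluation
                    throw new IOException("Query error (type " + errorType + "): " + in.readUTF());
                }
                case NULL:
                    value = null;
                    break;
                case REPEAT:
                    value = previous[col];
                    break;
                case QNAME:
                case URI:
                    value = "<" + readUri(type, in, namespaces) + ">";
                    break;
                case BNODE:
                    value = "_:" + in.readUTF();
                    break;
                case PLAIN_LITERAL:
                    value = "\"" + in.readUTF() + "\"";
                    break;
                case LANG_LITERAL: {
                    String label = in.readUTF();
                    value = "\"" + label + "\"@" + in.readUTF();
                    break;
                }
                case DATATYPE_LITERAL: {
                    String label = in.readUTF();
                    int datatypeRecord = in.readUnsignedByte(); // must be QNAME or URI
                    if (datatypeRecord != QNAME && datatypeRecord != URI) {
                        throw new IOException("Bad datatype record type: " + datatypeRecord);
                    }
                    value = "\"" + label + "\"^^<" + readUri(datatypeRecord, in, namespaces) + ">";
                    break;
                }
                default:
                    throw new IOException("Unknown record type: " + type);
            }
            current[col++] = value;
            if (col == columnCount) {                // row complete, start the next one
                rows.add(current);
                previous = current;
                current = new String[columnCount];
                col = 0;
            }
        }
    }

    /** Reads the body of a URI or QNAME record and returns the full URI. */
    static String readUri(int type, DataInputStream in, Map<Integer, String> namespaces)
            throws IOException {
        if (type == URI) {
            return in.readUTF();
        }
        int id = in.readInt();
        String namespace = namespaces.get(id);
        if (namespace == null) {
            throw new IOException("Undefined namespace ID: " + id);
        }
        return namespace + in.readUTF();
    }
}
```

Because every record is either a bare marker or a marker followed by
fixed-size and length-prefixed fields, a loop like this can hand each row to
the application as soon as it is complete instead of collecting all rows
first.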

UTF-8 String encoding
=====================

(Note: In the current Sesame implementation, a modified UTF-8 encoding
scheme is used, which is the default UTF-8 encoding scheme in Java (see
http://java.sun.com/j2se/1.5.0/docs/api/java/io/DataInput.html#modified-utf-8
for details). Obviously this can be generalized/changed to be more
standards-compliant. I am documenting the current Sesame/Java scheme here
for now though).

Each value encoded as a UTF-8 string is preceded by a 2-byte prefix
indicating the byte-length of the encoded string. The length is stored as
a 16-bit unsigned integer.

(Note: 2 bytes may not be big enough for our purposes, so we might want to
 change this to 4 bytes. Fortunately we have a format version indicator :))

Each character in the string is converted to a group of one, two, or three
bytes, depending on the value of the character.

If a character c is in the range \u0001 through \u007f, it is represented
by one byte:

(byte)c 

If a character c is \u0000 or is in the range \u0080 through \u07ff, then
it is represented by two bytes, to be written in the order shown:

 (byte)(0xc0 | (0x1f & (c >> 6)))
 (byte)(0x80 | (0x3f & c))
  
If a character c is in the range \u0800 through \uffff, then it is
represented by three bytes, to be written in the order shown:

 (byte)(0xe0 | (0x0f & (c >> 12)))
 (byte)(0x80 | (0x3f & (c >>  6)))
 (byte)(0x80 | (0x3f & c))

The length prefix indicates the length of the resulting array of bytes.
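
The rules above can be put together in a few lines of Java. The following
is a sketch for illustration only; the class ModifiedUtf8Sketch and its
encode method are invented names. In practice, Java's
DataOutputStream.writeUTF already produces this encoding, including the
2-byte length prefix, so hand-written code like this is mainly useful as a
reference when porting to another language.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical class for illustration; not part of Sesame's API.
class ModifiedUtf8Sketch {

    /** Returns the 2-byte big-endian length prefix followed by the encoded characters. */
    static byte[] encode(String s) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);                 // a UTF-16 code unit
            if (c >= 0x0001 && c <= 0x007f) {     // one byte
                body.write(c);
            } else if (c <= 0x07ff) {             // \u0000 or \u0080-\u07ff: two bytes
                body.write(0xc0 | (0x1f & (c >> 6)));
                body.write(0x80 | (0x3f & c));
            } else {                              // \u0800-\uffff: three bytes
                body.write(0xe0 | (0x0f & (c >> 12)));
                body.write(0x80 | (0x3f & (c >> 6)));
                body.write(0x80 | (0x3f & c));
            }
        }
        byte[] bytes = body.toByteArray();
        if (bytes.length > 0xffff) {              // the prefix is only 16 bits wide
            throw new IllegalArgumentException("Encoded string exceeds 65535 bytes");
        }
        byte[] result = new byte[bytes.length + 2];
        result[0] = (byte) (bytes.length >>> 8);  // high byte of the length first
        result[1] = (byte) bytes.length;
        System.arraycopy(bytes, 0, result, 2, bytes.length);
        return result;
    }
}
```

For example, encode("A\u00e9") yields the four bytes 00 03 41 c3 a9 minus
nothing: a length prefix of 00 03, then 41 for "A" and c3 a9 for the
two-byte character \u00e9, five bytes in total.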

Received on Thursday, 27 October 2005 14:12:10 UTC