- From: Jeen Broekstra <jeen@aduna.biz>
- Date: Thu, 27 Oct 2005 16:13:16 +0200
- To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
- Message-ID: <4360E07C.7060505@aduna.biz>
Attached you will find a description of a binary query results format,
based on the binary format currently implemented in Sesame.

If you care to play with the result format, you can go to
http://www.openrdf.org/sesame/ , choose the Museum demonstrator
database, and select the 'SeRQL-s' option. The interface allows you to
specify a SeRQL query (examples are available) and you can set the
desired result format (HTML, XML, RDF, or binary).

Comments and feedback most welcome.

Cheers,

Jeen
--
Jeen Broekstra          Aduna BV
Knowledge Engineer      Julianaplein 14b, 3817 CS Amersfoort
http://aduna.biz        The Netherlands
tel. +31 33 46599877

This document is based on the binary query results format as available
in the Sesame RDF Framework (http://www.openrdf.org/), release 1.2 and
later. The relevant Javadoc for Sesame's implementation can be found at
http://www.openrdf.org/doc/sesame/api/org/openrdf/sesame/query/BinaryTableResultConstants.html

A Binary Query Results Format
-----------------------------

Motivation
==========

This binary query results format is designed with these main goals in
mind:

- minimizing the number of bytes sent over the network.
- allowing streaming processing of a query result.

Historically, the format was conceived as an answer to use cases in the
Sesame community where large query results containing duplicate column
values in multiple rows or lots of null (unbound) values have to be
communicated between the server and a client application.

The binary result format typically encodes a query result in 5-25% of
the space needed when using the SPARQL XML query result format. The
format allows streaming processing.

Header
======

Results encoded in the binary query results format consist of a header
followed by zero or more records. Values are stored in network order
(Big-Endian). The header is 12 bytes long:

- Bytes 0-3 contain the ASCII codes for the string "BRTR", which stands
  for Binary RDF Table Result.
- Bytes 4-7 specify the format version (a 32-bit signed integer).
- Bytes 8-11 specify the number of columns of the query result that
  will follow (a 32-bit signed integer).

For example, a header for a result in format version 1, containing 5
columns, will look like this:

  byte:   0  1  2  3 |  4  5  6  7 |  8  9 10 11
         ------------+-------------+-------------
  value:  B  R  T  R |  0  0  0  1 |  0  0  0  5

Following this are the column headers, which are encoded as UTF-8
strings (see section on UTF-8 encoding of strings, below, for details).
There are as many column headers as the number of columns that has been
specified in the header.
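
As an illustration of the 12-byte header layout described above, here is
a minimal Python sketch. The function names pack_header and parse_header
are hypothetical and are not part of Sesame; the sketch covers only the
fixed header, not the column headers that follow it.

```python
import struct

MAGIC = b"BRTR"  # ASCII marker from bytes 0-3 of the header


def pack_header(version, column_count):
    """Build the 12-byte header: magic, then two big-endian signed 32-bit ints."""
    return MAGIC + struct.pack(">ii", version, column_count)


def parse_header(data):
    """Return (version, column_count) from the first 12 bytes of a result."""
    if len(data) < 12:
        raise ValueError("header must be at least 12 bytes long")
    magic, version, column_count = struct.unpack(">4sii", data[:12])
    if magic != MAGIC:
        raise ValueError("not a BRTR result: bad magic %r" % magic)
    return version, column_count


# The example from the text: format version 1, 5 columns.
header = pack_header(1, 5)
version, column_count = parse_header(header)
```

The ">" prefix in the struct format string selects network (big-endian)
byte order, matching the statement that all values are stored that way.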

Results
=======

Zero or more records follow after the column headers. This can be a
mixture of records describing a result and supporting records. The
results table is described by the result records, which are written
from left to right, from top to bottom. Each record starts with a
record type marker (a single byte). The following records are defined
in the current format:

- NULL (byte value: 0):
  This indicates a NULL value in the table and consists of nothing more
  than the record type marker. This marker is used to indicate that
  there is no binding for the current variable.

- REPEAT (byte value: 1):
  This indicates that the next value is identical to the value in the
  same column in the previous row. The REPEAT record consists of
  nothing more than the record type marker.

- NAMESPACE (byte value: 2):
  This is a supporting record that assigns an ID (non-negative integer)
  to a namespace. This ID can later be used in a QNAME record to
  combine it with a local name to form a full URI. The record type
  marker is followed by a non-negative 32-bit signed integer for the ID
  and a UTF-8 encoded string for the namespace.

- QNAME (byte value: 3):
  This indicates a URI value, the value of which is encoded as a
  namespace ID and a local name. The namespace ID is required to be
  mapped to a namespace in a previous NAMESPACE record. The record type
  marker is followed by a non-negative 32-bit signed integer (the
  namespace ID) and a UTF-8 encoded string for the local name.

- URI (byte value: 4):
  This also indicates a URI value, but one that does not use a
  namespace ID. This record type marker is simply followed by a UTF-8
  encoded string for the full URI.

- BNODE (byte value: 5):
  This indicates a blank node. The record type marker is followed by a
  UTF-8 encoded string for the bnode ID.

- PLAIN_LITERAL (byte value: 6):
  This indicates a plain literal value. The record type marker is
  followed by a UTF-8 encoded string for the literal's label.

- LANG_LITERAL (byte value: 7):
  This indicates a literal value with a language attribute. The record
  type marker is followed by a UTF-8 encoded string for the literal's
  label, followed by a UTF-8 encoded string for the language attribute.

- DATATYPE_LITERAL (byte value: 8):
  This indicates a datatyped literal. The record type marker is
  followed by a UTF-8 encoded string for the literal's label. Following
  this label is either a QNAME or URI record for the literal's
  datatype.

- ERROR (byte value: 126):
  This record indicates an error. The type of error is indicated by the
  byte directly following the record type marker: 1 for a malformed
  query error, 2 for a query evaluation error. The error type byte is
  followed by a UTF-8 string for the error message.

- TABLE_END (byte value: 127):
  This is a special record that indicates the end of the results table
  and consists of nothing more than the record type marker. Any data
  following this record should be ignored.

UTF-8 String encoding
=====================

(Note: In the current Sesame implementation, a modified UTF-8 encoding
scheme is used, which is the default UTF-8 encoding scheme in Java (see
http://java.sun.com/j2se/1.5.0/docs/api/java/io/DataInput.html#modified-utf-8
for details). Obviously this can be generalized/changed to be more
standards-compliant. I am documenting the current Sesame/Java scheme
here for now though).

Each value encoded as a UTF-8 string is preceded by a 2-byte prefix
indicating the byte-length of the encoded string. The length is stored
as a 16-bit unsigned short integer.

(Note: 2 bytes may not be big enough for our purposes, so we might want
to change this to 4 bytes. Fortunately we have a format version
indicator :))

Each character in the string is converted to a group of one, two, or
three bytes, depending on the value of the character.

If a character c is in the range \u0001 through \u007f, it is
represented by one byte:

  (byte)c

If a character c is \u0000 or is in the range \u0080 through \u07ff,
then it is represented by two bytes, to be written in the order shown:

  (byte)(0xc0 | (0x1f & (c >> 6)))
  (byte)(0x80 | (0x3f & c))

If a character c is in the range \u0800 through \uffff, then it is
represented by three bytes, to be written in the order shown:

  (byte)(0xe0 | (0x0f & (c >> 12)))
  (byte)(0x80 | (0x3f & (c >> 6)))
  (byte)(0x80 | (0x3f & c))

The length prefix indicates the length of the resulting array of bytes.
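
To make the string rules above concrete, and to show how they combine
with the record markers, here is a rough Python sketch. It is not the
Sesame implementation: the names encode_string, read_string and
read_rows are hypothetical, only a subset of the record types (NULL,
REPEAT, URI, PLAIN_LITERAL, TABLE_END) is handled, and representing
values as (kind, text) tuples is an assumption made for illustration.
Since the rules are stated for Java chars, the sketch works on UTF-16
code units.

```python
import io
import struct

NULL, REPEAT, URI, PLAIN_LITERAL, TABLE_END = 0, 1, 4, 6, 127


def encode_string(text):
    """Length-prefixed modified UTF-8, following the three rules in the text."""
    raw = text.encode("utf-16-be", "surrogatepass")
    units = struct.unpack(">%dH" % (len(raw) // 2), raw)  # UTF-16 code units
    out = bytearray()
    for c in units:
        if 0x0001 <= c <= 0x007F:
            out.append(c)
        elif c <= 0x07FF:  # \u0000 and \u0080 through \u07ff
            out += bytes([0xC0 | (0x1F & (c >> 6)), 0x80 | (0x3F & c)])
        else:  # \u0800 through \uffff
            out += bytes([0xE0 | (0x0F & (c >> 12)),
                          0x80 | (0x3F & (c >> 6)),
                          0x80 | (0x3F & c)])
    if len(out) > 0xFFFF:
        raise ValueError("encoded string does not fit the 2-byte length prefix")
    return struct.pack(">H", len(out)) + bytes(out)


def read_string(stream):
    """Read the 2-byte length prefix, then decode that many bytes."""
    (length,) = struct.unpack(">H", stream.read(2))
    data, units, i = stream.read(length), [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:  # one-byte group
            units.append(b)
            i += 1
        elif b >> 5 == 0b110:  # two-byte group
            units.append(((b & 0x1F) << 6) | (data[i + 1] & 0x3F))
            i += 2
        else:  # three-byte group
            units.append(((b & 0x0F) << 12) | ((data[i + 1] & 0x3F) << 6)
                         | (data[i + 2] & 0x3F))
            i += 3
    raw = struct.pack(">%dH" % len(units), *units)
    return raw.decode("utf-16-be", "surrogatepass")


def read_rows(stream, columns):
    """Decode records into rows, left to right, top to bottom."""
    rows, previous, current = [], [None] * columns, []
    while True:
        marker = stream.read(1)
        if not marker:
            raise EOFError("stream ended before TABLE_END")
        if marker[0] == TABLE_END:
            return rows
        if marker[0] == NULL:
            value = None
        elif marker[0] == REPEAT:  # same column, previous row
            value = previous[len(current)]
        elif marker[0] == URI:
            value = ("uri", read_string(stream))
        elif marker[0] == PLAIN_LITERAL:
            value = ("literal", read_string(stream))
        else:
            raise ValueError("record type %d not handled in this sketch" % marker[0])
        current.append(value)
        if len(current) == columns:
            rows.append(current)
            previous, current = current, []


# Two rows of two columns; the second row repeats column 1 and leaves column 2 unbound.
payload = (bytes([URI]) + encode_string("http://example.org/a")
           + bytes([PLAIN_LITERAL]) + encode_string("caf\u00e9")
           + bytes([REPEAT, NULL, TABLE_END]))
rows = read_rows(io.BytesIO(payload), columns=2)
```

In this example the second row costs two bytes on the wire (one REPEAT
and one NULL marker), which is the kind of saving the format aims for
with duplicate and unbound values.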
Received on Thursday, 27 October 2005 14:12:10 UTC