On 23 Jul 2008, at 10:07, Olivier Rossel wrote:
I was wondering how to improve the loading time of RDF files in
semantic web frameworks.
And then came a question: is RDF efficient to load?
The obvious answer is no.
I'm not sure that is obvious, but go on... Have you done it? ;-) (Just kidding. Maybe better: have you noticed how inefficient it is?)
In my early binary/efficient XML work, I imagined a number of ways to make XML faster while staying text-based. There are too many good ways to improve it without that constraint, as we've shown in the W3C XBC and EXI working groups. Text-formatted data, outside of a few special cases, is much slower to process than some good alternatives. One big point, however, with having "efficient XML" and "efficient RDF" is the ability to express the same data in text or binary form, including the ability to losslessly "round trip". Some purists want to "defend" XML by insisting that any other encoding is "not XML" and would pollute / confuse the market too much, detracting from the success of XML. Some of us think that having an encoding that is more widely usable in more application situations, while also improving many existing application uses and being equivalent to the text-based standard, only improves the value of XML. I feel the same would hold true, even more so, for RDF.

Making it readable for humans makes it definitely slower to load in programs.

And I'm not convinced about that, either.
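To make the "lossless round trip" point concrete, here is a minimal sketch of how round-trip fidelity can be checked with graph isomorphism. It is my own illustration, not any proposed format: it assumes rdflib 6+ is installed, uses a placeholder file "data.ttl", and uses N-Triples as the stand-in alternate encoding, since rdflib has no binary RDF serializer.

    # Hypothetical round-trip check: serialize a graph to another encoding,
    # parse it back, and verify the result is isomorphic to the original.
    # N-Triples stands in here for a binary encoding.
    from rdflib import Graph
    from rdflib.compare import isomorphic

    original = Graph()
    original.parse("data.ttl", format="turtle")   # "data.ttl" is a placeholder

    round_tripped = Graph()
    round_tripped.parse(data=original.serialize(format="nt"), format="nt")

    # Isomorphism (rather than naive equality) handles blank-node relabelling.
    assert isomorphic(original, round_tripped), "round trip lost information"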
Gzipping, as with XML, only increases the CPU / memory required. It helps with size, which also does help a bit with network latency in some cases. Frequently, however, bandwidth isn't the issue; CPU and memory bandwidth are. Often they both are. Note that a properly designed format may get the benefits of gzip-like compression without incurring nearly as much cost as such a generic, search-based algorithm, while possibly requiring much less decode effort.

So I came to another question:
Is there a computer-optimized format for RDF?
Something that would make it load much faster.
For small numbers of triples you may be right, but (as Bijan says) gzipped n-triples are probably adequate.
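A rough way to test whether gzipped n-triples are adequate for a given workload is to time plain versus gzipped parsing directly. This is a minimal sketch, assuming rdflib is installed; "data.nt" and "data.nt.gz" are placeholder files.

    # Compare the load time of plain and gzipped N-Triples.
    # gzip shrinks the file but adds decompression work on top of parsing.
    import gzip
    import time
    from rdflib import Graph

    def time_load(open_fn, label):
        g = Graph()
        start = time.perf_counter()
        with open_fn() as f:
            g.parse(f, format="nt")
        print(f"{label}: {len(g)} triples in {time.perf_counter() - start:.2f}s")

    time_load(lambda: open("data.nt", "rb"), "plain")
    time_load(lambda: gzip.open("data.nt.gz", "rb"), "gzipped")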
Let us never mention binary xml on this list again :-)

Sorry, I'm a believer (in binary/efficient XML/RDF...).
For large numbers of triples, in my limited experience, the things that affect RDF load speed are:

Parsing and serialization become principal problems as application data and transaction rates increase, especially after other optimization is completed. It is not just parsing/serialization at the database interface, but the whole application stack. Either RDF (like XML) is a good data representation that gets used more and more widely, or it isn't, and it gets marginalized while something more efficient is used except at required external interfaces.
The speed of your disk.
The size of your memory.
Building indexes.
Duplicate suppression (triple, node, whatever).
BNode handling.
IRI and datatype checks (if you do them).
Parsing.
Now parsing is a factor, but it's fairly minor compared with the basic business of storing the triples. Stores would probably get more benefit from simple processing instructions like 'this contains no dupes' and 'my bnode ids are globally unique'.
Damian
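As a rough illustration of the kind of processing instructions Damian mentions, here is a minimal sketch of a loader that skips duplicate suppression and blank-node renaming when the producer promises the data is already clean. All names here are hypothetical, not any store's real API.

    # Hypothetical loader honouring two producer hints:
    #   no_dupes                -> skip the per-triple duplicate check
    #   bnodes_globally_unique  -> skip local blank-node renaming
    def load(triples, no_dupes=False, bnodes_globally_unique=False):
        rows = []
        seen = set()
        bnode_map = {}

        def localise(term):
            # Rename "_:x" blank nodes so they cannot collide with bnodes
            # already in the store, unless they are promised to be unique.
            if term.startswith("_:") and not bnodes_globally_unique:
                return bnode_map.setdefault(term, f"_:b{len(bnode_map)}")
            return term

        for s, p, o in triples:
            triple = (localise(s), p, localise(o))
            if not no_dupes:
                if triple in seen:
                    continue
                seen.add(triple)
            rows.append(triple)
        return rows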
Storing them in memory is not nearly as expensive as parsing or serialization. Both of those steps are expensive, and adding gzip only increases the expense. Modern application architectures have a lot more components, tiers, and communication events than just an application talking directly to a database.

For large numbers of triples, in my limited experience, the things that affect RDF load speed ...

Ooo, I got a bit side-tracked by the parsing bit.

... are: The speed of your disk. The size of your memory. Building indexes. Duplicate suppression (triple, node, whatever). BNode handling. IRI and datatype checks (if you do them). Parsing. Now parsing is a factor, but it's fairly minor compared with the basic business of storing the triples.

Indeed.
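Since the posts above disagree on whether parsing or storing dominates, one rough way to measure it for a given dataset is to time a full parse and then time re-inserting the already-parsed triples into a fresh graph. A minimal sketch, assuming rdflib; "data.nt" is a placeholder file.

    # Separate "parse + store" time from "store only" time.
    # The gap between the two approximates the pure parsing cost.
    import time
    from rdflib import Graph

    parsed = Graph()
    t0 = time.perf_counter()
    parsed.parse("data.nt", format="nt")   # parse the file and store the triples
    t1 = time.perf_counter()

    store_only = Graph()
    for triple in parsed:                  # triples are already parsed objects
        store_only.add(triple)
    t2 = time.perf_counter()

    print(f"parse+store: {t1 - t0:.2f}s  store only: {t2 - t1:.2f}s")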
I am working on it. I'll post soon if possible. I will have a rough spec shortly and code sometime after. I would like help reviewing it from a number of points of view. I've come to a few new (and probably some old) conclusions that I think are right, but they need some discussion to validate.

Stores would probably get more benefit from simple processing instructions like 'this contains no dupes' and 'my bnode ids are globally unique'.

SWI Prolog had, IIRC, a mode to dump its internal structures so you would avoid all that overhead (kinda like an image in Smalltalk or Lisp). Obviously databases do this as well. Hard to see that a common format would make a *ton* of sense. I guess you could suppress dups, reconcile bnodes, and a few other things. Indexes? I don't think so. That seems entirely proprietary, and appropriately so.

I can imagine a demand for an RDF exchange format that is actually a position-independent/architecture-independent memory image of an indexed quad store. The sender could include the indexes it thinks will be useful; the receiver could drop/regenerate indexes as needed.
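As a purely hypothetical illustration of that kind of exchange (it does not reflect any existing specification), a sender might ship the quads together with whatever indexes it has already built, and the receiver can keep or discard them. JSON is used here only as a stand-in for a portable container.

    # Hypothetical "quad store image" exchange: quads plus optional
    # pre-built indexes, in a portable container.
    import json

    def dump_image(quads):
        by_subject = {}
        for i, (s, p, o, g) in enumerate(quads):
            by_subject.setdefault(s, []).append(i)   # one example index
        return json.dumps({"quads": quads, "indexes": {"by_subject": by_subject}})

    def load_image(image, keep_indexes=True):
        data = json.loads(image)
        quads = [tuple(q) for q in data["quads"]]
        indexes = data["indexes"] if keep_indexes else {}   # or rebuild locally
        return quads, indexes

    # Example: one quad, round-tripped through the image format.
    image = dump_image([("ex:s", "ex:p", "ex:o", "ex:g")])
    quads, indexes = load_image(image)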
I don't have hard numbers on how many RDF applications are out there and how things are being used; however, in experience with my teams, representing graphs of various kinds of data in RDF, a smallish amount of information would be 500K of data and take several seconds to parse in .Net and Java. There are multiple ways that this can be improved drastically.

This would make sense for the fairly rare applications where network/memory speed outstrips CPU speed -- where parsing time (and such) is the real bottleneck. From time to time I read that the bandwidth improvement curve is steeper than the CPU improvement curve, so we'll all be there eventually. I'm not sure I believe it. If we are, the demand for this kind of RDF format will grow. For now, I don't see much demand.
-- Sandro