On 23 Jul 2008, at 10:07, Olivier Rossel wrote:
I was wondering how to improve the loading time of RDF files in
semantic web frameworks.
And then came a question: is RDF efficient to load?
The obvious answer is no.
I'm not sure that is obvious, but go on... Have you done it? ;-) (Just kidding. Maybe better: have you noticed how inefficient it is?)
In my early binary/efficient XML work, I imagined a number of ways to make XML faster while staying text-based. There are too many good ways to improve it without that constraint, as we've shown in the W3C XBC and EXI working groups. Text-formatted data, outside of a few special cases, is much slower to process than some good alternatives. One big point, however, with having "efficient XML" and "efficient RDF" is the ability to express the same data in text or binary form, including the ability to losslessly "round trip". Some purists want to "defend" XML by insisting that any other encoding is "not XML" and would pollute / confuse the market too much, detracting from the success of XML. Some of us think that having an encoding that is more widely usable in more application situations, while also improving many existing application uses and being equivalent to the text-based standard, only improves the value of XML. I feel the same would hold true, even more so, for RDF.

Making it readable for humans makes it definitely slower to load in programs.

And I'm not convinced about that, either.
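To make the "lossless round trip" point concrete, here is a minimal sketch of how round-trip fidelity can be checked with graph isomorphism. It is my own illustration, not any proposed format: it assumes rdflib 6+ is installed, uses a placeholder file "data.ttl", and uses N-Triples as the stand-in alternate encoding, since rdflib has no binary RDF serializer.

    # Hypothetical round-trip check: serialize a graph to another encoding,
    # parse it back, and verify the result is isomorphic to the original.
    # N-Triples stands in here for a binary encoding.
    from rdflib import Graph
    from rdflib.compare import isomorphic

    original = Graph()
    original.parse("data.ttl", format="turtle")   # "data.ttl" is a placeholder

    round_tripped = Graph()
    round_tripped.parse(data=original.serialize(format="nt"), format="nt")

    # Isomorphism (rather than naive equality) handles blank-node relabelling.
    assert isomorphic(original, round_tripped), "round trip lost information"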
Gzipping, as with XML, only increases the CPU / memory required. It helps with size, which also does help a bit with network latency in some cases. Frequently, however, bandwidth isn't the issue; CPU and memory bandwidth are. Often they both are. Note that a properly designed format may get the benefits of gzip-like compression without incurring nearly as much cost as such a generic, search-based algorithm, while possibly requiring much less decode effort.

So I came to another question:
Is there a computer-optimized format for RDF?
Something that would make it load much faster.
For small numbers of triples you may be right, but (as Bijan says) gzipped n-triples are probably adequate.
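A rough way to test whether gzipped n-triples are adequate for a given workload is to time plain versus gzipped parsing directly. This is a minimal sketch, assuming rdflib is installed; "data.nt" and "data.nt.gz" are placeholder files.

    # Compare the load time of plain and gzipped N-Triples.
    # gzip shrinks the file but adds decompression work on top of parsing.
    import gzip
    import time
    from rdflib import Graph

    def time_load(open_fn, label):
        g = Graph()
        start = time.perf_counter()
        with open_fn() as f:
            g.parse(f, format="nt")
        print(f"{label}: {len(g)} triples in {time.perf_counter() - start:.2f}s")

    time_load(lambda: open("data.nt", "rb"), "plain")
    time_load(lambda: gzip.open("data.nt.gz", "rb"), "gzipped")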
Let us never mention binary xml on this list again :-)

Sorry, I'm a believer (in binary/efficient XML/RDF...).
For large numbers of triples, in my limited experience, the things that affect RDF load speed are:

Parsing and serialization become principal problems as application data and transaction rates increase, especially after other optimization is completed. It is not just parsing/serialization at the database interface, but the whole application stack. Either RDF (like XML) is a good data representation that gets used more and more widely, or it isn't, and it gets marginalized while something more efficient is used except at required external interfaces.
The speed of your disk.
The size of your memory.
Building indexes.
Duplicate suppression (triple, node, whatever).
BNode handling.
IRI and datatype checks (if you do them).
Parsing.
Now parsing is a factor, but it's fairly minor compared with the basic business of storing the triples. Stores would probably get more benefit from simple processing instructions like 'this contains no dupes' and 'my bnode ids are globally unique'.
Damian
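As a rough illustration of the kind of processing instructions Damian mentions, here is a minimal sketch of a loader that skips duplicate suppression and blank-node renaming when the producer promises the data is already clean. All names here are hypothetical, not any store's real API.

    # Hypothetical loader honouring two producer hints:
    #   no_dupes                -> skip the per-triple duplicate check
    #   bnodes_globally_unique  -> skip local blank-node renaming
    def load(triples, no_dupes=False, bnodes_globally_unique=False):
        rows = []
        seen = set()
        bnode_map = {}

        def localise(term):
            # Rename "_:x" blank nodes so they cannot collide with bnodes
            # already in the store, unless they are promised to be unique.
            if term.startswith("_:") and not bnodes_globally_unique:
                return bnode_map.setdefault(term, f"_:b{len(bnode_map)}")
            return term

        for s, p, o in triples:
            triple = (localise(s), p, localise(o))
            if not no_dupes:
                if triple in seen:
                    continue
                seen.add(triple)
            rows.append(triple)
        return rows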
Storing them in memory is not nearly as expensive as parsing or serialization. Both of those steps are expensive, and adding gzip only increases the expense. Modern application architectures have a lot more components, tiers, and communication events than just an application talking directly to a database.

For large numbers of triples, in my limited experience, the things that affect RDF load speed ...

Ooo, I got a bit side-tracked by the parsing bit.

... are: The speed of your disk. The size of your memory. Building indexes. Duplicate suppression (triple, node, whatever). BNode handling. IRI and datatype checks (if you do them). Parsing. Now parsing is a factor, but it's fairly minor compared with the basic business of storing the triples.

Indeed.
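Since the posts above disagree on whether parsing or storing dominates, one rough way to measure it for a given dataset is to time a full parse and then time re-inserting the already-parsed triples into a fresh graph. A minimal sketch, assuming rdflib; "data.nt" is a placeholder file.

    # Separate "parse + store" time from "store only" time.
    # The gap between the two approximates the pure parsing cost.
    import time
    from rdflib import Graph

    parsed = Graph()
    t0 = time.perf_counter()
    parsed.parse("data.nt", format="nt")   # parse the file and store the triples
    t1 = time.perf_counter()

    store_only = Graph()
    for triple in parsed:                  # triples are already parsed objects
        store_only.add(triple)
    t2 = time.perf_counter()

    print(f"parse+store: {t1 - t0:.2f}s  store only: {t2 - t1:.2f}s")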
I am working on it. I'll post soon if possible. I will have a rough spec shortly and code sometime after. I would like help reviewing it from a number of points of view. I've come to a few new (and probably some old) conclusions that I think are right, but they need some discussion to validate.

Stores would probably get more benefit from simple processing instructions like 'this contains no dupes' and 'my bnode ids are globally unique'.

SWI Prolog had, IIRC, a mode to dump its internal structures so you would avoid all that overhead (kinda like an image in Smalltalk or Lisp). Obviously databases do this as well. Hard to see that a common format would make a *ton* of sense. I guess you could suppress dups, reconcile bnodes, and a few other things. Indexes? I don't think so. That seems entirely proprietary, and appropriately so.

I can imagine a demand for an RDF exchange format that is actually a position-independent/architecture-independent memory image of an indexed quad store. The sender could include the indexes it thinks will be useful; the receiver could drop/regenerate indexes as needed.
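As a purely hypothetical illustration of that kind of exchange (it does not reflect any existing specification), a sender might ship the quads together with whatever indexes it has already built, and the receiver can keep or discard them. JSON is used here only as a stand-in for a portable container.

    # Hypothetical "quad store image" exchange: quads plus optional
    # pre-built indexes, in a portable container.
    import json

    def dump_image(quads):
        by_subject = {}
        for i, (s, p, o, g) in enumerate(quads):
            by_subject.setdefault(s, []).append(i)   # one example index
        return json.dumps({"quads": quads, "indexes": {"by_subject": by_subject}})

    def load_image(image, keep_indexes=True):
        data = json.loads(image)
        quads = [tuple(q) for q in data["quads"]]
        indexes = data["indexes"] if keep_indexes else {}   # or rebuild locally
        return quads, indexes

    # Example: one quad, round-tripped through the image format.
    image = dump_image([("ex:s", "ex:p", "ex:o", "ex:g")])
    quads, indexes = load_image(image)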
I don't have hard numbers on how many RDF applications are out there and how things are being used; however, in experience with my teams, representing graphs of various kinds of data in RDF, a smallish amount of information would be 500K of data and take several seconds to parse in .Net and Java. There are multiple ways that this can be improved drastically.

This would make sense for the fairly rare applications where network/memory speed outstrips CPU speed -- where parsing time (and such) is the real bottleneck. From time to time I read that the bandwidth improvement curve is steeper than the CPU improvement curve, so we'll all be there eventually. I'm not sure I believe it. If we are, the demand for this kind of RDF format will grow. For now, I don't see much demand.
-- Sandro