Re: About computer-optimized RDF format.

On 24 Jul 2008, at 03:08, Stephen D. Williams wrote:

> I view RDF and related standards in a number of ways, ranging from  
> simple application use to AI.  One key part is that I think that  
> RDF is the next logical step past XML in data flexibility and  
> clarity of representation.

This is by no means obvious. See the lengthy thread starting:
	<http://www.w3.org/mid/486A0516.2010702@gmail.com>

> Especially when creating integration and database mechanisms.  From  
> a certain point of view, XML seems like a special case of an RDF  
> equivalent representation.

I don't think there's a meaningful sense in which XML is a special case
of RDF. Even if you consider only the data structures and claim that XML
(qua "trees") is a special case of RDF (qua "graphs"), that's just not
very helpful.

>   Although even more inefficient currently.
>
> Damian Steer wrote:
>> On 23 Jul 2008, at 10:07, Olivier Rossel wrote:
>>> I was wondering how to improve the loading time of RDF files in
>>> semantic web frameworks.
>>> And then came a question: is RDF efficient to load?
>>> The obvious answer is no.
>> I'm not sure that is obvious, but go on...
> Have you done it?  ;-)

I'm very sure he has. As the rest of his email showed.

> (Just kidding.  Maybe better: Have you noticed how inefficient it is?)

I think you're conflating a particular implementation with the problem itself.

> Compare loading 1MB in 128K chunks as binary data with loading 1MB  
> of RDF data, or even 1MB of gzipped RDF data.  What's the multiple?

I feel very confident that I can trivially make this comparison come out
however you like. For example, if the binary data is encrypted and
compressed by a very complex algorithm and requires regenerating very
large, complex structures, it will be crushed by parsing RDF into a
list of triples.
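To make the comparison concrete rather than arguing about hypotheticals,
here's the kind of micro-benchmark I'd want to see. A minimal Python
sketch, with hypothetical file names and a deliberately naive N-Triples
line splitter standing in for a real parser:

import time

def read_binary(path, chunk_size=128 * 1024):
    # The binary case: read the whole file in 128 KB chunks.
    chunks = []
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk_size)
            if not buf:
                break
            chunks.append(buf)
    return b"".join(chunks)

def parse_ntriples_naively(path):
    # Crude N-Triples line splitter; a real parser does more work.
    triples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            s, p, rest = line.split(None, 2)  # subject, predicate, object + " ."
            triples.append((s, p, rest.rstrip(" .")))
    return triples

for label, fn, path in [("binary   ", read_binary, "data.bin"),
                        ("n-triples", parse_ntriples_naively, "data.nt")]:
    start = time.perf_counter()
    fn(path)
    print(label, time.perf_counter() - start, "seconds")

The "multiple" you get depends entirely on what you feed it and what you
count as done.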

> You may argue that it's a whole different thing.  I would argue  
> that A) that's not necessarily true and B) loading the binary data  
> is the theoretical maximum to determine efficiency rating

It would be nice if you argued this point instead of asserting it,
preferably with empirical data. And, of course, we don't typically need
a maximally efficient format (unless we also throw in human effort as a
component).

> and C) parsing and serializing data as it is currently done for  
> both XML and RDF is worst case in some important ways.

Like what? Data is better than vague claims.

>>> Making it readable for humans makes it definitely slower to load  
>>> in programs.
>> And I'm not convinced about that, either.
> In my early binary/efficient XML work, I imagined a number of ways  
> to make XML faster while staying text-based.  There are too many  
> good ways to improve it without that constraint, as we've shown in  
> W3C XBC and EXI working groups.  Text-formatted data, outside of a  
> few special cases, is much slower to process than some good  
> alternatives.

Again, we're not being especially grounded here. At the moment, I don't
think that RDF loading from text formats is a key issue (although I have
been bitten by multi-day loads :( But that was really a matter of a bad
implementation).

>   One big point however with having "efficient XML" and "efficient  
> RDF" is the ability to express the same data in text or binary  
> form, including the ability to losslessly "round trip".  Some more- 
> purists want to "defend" XML by insisting that any other encoding  
> is "not XML" and would pollute / confuse the market too much,  
> detracting from the success of XML.  Some of us think that having  
> an encoding that is more widely usable in more application  
> situations while also improving many existing application uses and  
> being equivalent to the text-based standard only improves the value  
> of XML.  I feel the same would hold true, more so, for RDF.

Without specifics, this isn't at all clear.

> Which is not to say, by me anyway, that new insights to desirable  
> features might not come up in the process that weren't apparent in  
> a text-only context.
>>> So I came to another question:
>>> Is there a computer-optimized format for RDF?
>>> Something that would make it load much faster.
>>
>> For small numbers of triples you may be right, but (as Bijan says)  
>> gzipped n-triples are probably adequate.
> Gzipping, as with XML, only increases the CPU / memory required.

Perhaps CPU, but not necessarily memory. And this isn't typically CPU  
bound.
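To illustrate the memory point: streaming decompression works a buffer
at a time, so the working set stays roughly constant no matter how big
the file is. A minimal Python sketch, with a hypothetical file name:

import gzip

# gzip.open decompresses incrementally, so memory use stays roughly
# constant regardless of file size, and disk I/O is the compressed size.
count = 0
with gzip.open("data.nt.gz", "rt", encoding="utf-8") as f:
    for line in f:
        if line.strip() and not line.startswith("#"):
            count += 1
print(count, "statement lines read from compressed input")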

>   It helps with size, which also does help a bit with network  
> latency in some cases.

It helps with disk access. The less I/O you do, the better off you are,
generally (hence the tweaking of buffer sizes).

>   Frequently however, bandwidth isn't the issue, CPU and memory  
> bandwidth is.  Often they both are.

I'd be surprised, very surprised, if CPU or memory *bandwidth* were at
all an issue, or an issue that wasn't greatly improved by gzipping. And
this ignores that load time might be dominated by building not
particularly portable structures (e.g., indexes).
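To make that concrete: once the text is parsed, a store typically builds
several indexes over the triples, and that hashing and allocation often
dominates load time. A minimal sketch of one common pattern (not any
particular store's layout):

from collections import defaultdict

def build_indexes(triples):
    # Three hash-based indexes (SPO, POS, OSP): every triple is touched
    # three times and many small containers get allocated, which is why
    # this step, rather than the text parsing, often dominates loading.
    spo = defaultdict(lambda: defaultdict(set))
    pos = defaultdict(lambda: defaultdict(set))
    osp = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        spo[s][p].add(o)
        pos[p][o].add(s)
        osp[o][s].add(p)
    return spo, pos, osp

# The triples would come from a parser; this one is made up.
spo, pos, osp = build_indexes([("ex:a", "ex:knows", "ex:b")])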

>   Note that a properly designed format may get the benefits of gzip- 
> like compression without actually incurring nearly as much cost as  
> such a generic, search-based algorithm, while possibly incurring  
> much less decode effort.

But requires considerably more design and implementation effort.
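To make the trade-off concrete: such a format usually amounts to
something like a term dictionary plus fixed-width triple IDs, which gets
much of the size win without gzip's generic byte matching, but someone
has to specify, implement, and maintain it. A purely illustrative Python
sketch (the layout is invented, not any existing format):

import struct

def encode(triples):
    # Each distinct term is written once; each triple becomes three
    # 32-bit term IDs. A real format also needs versioning, typed
    # literals, escaping, and so on.
    terms, ids = [], {}
    def term_id(term):
        if term not in ids:
            ids[term] = len(terms)
            terms.append(term)
        return ids[term]
    triple_ids = [(term_id(s), term_id(p), term_id(o)) for s, p, o in triples]
    out = bytearray(struct.pack("<I", len(terms)))
    for t in terms:
        raw = t.encode("utf-8")
        out += struct.pack("<I", len(raw)) + raw
    out += struct.pack("<I", len(triple_ids))
    for s, p, o in triple_ids:
        out += struct.pack("<III", s, p, o)
    return bytes(out)

blob = encode([("ex:a", "ex:knows", "ex:b"), ("ex:b", "ex:knows", "ex:a")])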

> Sandro Hawke wrote:
I actually wrote most of the text here :)
>>>> For large numbers of triples, in my limited experience, the  
>>>> things that affect RDF load speed
>>> Ooo, I got a bit side tracked by the parsing bit.
>>>>
>>>> are: The speed of your disk. The size of your memory. Building  
>>>> indexes. Duplicate suppression (triple, node, whatever). BNode  
>>>> handling. IRI and datatype checks (if you do them). Parsing. Now  
>>>> parsing is a factor, but it's fairly minor compared with the  
>>>> basic business of storing the triples.
>>> Indeed.
> Storing them in memory is not nearly as expensive as parsing or  
> serialization.

This is counter to voiced experience. Sometimes it is; sometimes it  
isn't.

>   Both of those steps are expensive and adding gzip only increases  
> the expense.
[snip]

Clearly false. I mean, c'mon. It's just the case that there are
situations where compressing increases raw parsing/serialization
performance. If you compress a file by 90% and decompress it as part of
the parsing process, so that your I/O costs shrink by 90%, it's clear
that you're going to do better. (And the more you can fit in, e.g., the
L2 cache, the better as well.)

(A quick google found:
	http://www.lst.inf.ethz.ch/research/publications/publications/USENIX_2005/USENIX_2005.pdf
	http://support.citrix.com/article/CTX112407
	http://www.cs.utexas.edu/users/oops/papers.html esp. paper 5)

Now, when it is the case and when it isn't is an important question, and
one that depends on the application. The advantage of gzipping is that
it's easy to get going, and for many cases it is more than likely
sufficient.
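If you want to check which side of the line a particular setup falls on,
the test is cheap to run: time the same naive scan over a plain and a
gzipped copy of the file (file names here are hypothetical). On slow
storage the gzipped run can win; on a fast local disk it may not:

import gzip, time

def scan(open_fn, path):
    # Count non-blank lines; stands in for the I/O + decode part of a load.
    start = time.perf_counter()
    with open_fn(path, "rt", encoding="utf-8") as f:
        n = sum(1 for line in f if line.strip())
    return n, time.perf_counter() - start

print("plain  :", scan(open, "data.nt"))
print("gzipped:", scan(gzip.open, "data.nt.gz"))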

Cheers,
Bijan.
