Re: About computer-optimized RDF format.

On Jul 24, 2008, at 7:21 PM, Stephen Williams wrote:

> Bijan Parsia wrote:
>> On 24 Jul 2008, at 03:08, Stephen D. Williams wrote:
>>
>>> I view RDF and related standards in a number of ways, ranging  
>>> from simple application use to AI.  One key part is that I think  
>>> that RDF is the next logical step past XML in data flexibility  
>>> and clarity of representation.
>>
>> This is by no means obvious. See the lengthy thread starting:
>>     <http://www.w3.org/mid/486A0516.2010702@gmail.com>
> Many in that thread, except you, agree with me.

So?

>   (I waded in 20 messages or so, until it was waist deep in  
> epistemology.)  You make good points, but don't convince me that my  
> comment above is wrong, at least in a wide range of applications.

The burden of proof is on you, not me. This is true inherently (as
you are making a very strong claim) and in particular here (since
I've provided a rather detailed argument and you've offered just a
bald assertion).
[snip]

>> The issue here is that XML fails to abstract the data from the
>> representation as effectively as RDF and RDBMS. In this sense, RDF  
>> and
>> RDBMS are better data representations than XML. "
>>
>>> Especially when creating integration and database mechanisms.   
>>> From a certain point of view, XML seems like a special case of an  
>>> RDF equivalent representation.
>>
>> I don't think there's a meaningful sense in which XML is a special
>> case of RDF. Even if you just consider data structures and claim
>> that XML (qua "trees") is a special case of RDF (qua "graphs"),
>> it's just not very helpful.
> If you consider what a particular fragment of application-oriented  
> XML means, you can define a mapping of the implicit and explicit  
> meaning to equivalent RDF.

Since, in the thread above, someone did an encoding of XML into RDF,
obviously I know that. Perhaps you should consider that fact and
wonder why it might not be convincing. Indeed, if anything it shows
that RDF can, inherently, be *no better* than XML (since you can pop
some XML right in).

(And it's definitely not equivalent unless you do a *hell* of a lot  
of work. For example, XML DOMs have order on elements. How do you  
capture that?)
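
To make that concrete, here's a rough sketch in Python (assuming
rdflib; the ex: vocabulary and the sample XML are invented purely for
illustration) of the kind of extra machinery needed just to preserve
child order, using the rdf:_1, rdf:_2, ... membership properties:

# Rough sketch: map an XML element tree into RDF while keeping
# document order via rdf:_1, rdf:_2, ... membership properties.
# The ex: vocabulary is made up for illustration only.
import xml.etree.ElementTree as ET
from rdflib import Graph, BNode, Literal, Namespace, URIRef

EX = Namespace("http://example.org/xml2rdf#")
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def element_to_rdf(element, graph):
    node = BNode()
    graph.add((node, EX.name, Literal(element.tag)))
    if element.text and element.text.strip():
        graph.add((node, EX.text, Literal(element.text.strip())))
    for i, child in enumerate(element, start=1):
        child_node = element_to_rdf(child, graph)
        # rdf:_1, rdf:_2, ... record the order a DOM gives you for free
        graph.add((node, URIRef(f"{RDF_NS}_{i}"), child_node))
    return node

g = Graph()
element_to_rdf(ET.fromstring("<doc><a>one</a><b>two</b></doc>"), g)

And that only handles element order; attributes, mixed content,
namespaces, and so on all need similar treatment.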
[Too many scare quotes]

[snip]
>>> (Just kidding.  Maybe better: Have you noticed how inefficient it  
>>> is?)
>>
>> I think you conflate implementation with the problem.
> You're not clear here.

Sure I am.

>   You seem to be implying, but haven't proven, that the  
> inefficiency is only due to implementation errors and not to the  
> algorithm / architecture that is required for any implementation.

If you take a gzipped ntriples file, unzip it to disk, and then load
the ntriples file, you will do worse than just parsing the ntriples
file directly. Duh. But that's clearly not what's being proposed.

Oh, and someone in the thread did try it:
	<http://www.w3.org/mid/C0DBC26F-F7F8-4356-8C74-15C2D46E03E5@twinql.com>

Now, in that case, the gzipping is not faster. If they were being
clever about things (i.e., buffering the data in the right ways),
then that would be proof that it's no better. Absent more details,
it's hard to say.

But it does show that it is, at least, not a killer.
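
The comparison is cheap to repeat, for anyone who wants to. A rough
Python sketch (assuming rdflib; "data.nt" and "data.nt.gz" are
placeholder filenames) that streams the gzipped case through the
decompressor with no temporary file:

# Rough timing sketch: load the same data as plain N-Triples and as
# gzipped N-Triples streamed through gzip (no intermediate file).
import gzip
import time
from rdflib import Graph

def timed_load(open_file):
    g = Graph()
    start = time.time()
    with open_file() as f:
        g.parse(f, format="nt")
    return len(g), time.time() - start

plain = timed_load(lambda: open("data.nt", "rb"))
zipped = timed_load(lambda: gzip.open("data.nt.gz", "rb"))
print("plain:   %d triples in %.2fs" % plain)
print("gzipped: %d triples in %.2fs" % zipped)

Whether the gzipped load wins depends on disk speed versus CPU, which
is exactly the question.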

>>> Compare loading 1MB in 128K chunks as binary data with loading  
>>> 1MB of RDF data, or even 1MB of gzipped RDF data.  What's the  
>>> multiple?
>>
>> I feel very confident that I can trivially make this come out  
>> however you like. For example, if the binary data is something  
>> encrypted and compressed due to a very complex algorithm and  
>> requires regenerating very large and complex structures it'll be  
>> crushed by parsing RDF to a list of triples.
> And I can prove that you can load an arbitrarily large blob of  
> binary data in an architecturally neutral format,

Like, oh, a database.

> then retrieve and update only fields you are interested in with  
> little or no parsing and an optional index.  Where "n" is  
> application operations and "m" is parse steps: Besides data  
> movement, order O(n).  While with standard-use XML or RDF, you will  
> have to parse the whole document, with zillions of memory  
> allocations, copies, data binding steps and then have to reverse  
> the process to write it back out.  Order O(n*m).  In that sense, at  
> least, the latter is worst case.

I think I made it clear earlier that in a particular application, a
custom format can be much better. I believe I referenced databases,
SWI-Prolog, and Smalltalk images. I'm very skeptical that there will
be any push for this as a generic interchange format, which I presume
is what we are talking about.

[snip]
>>> You may argue that it's a whole different thing.  I would argue  
>>> that A) that's not necessarily true and B) loading the binary  
>>> data is the theoretical maximum to determine efficiency rating
>>
>> It would be nice if you argued this point instead of asserting it.
>> Preferably with empirical data. And, of course, we don't typically
>> need a maximally efficient format (unless we also throw in human
>> effort as a component).
> The EXI work has good empirical data for a certain range for the  
> XML case.  I'm working to provide more for that and for the ERI  
> analogue.  Many things "need" to be maximally efficient, they just  
> don't get it.  As EXI has shown, you can get to or essentially to  
> hand-optimized data structure efficiency with the right approach.   
> And EXI only goes so far, there is a potential layer that can do  
> even better.

Pointers? That sounds interesting.

It's still unclear that it's worth the effort here. I don't know if
Olivier has a specific application in mind or is just generally curious.

>>> and C) parsing and serializing data as it is currently done for  
>>> both XML and RDF is worst case in some important ways.
>> Like what? Data is better than vague claims.
> See above.

It would be helpful to me to have a bit more to go on than the "EXI
work". Hmm. Is this what you mean:
	http://www.w3.org/TR/2007/WD-exi-measurements-20070725/

(That seems respectable and interesting. I'm surprised that given:
	http://www.w3.org/TR/2007/WD-exi-measurements-20070725/#methodology-fidelity
you would write the stuff about RDF you wrote above. Interesting.)

It'll take me a while to assimilate that.

[snip]
>>> In my early binary/efficient XML work, I imagined a number of  
>>> ways to make XML faster while staying text-based.  There are too  
>>> many good ways to improve it without that constraint, as we've  
>>> shown in W3C XBC and EXI working groups.  Text-formatted data,  
>>> outside of a few special cases, is much slower to process than  
>>> some good alternatives.
>>
>> Again, we're not being especially grounded here. At the moment, I  
>> don't think that RDF loading from text formats is a key issue  
>> (although I have been bit by multi-day loads :( But that was  
>> really a matter of a bad implementation.)
> Many years ago, I replaced an assembly bubble sort with a Quicksort  
> that I wrote in MBasic in DOS 3.3.  My interpreted implementation  
> ran many times faster, obviously.  At some point, it has to be  
> admitted that only implementations that cheat in some way are going  
> to be as fast as a better way of doing something.  If you are going  
> to cheat, at least think it through, do it in the best area of the  
> architecture, and solve the problem generally and reusably.

Let me rephrase things: If Olivier is facing a special problem in  
loading stuff at the moment, and is loading from RDF/XML files, and  
can't do a preload into e.g., a database and be done with it, then  
the first move is to try n-triples if only because that's very little  
work. The next move (imho) is to try some sort of compression (which,  
to be interesting, has to have some intervention in the parser).  
That's more work, but much less than what is represented by EXI.

So if he has an application problem right now, ntriples+gzip seems a  
sensible way to go. If he is interested in studying things, then some  
comparable effort to EXI is needed.
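
Concretely, the "very little work" version is roughly this one-off
conversion (a sketch in Python, assuming rdflib; filenames are
placeholders): pay for the RDF/XML parse once, then do all later
loads from gzipped n-triples.

# One-off conversion: parse the (slow) RDF/XML once, write gzipped
# N-Triples, and reload from that from then on. Sketch only.
import gzip
from rdflib import Graph

g = Graph()
g.parse("data.rdf", format="xml")    # the expensive parse, done once

nt = g.serialize(format="nt")        # str in recent rdflib, bytes in older
if isinstance(nt, str):
    nt = nt.encode("utf-8")
with gzip.open("data.nt.gz", "wb") as out:
    out.write(nt)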

>>>   One big point however with having "efficient XML" and  
>>> "efficient RDF" is the ability to express the same data in text  
>>> or binary form, including the ability to losslessly "round  
>>> trip".  Some more-purists want to "defend" XML by insisting that  
>>> any other encoding is "not XML" and would pollute / confuse the  
>>> market too much, detracting from the success of XML.  Some of us  
>>> think that having an encoding that is more widely usable in more  
>>> application situations while also improving many existing  
>>> application uses and being equivalent to the text-based standard  
>>> only improves the value of XML.  I feel the same would hold true,  
>>> more so, for RDF.
>>
>> Without specifics, this isn't at all clear.
> There are plenty of specifics for the EXI case over at:
> http://www.w3.org/XML/EXI/  (And be sure to dig into the earlier
> XBC work for a deeply considered analysis of the whole problem.)

Thanks for the pointers!

>>> Which is not to say, by me anyway, that new insights to desirable  
>>> features might not come up in the process that weren't apparent  
>>> in a text-only context.
>>>>> So I came to another question:
>>>>> Is there a computer-optimized format for RDF?
>>>>> Something that would make it load much faster.
>>>>
>>>> For small numbers of triples you may be right, but (as Bijan  
>>>> says) gzipped n-triples are probably adequate.
>>> Gzipping, as with XML, only increases the CPU / memory required.
>>
>> Perhaps CPU, but not necessarily memory. And this isn't typically  
>> CPU bound.
> We have fast CPUs now; however, this will increasingly be a
> problem until we have a better architecture.
> If you feed data to a gzip layer, or pull from it, without an  
> intermediate buffer, then you are not necessarily using much more  
> memory, true.

Even with a buffer it can be better (though you still have to be
careful). Imagine 50% compression (unrealistically low). Now, give
yourself a 1 MB buffer (heck, that's still L2-sized). Now it's still
about half the disk access. Things are going to have to suck hard
before that's slower. (Though, of course, it may not be optimal.)
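
The back-of-envelope version (all numbers below are illustrative
assumptions, not measurements):

# Illustrative arithmetic only: halve the bytes read from disk and the
# decompression CPU has a lot of saved I/O time to hide behind.
file_mb = 1000            # uncompressed size
ratio = 0.5               # 50% compression, deliberately pessimistic
disk_mb_s = 100           # assumed sequential read throughput
gunzip_mb_s = 250         # assumed decompression throughput

plain_io = file_mb / disk_mb_s             # 10.0 s of reading
gz_io = (file_mb * ratio) / disk_mb_s      #  5.0 s of reading
gz_cpu = file_mb / gunzip_mb_s             #  4.0 s of decompressing

print("plain:   %.1f s I/O" % plain_io)
print("gzipped: %.1f s I/O + %.1f s CPU" % (gz_io, gz_cpu))

Even done serially that's roughly a wash, and in practice the
decompression overlaps the reads.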
[snip]
>> I'd be surprised, very surprised if CPU or memory *bandwidth* was  
>> at all an issue or an issue that wasn't greatly improved by  
>> gzipping. And this ignores that time might be dominated by  
>> building not particularly portable structures (e.g., indexes).
> There's no reason that indexes or anything else in a data format  
> can't be portable.

Of course. But they *aren't* right now. Thus you need a lot more
effort to get agreement and deployment.

>   Many formats are already byte-swapped, for instance, and many
> more are bit-packed, full of not-byte-sized integers.
>>>   Note that a properly designed format may get the benefits of  
>>> gzip-like compression without actually incurring nearly as much  
>>> cost as such a generic, search-based algorithm, while possibly  
>>> incurring much less decode effort.
>>
>> But requires considerably more design and implementation effort.
> Absolutely.  But less work at the application level to try to speed  
> things up potentially.
>>> Sandro Hawke wrote:
>> I actually wrote most of the text here :)
> Sorry, just worked with the message I replied to I think.
>>>>>> For large numbers of triples, in my limited experience, the  
>>>>>> things that affect RDF load speed
>>>>> Ooo, I got a bit side tracked by the parsing bit.
>>>>>>
>>>>>> are: The speed of your disk. The size of your memory. Building  
>>>>>> indexes. Duplicate suppression (triple, node, whatever). BNode  
>>>>>> handling. IRI and datatype checks (if you do them). Parsing.  
>>>>>> Now parsing is a factor, but it's fairly minor compared with  
>>>>>> the basic business of storing the triples.
>>>>> Indeed.
>>> Storing them in memory is not nearly as expensive as parsing or  
>>> serialization.
>>
>> This is counter to voiced experience. Sometimes it is; sometimes  
>> it isn't.
> Can be...  Certainly expensive data binding can eat a lot.
>>>   Both of those steps are expensive and adding gzip only  
>>> increases the expense.
>> [snip]
>>
>> Clearly false. I mean, c'mon. It's just the case that there are  
>> situations that compressing increases raw parsing/serialization  
>> performance. If you compress a file 90% and decompress it as part  
>> of the parsing process so that your I/O costs shrink by 90%, it's  
>> clear that you're going to do better. (And the more you can get in  
>> e.g., L2 the better as well.)
> It decreases the data movement cost but not the parsing/ 
> serialization itself.

Which is why I added n-triples to the mix. N-Triples is a
line-oriented, trivial-to-parse format.
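
By trivial I mean something on the order of this (a deliberately
naive sketch; a real parser also has to handle escapes, language
tags, datatypes, and so on, and "data.nt" is a placeholder):

# Naive illustration of why N-Triples is cheap: one triple per line,
# subject and predicate are whitespace-delimited, line ends with " .".
# Not a conforming parser; illustration only.
def naive_ntriples(lines):
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        subject, predicate, rest = line.split(None, 2)
        yield subject, predicate, rest.rstrip(" .")  # drop trailing " ."

triples = list(naive_ntriples(open("data.nt", encoding="utf-8")))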

> Clearly it is true that adding another CPU intensive task without  
> changing the existing ones increases CPU cost.  That is different  
> from decreasing latency by requiring less data to be read or  
> written, and that may even decrease a significant CPU cost if I/O  
> is expensive for some reason.  That total decreased latency is  
> offset by the increased CPU load, and may make things worse if I/O  
> isn't a significant issue, such as memory to memory processing.   
> The latter is and/or is likely to become common in situations where  
> many modules are communicating in the most standard and loosely  
> coupled way.  Now, they either have to exchange text XML/RDF or  
> some not-standard memory format.

Sure, but I generally understand "load" to mean something like "I've
got a big-ass file on disk, perhaps one that someone sent me, and I
want to process/query it, so I need to load it into my triple store".
Other scenarios will have different characteristics.

>> (A quick google found:
>>     http://www.lst.inf.ethz.ch/research/publications/publications/USENIX_2005/USENIX_2005.pdf
> Refers to compressing pages at the OS level to avoid paging.  Not a  
> solution here.
>>     http://support.citrix.com/article/CTX112407
> Refers to compressing session stream with small buffers (memory  
> only) vs. large buffers (disk cache of past cases), not applicable.
>>     http://www.cs.utexas.edu/users/oops/papers.html esp. paper 5)
> Refers to compressing disk cache blocks, also not applicable.

I think you miss my point with these references. I was trying to show  
that compression can be a performance enhancer in relevantly similar  
situations.

>> Now, when it's the case and when it isn't is an important question  
>> and dependent on the application. The advantage of gzipping is  
>> that it's easy to get going. For many cases it more than likely is  
>> sufficient.
> Until there is a better alternative.  There's nothing wrong with  
> using gzip compression, except when there is a more direct  
> alternative available.

Ok, now we're back to the perspective I was coming from.

>   I'm working on it.
>
> In other words, it is only wrong in the sense that it is not the  
> most overall optimal way to solve the entire data representation /  
> exchange / binding problem.

I think we agree there.

Well, I certainly got an interesting read out of this, so I'm happy :)

Cheers,
Bijan.

Received on Thursday, 24 July 2008 21:03:00 UTC