Re: About computer-optimized RDF format.

Bijan Parsia wrote:
> On 24 Jul 2008, at 03:08, Stephen D. Williams wrote:
>
>> I view RDF and related standards in a number of ways, ranging from 
>> simple application use to AI.  One key part is that I think that RDF 
>> is the next logical step past XML in data flexibility and clarity of 
>> representation.
>
> This is by no means obvious. See the lengthy thread starting:
>     <http://www.w3.org/mid/486A0516.2010702@gmail.com>
Many in that thread, other than you, agree with me.  (I waded in 20 
messages or so, until it was waist-deep in epistemology.)  You make good 
points, but they don't convince me that my comment above is wrong, at 
least across a wide range of applications.

> From: Olivier Rossel <olivier.rossel@gmail.com>
> Date: Wed, 2 Jul 2008 09:43:39 +0200
> Message-ID: <516816970807020043o36c56976p95872b8ed9ae3831@mail.gmail.com>
> "> Any XML instance can be considered a compact, early-bound 
> serialization of
> > an infoset RDF graph.
>
> +1.
> XML is very powerful when it comes to presenting data (because it
> details how data imbricate with each other). But XML is very unnatural
> when it comes to crawling the data in an unexpected and ever-changing
> manner (because XML tree structure is chosen once for all, usually
> before even knowing which queries will be applied upon itself)."
> From: <tim.glover@bt.com>
> Date: Wed, 2 Jul 2008 11:25:02 +0100
> Message-ID: <AEF15555D64C494CA393778177A3A1710452B7CD@E03MVC1-UKBR.domain1.systemhost.net>
>
> "With more complicated data, the possible XML representations vary in
> different ways, and increase exponentially w.r.t. the number of atoms of
> information.  To extract the data from the XML we have to know the
> detailed representation chosen. Saying we can UNION different queries
> misses the point - we still have to write 3 queries. Saying we can use
> transformations misses the point - we still have to write
> transformations.
>
> The issue here is that XML fails to abstract the data from the
> representation as effectively as RDF and RDBMS. In this sense, RDF and
> RDBMS are better data representations than XML. "
>
>> Especially when creating integration and database mechanisms.  From a 
>> certain point of view, XML seems like a special case of an RDF 
>> equivalent representation.
>
> I don't think there's a meaningful sense which XML is a special case 
> of RDF. Even if you just consider data structures and claim that XML 
> (qua "trees") is a special case of RDF (qua "graphs"), it's just not 
> very helpful
If you consider what a particular fragment of application-oriented XML 
means, you can define a mapping from its implicit and explicit meaning 
to equivalent RDF.  When talking about XML vs. other data 
representations, including schema-informed binary/efficient XML, I 
described levels of information "externalization" of a particular 
instance relative to the complete describing set of "data values, 
structure, naming, identity, types".  An XML instance of a set of "idea 
values" could be said to be equivalent to an RDF representation of those 
"idea values" minus some externalized information, information that is 
made explicit only in the code (or the person) that operates correctly 
on the XML instance.
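
To make that concrete, here is a toy sketch of such a mapping (the XML 
fragment, the example.org namespace, and the property names are all 
invented for illustration):

    import xml.etree.ElementTree as ET

    # Application-oriented XML whose meaning is partly implicit: the nesting and
    # element names stand in for relationships that the consuming code "knows".
    doc = ET.fromstring(
        "<person id='p1'><name>Ada</name><employer>Initech</employer></person>")

    # The same information as explicit triples: identity, naming and typing that
    # were external to the XML instance are pulled into the data itself.
    BASE = "http://example.org/"          # invented namespace
    subject = BASE + doc.get("id")
    triples = [(subject, BASE + child.tag, child.text) for child in doc]
    triples.append((subject, "rdf:type", BASE + doc.tag))

    for t in triples:
        print(t)

The interesting part is not the translation itself but how much of the 
meaning had to be supplied from outside the XML instance to produce it.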

In a similar way, the RDBMS model of data is "missing" an explicit 
representation of the relationships between tables, and it has only a 
weak property relationship between fields / columns and a tuple.  The 
potential for a meaningful join is an implicit relationship that RDF 
represents explicitly.
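
A tiny sketch of that point (the tables, columns, and ex: names are made 
up): the foreign-key relationship lives only in the schema plus the 
application's knowledge, while the RDF version states it as data:

    # RDBMS-style rows: the fact that orders.customer_id refers to customers.id
    # is not in the data; it is in the schema or in the application's head.
    customers = [{"id": 7, "name": "Ada"}]
    orders = [{"id": 42, "customer_id": 7, "total": 125000}]

    # The join has to be reconstructed by code that knows the convention.
    joined = [(o, c) for o in orders for c in customers
              if o["customer_id"] == c["id"]]

    # RDF-style triples: the relationship between the two entities is itself a
    # statement, so a generic processor can follow it without schema knowledge.
    triples = [
        ("ex:customer/7", "ex:name", "Ada"),
        ("ex:order/42", "ex:customer", "ex:customer/7"),   # the join, made explicit
        ("ex:order/42", "ex:total", 125000),
    ]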

And just now:
> From: "Booth, David (HP Software - Boston)" <dbooth@hp.com>
>   
> Date: Thu, 24 Jul 2008 16:06:55 +0000
> Subject: RE: About computer-optimized RDF format.
> "On 24 Jul 2008, at 03:08, Stephen D. Williams wrote:
> >
>   
>> > > [ . . . ] I think that
>> > > RDF is the next logical step past XML in data flexibility and
>> > > clarity of representation.
>> > > Especially when creating integration and database mechanisms.  From
>> > > a certain point of view, XML seems like a special case of an RDF
>> > > equivalent representation.
>>     
>
> I agree.  In particular, GRDDL permits you to view arbitrary XML as a custom serialization of RDF.  This is a key concept that can enable looser coupling in SOA:
> http://dbooth.org/2007/rdf-and-soa/rdf-and-soa-paper.htm
> and in general permits one to view RDF as a lingua franca for information exchange, even if the serialization doesn't look like RDF:
> http://dbooth.org/2008/stc/slides.ppt"
>   
Very interesting BTW.
>
>>   Although even more inefficient currently.
>>
>> Damian Steer wrote:
>>> On 23 Jul 2008, at 10:07, Olivier Rossel wrote:
>>>> I was wondering how to improve the loading time of RDF files in
>>>> semantic web frameworks.
>>>> And then came a question: is RDF efficient to load?
>>>> The obvious answer is no.
>>> I'm not sure that is obvious, but go on...
>> Have you done it?  ;-)
>
> I'm very sure he has. As the rest of his email showed.
>
>> (Just kidding.  Maybe better: Have you noticed how inefficient it is?)
>
> I think you conflate implementation with the problem.
You're not being clear here.  You seem to be implying, but have not 
shown, that the inefficiency is due only to implementation errors and 
not to the algorithm / architecture that any implementation would 
require.
>> Compare loading 1MB in 128K chunks as binary data with loading 1MB of 
>> RDF data, or even 1MB of gzipped RDF data.  What's the multiple?
>
> I feel very confident that I can trivially make this come out however 
> you like. For example, If the binary data is something encrypted and 
> compressed due to a very complex algorithm and requires regenerating 
> very large and complex structures it'll be crushed by parsing RDF to a 
> list of triples.
And I can prove that you can load an arbitrarily large blob of binary 
data in an architecturally neutral format, then retrieve and update only 
the fields you are interested in, with little or no parsing and an 
optional index.  Where "n" is the number of application operations and 
"m" is the number of parse steps: aside from data movement, that is 
O(n).  With standard-use XML or RDF, you have to parse the whole 
document, with zillions of memory allocations, copies, and data binding 
steps, and then reverse the process to write it back out: O(n*m).  In 
that sense, at least, the latter is worst case.  A typical XML DOM 
library uses 5 times as much memory as the size of the XML data read 
in.  Using SAX et al. in the typical application transaction situation 
just moves the rest of the work to a different, application-specific 
layer, so it doesn't solve this in a general way.
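
As a minimal sketch of the access pattern I have in mind (the fixed 
record layout and offsets here are hypothetical, not any particular 
format):

    import struct

    # Hypothetical fixed-layout record: 8-byte id, 4-byte quantity, 4-byte price,
    # all big-endian so the layout is architecture neutral.
    RECORD = struct.Struct(">QII")
    QTY_OFFSET = 8   # byte offset of the quantity field within a record

    def read_quantity(blob, record_index):
        # Touch only the 4 bytes we need; the rest of the blob is never parsed.
        return struct.unpack_from(">I", blob,
                                  record_index * RECORD.size + QTY_OFFSET)[0]

    def update_quantity(blob, record_index, value):
        # In-place update, with no re-serialization of the whole document.
        struct.pack_into(">I", blob,
                         record_index * RECORD.size + QTY_OFFSET, value)

    blob = bytearray(RECORD.pack(1, 10, 999) + RECORD.pack(2, 20, 1999))
    update_quantity(blob, 1, 25)
    assert read_quantity(blob, 1) == 25

Parsing an equivalent XML or RDF document just to change one quantity 
would, by contrast, touch every field on the way in and on the way out.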
>> You may argue that it's a whole different thing.  I would argue that 
>> A) that's not necessarily true and B) loading the binary data is the 
>> theoretical maximum to determine efficiency rating
>
> It would be nice if you argued this point instead of asserting it. 
> Preferably with empirical data. And, of course, we don't typically 
> need a maximally efficient (unless we also through in human effort as 
> a component).
The EXI work has good empirical data over a certain range for the XML 
case.  I'm working to provide more for that and for the ERI analogue.  
Many things "need" to be maximally efficient; they just don't get it.  
As EXI has shown, you can get to, or essentially to, hand-optimized 
data-structure efficiency with the right approach.  And EXI only goes so 
far; there is a potential layer that can do even better.
>> and C) parsing and serializing data as it is currently done for both 
>> XML and RDF is worst case in some important ways.
> Like what? Data is better than vague claims.
See above.  Although this applies more generally, here is a very clear 
example case: consider a multi-tier processing pipeline of servers with 
particular functions.  Imagine that one performs sales (or other) tax 
processing of potentially large orders.  In this environment, 
milliseconds might be worth large amounts of money (as they are for Wall 
Street firms).  The sales order may have many fields, but this 
particular server only needs to examine two or three and write one.  
Traditional XML or RDF processing is worst case here, as every field 
must be processed thoroughly.  Clearly, there are data format 
architectures that can meet all or most constraints and do not have this 
inefficiency.
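
A rough illustration of the alternative (the offset index, field names, 
layout, and tax rate are all made up for the example):

    import struct

    # Hypothetical order encoding: fixed-width fields laid out back to back, plus
    # a small index of field name -> (offset, length) carried with the message.
    def read_field(buf, index, name):
        off, length = index[name]
        return bytes(buf[off:off + length])

    def write_field(buf, index, name, value):
        off, length = index[name]
        buf[off:off + length] = value

    index = {"subtotal": (0, 8), "ship_to_state": (8, 2), "tax": (10, 8)}
    order = bytearray(struct.pack(">Q", 125000) + b"VA" + struct.pack(">Q", 0))

    # The tax server reads two fields and writes one, ignoring everything else.
    subtotal = struct.unpack(">Q", read_field(order, index, "subtotal"))[0]
    state = read_field(order, index, "ship_to_state").decode("ascii")
    tax = subtotal * 5 // 100 if state == "VA" else 0   # made-up rate
    write_field(order, index, "tax", struct.pack(">Q", tax))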

And yes, this all needs to be proved more definitively with working, 
available code.
>
>>>> Making it readable for humans makes it definitely slower to load in 
>>>> programs.
>>> And I'm not convinced about that, either.
>> In my early binary/efficient XML work, I imagined a number of ways to 
>> make XML faster while staying text-based.  There are too many good 
>> ways to improve it without that constraint, as we've shown in W3C XBC 
>> and EXI working groups.  Text-formatted data, outside of a few 
>> special cases, is much slower to process than some good alternatives.
>
> Again, we're not being especially grounded here. At the moment, I 
> don't think that RDF loading from text formats is a key issue 
> (although I have been bit by multi-day loads :( But that was really a 
> matter of a bad implementation.)
Many years ago, I replaced an assembly bubble sort with a Quicksort that 
I wrote in MBasic in DOS 3.3.  My interpreted implementation ran many 
times faster, obviously.  At some point, it has to be admitted that only 
implementations that cheat in some way are going to be as fast as a 
better way of doing something.  If you are going to cheat, at least 
think it through, do it in the best area of the architecture, and solve 
the problem generally and reusably.
>>   One big point however with having "efficient XML" and "efficient 
>> RDF" is the ability to express the same data in text or binary form, 
>> including the ability to losslessly "round trip".  Some more-purists 
>> want to "defend" XML by insisting that any other encoding is "not 
>> XML" and would pollute / confuse the market too much, detracting from 
>> the success of XML.  Some of us think that having an encoding that is 
>> more widely usable in more application situations while also 
>> improving many existing application uses and being equivalent to the 
>> text-based standard only improves the value of XML.  I feel the same 
>> would hold true, more so, for RDF.
>
> Without specifics, this isn't at all clear.
There are plenty of specifics for the EXI case over at: 
http://www.w3.org/XML/EXI/  (And be sure to dig into the earlier XBC 
work for a deeply considered analysis of the whole problem.)
>> Which is not to say, by me anyway, that new insights to desirable 
>> features might not come up in the process that weren't apparent in a 
>> text-only context.
>>>> So I came to another question:
>>>> Is there a computer-optimized format for RDF?
>>>> Something that would make it load much faster.
>>>
>>> For small numbers of triples you may be right, but (as Bijan says) 
>>> gzipped n-triples are probably adequate.
>> Gzipping, as with XML, only increases the CPU / memory required.
>
> Perhaps CPU, but not necessarily memory. And this isn't typically CPU 
> bound.
We have fast CPUs now, but this will be an increasing problem until we 
have a better architecture.
If you feed data to a gzip layer, or pull from it, without an 
intermediate buffer, then you are not necessarily using much more 
memory, true.
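
For example, a streaming arrangement along these lines (a minimal 
sketch; the file name is a placeholder) decompresses incrementally and 
never materializes the whole decompressed document:

    import gzip

    def load_triples(path):
        # Stream-decompress an N-Triples file and hand lines to the loader one
        # at a time, so only a small decompression window is held in memory.
        count = 0
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:                  # decompressed incrementally
                line = line.strip()
                if line and not line.startswith("#"):
                    count += 1              # a real loader would parse and index here
        return count

    # count = load_triples("data.nt.gz")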
>>   It helps with size, which also does help a bit with network latency 
>> in some cases.
>
> It helps with disk access. The less IO you use the better off you are, 
> generally (hence tweaking buffers).
Yes, of course, network=disk in terms of being some kind of secondary 
"storage".
>>   Frequently however, bandwidth isn't the issue, CPU and memory 
>> bandwidth is.  Often they both are.
>
> I'd be surprised, very surprised if CPU or memory *bandwidth* was at 
> all an issue or an issue that wasn't greatly improved by gzipping. And 
> this ignores that time might be dominated by building not particularly 
> portable structures (e.g., indexes).
There's no reason that indexes, or anything else in a data format, can't 
be portable.  Many formats are already byte-swapped, for instance, and 
many more are bit-packed, full of integers that are not byte-sized.
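
For instance, nothing stops an index itself from being written in a 
fixed, architecture-neutral byte order (a hypothetical layout, just to 
make the point):

    import struct

    # A {field name -> offset} index serialized with an explicit big-endian
    # layout: a 4-byte entry count, then per entry a 2-byte name length, the
    # UTF-8 name, and a 4-byte offset.  The same bytes mean the same thing on
    # any machine.
    def pack_index(index):
        out = bytearray(struct.pack(">I", len(index)))
        for name, offset in index.items():
            encoded = name.encode("utf-8")
            out += struct.pack(">H", len(encoded)) + encoded + struct.pack(">I", offset)
        return bytes(out)

    def unpack_index(data):
        (count,) = struct.unpack_from(">I", data, 0)
        pos, index = 4, {}
        for _ in range(count):
            (name_len,) = struct.unpack_from(">H", data, pos)
            name = data[pos + 2:pos + 2 + name_len].decode("utf-8")
            (offset,) = struct.unpack_from(">I", data, pos + 2 + name_len)
            index[name] = offset
            pos += 2 + name_len + 4
        return index

    assert unpack_index(pack_index({"subtotal": 0, "tax": 10})) == {"subtotal": 0, "tax": 10}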
>>   Note that a properly designed format may get the benefits of 
>> gzip-like compression without actually incurring nearly as much cost 
>> as such a generic, search-based algorithm, while possibly incurring 
>> much less decode effort.
>
> But requires considerably more design and implementation effort.
Absolutely.  But potentially less work at the application level to speed 
things up.
>> Sandro Hawke wrote:
> I actually wrote most of the text here :)
Sorry, I just worked with the message I replied to, I think.
>>>>> For large numbers of triples, in my limited experience, the things 
>>>>> that affect RDF load speed
>>>> Ooo, I got a bit side tracked by the parsing bit.
>>>>>
>>>>> are: The speed of your disk. The size of your memory. Building 
>>>>> indexes. Duplicate suppression (triple, node, whatever). BNode 
>>>>> handling. IRI and datatype checks (if you do them). Parsing. Now 
>>>>> parsing is a factor, but it's fairly minor compared with the basic 
>>>>> business of storing the triples.
>>>> Indeed.
>> Storing them in memory is not nearly as expensive as parsing or 
>> serialization.
>
> This is counter to voiced experience. Sometimes it is; sometimes it 
> isn't.
Can be...  Certainly expensive data binding can eat a lot.
>>   Both of those steps are expensive and adding gzip only increases 
>> the expense.
> [snip]
>
> Clearly false. I mean, c'mon. It's just the case that there are 
> situations that compressing increases raw parsing/serialization 
> performance. If you compress a file 90% and decompress it as part of 
> the parsing process so that your io cost shrink by 90% it's clear that 
> you're going to do better. (And the more you can get in e.g., L2 the 
> better as well.)
It decreases the data movement cost but not the parsing / serialization 
itself.  Clearly it is true that adding another CPU-intensive task 
without changing the existing ones increases CPU cost.  That is 
different from decreasing latency by requiring less data to be read or 
written, which may even decrease a significant CPU cost if I/O is 
expensive for some reason.  That decreased total latency is offset by 
the increased CPU load, and may make things worse if I/O isn't a 
significant issue, as in memory-to-memory processing.  The latter is, or 
is likely to become, common in situations where many modules communicate 
in the most standard and loosely coupled way.  As it stands, they either 
have to exchange textual XML/RDF or some non-standard memory format.
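
A quick way to see where the time actually goes on a given machine (a 
measurement sketch; the file names are placeholders) is to time the 
plain and gzipped paths separately:

    import gzip, time

    def time_lines(open_fn, path):
        # Time pulling every line through the given opener (plain or gzip).
        start = time.perf_counter()
        n = 0
        with open_fn(path) as f:
            for _ in f:
                n += 1
        return n, time.perf_counter() - start

    # _, plain_s = time_lines(lambda p: open(p, "rt", encoding="utf-8"), "data.nt")
    # _, gzip_s  = time_lines(lambda p: gzip.open(p, "rt", encoding="utf-8"), "data.nt.gz")
    # Whether gzip wins depends on whether I/O or CPU dominates on this box.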
>
> (A quick google found:
>     http://www.lst.inf.ethz.ch/research/publications/publications/USENIX_2005/USENIX_2005.pdf 
>
Refers to compressing pages at the OS level to avoid paging.  Not a 
solution here.
>     http://support.citrix.com/article/CTX112407
Refers to compressing session stream with small buffers (memory only) 
vs. large buffers (disk cache of past cases), not applicable.
>     http://www.cs.utexas.edu/users/oops/papers.html esp. paper 5)
Refers to compressing disk cache blocks, also not applicable.
> Now, when it's the case and when it isn't is an important question and 
> dependent on the application. The advantage of gzipping is that it's 
> easy to get going. For many cases it more than likely is sufficient.
Until there is a better alternative.  There's nothing wrong with using 
gzip compression, except when there is a more direct alternative 
available.  I'm working on it.

In other words, it is only wrong in the sense that it is not the overall 
optimal way to solve the entire data representation / exchange / binding 
problem.

sdw
> Cheers,
> Bijan.

Received on Thursday, 24 July 2008 18:22:33 UTC