Re: About computer-optimized RDF format.

Bijan Parsia wrote:
> On Jul 24, 2008, at 7:21 PM, Stephen Williams wrote:
>
>> Bijan Parsia wrote:
>>> On 24 Jul 2008, at 03:08, Stephen D. Williams wrote:
>>>
>>>> I view RDF and related standards in a number of ways, ranging from 
>>>> simple application use to AI.  One key part is that I think that 
>>>> RDF is the next logical step past XML in data flexibility and 
>>>> clarity of representation.
>>>
>>> This is by no means obvious. See the lengthy thread starting:
>>>     <http://www.w3.org/mid/486A0516.2010702@gmail.com>
>> Many in that thread, except you, agree with me.
>
> So?
>
>>   (I waded in 20 messages or so, until it was waist deep in 
>> epistemology.)  You make good points, but don't convince me that my 
>> comment above is wrong, at least in a wide range of applications.
>
> The burden of proof is on you, not me. This is true inherently (as you 
> are making a very strong claim) and in particular here (since I've 
> provided rather detailed argument and you just got your bald assertion).
The burden of proof is shared, I think.  The existence of a status quo 
doesn't automatically justify itself when discrepancies are pointed 
out.  Practically, though, you're right.  Certainly I, and any 
co-conspirators I may have, need to prove this better.
> [snip]
>
>>> The issue here is that XML fails to abstract the data from the
>>> representation as effectively as RDF and RDBMS. In this sense, RDF and
>>> RDBMS are better data representations than XML. "
>>>
>>>> Especially when creating integration and database mechanisms.  From 
>>>> a certain point of view, XML seems like a special case of an RDF 
>>>> equivalent representation.
>>>
>>> I don't think there's a meaningful sense in which XML is a special 
>>> case of RDF. Even if you just consider data structures and claim that 
>>> XML (qua "trees") is a special case of RDF (qua "graphs"), it's just 
>>> not very helpful.
>> If you consider what a particular fragment of application-oriented 
>> XML means, you can define a mapping of the implicit and explicit 
>> meaning to equivalent RDF.
>
> Since, in the thread above, someone did an encoding of XML into RDF, 
> obviously I know that. Perhaps you should consider that fact and 
> wonder why it might not be convincing. Indeed, in the other case it 
> shows that RDF can, inherently, be *no better* than XML (since you can 
> pop some xml right in).
We're not being clear about which styles of use we're talking about.  
XML can encode anything in RDF and RDF can encode anything in XML, in 
some fashion.  My comments are aimed more at the typical application 
programmer's end of the spectrum: the minimum-complexity expression of 
an application problem.  The programmer must decide A) how 
data/objects/metadata will be represented in memory and application 
code, and B) how that will be represented in any secondary "storage": 
network, file, database, or parameter to a loosely coupled module.  XML 
and RDF, while maximally flexible in some senses, nevertheless color the 
minimum-complexity data representation design for programmers.
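
To make that concrete, here is a toy sketch (all names hypothetical, 
plain Python, no libraries) of the same small record as A) an in-memory 
structure and B) two candidate serializations, one XML-shaped and one 
triple-shaped:

from dataclasses import dataclass

@dataclass
class Person:                       # A) in-memory representation
    id: str
    name: str
    email: str

p = Person("ex:sdw", "Stephen Williams", "mailto:sdw@example.org")

# B1) an application-oriented XML fragment (hand-built string, no library)
xml_form = ('<person id="%s"><name>%s</name><email>%s</email></person>'
            % (p.id, p.name, p.email))

# B2) the same facts as subject/predicate/object triples (N-Triples-like)
triple_form = [(p.id, "foaf:name", p.name),
               (p.id, "foaf:mbox", p.email)]

print(xml_form)
print(triple_form)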
> (And it's definitely not equivalent unless you do a *hell* of a lot of 
> work. For example, XML DOMs have order on elements. How do you capture 
> that?)
I wasn't referring to mapping, which may be difficult in a specific 
case, and seems very difficult to automate for a wide range of cases.  I 
was referring to a programmer using one or the other to solve a 
straightforward data integration problem.

Thinking conceptually in terms of the model of triples (and N-tuples 
more generally, though not relational "rows"/tuples), there are at least 
two methods.  First, you can simply have an attribute (triple) that 
gives the order of each instance.  Second, you can have one or two 
attributes that indicate "next", and optionally "previous", which gives 
a linked list.

Note that I said conceptually.  In reality, there is no reason that a 
sophisticated encoding couldn't put them in an array or list, if that 
were most efficient, or create an equivalent index.  Nor is there any 
reason the API couldn't support something like get(5).
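
A minimal sketch of those two ordering schemes, using plain Python 
tuples as stand-in triples (the property names are made up):

parent = "ex:book"
children = ["ex:chapter1", "ex:chapter2", "ex:chapter3"]

# 1) explicit order: one containment triple per child plus an index triple
by_index = ([(parent, "ex:hasChild", c) for c in children]
            + [(c, "ex:index", i) for i, c in enumerate(children)])

# 2) linked list: a "firstChild" triple plus "next" triples between siblings
by_links = ([(parent, "ex:firstChild", children[0])]
            + [(a, "ex:next", b) for a, b in zip(children, children[1:])])

# Either form lets a store rebuild document order; a clever encoding could
# also keep an array or index internally and expose something like get(5).
print(by_index)
print(by_links)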
> [Too many scare quotes]
>
> [snip]
>>>> (Just kidding.  Maybe better: Have you noticed how inefficient it is?)
>>>
>>> I think you conflate implementation with the problem.
>> You're not clear here.
>
> Sure I am.
>
>>   You seem to be implying, but haven't proven, that the inefficiency 
>> is only due to implementation errors and not to the algorithm / 
>> architecture that is required for any implementation.
>
> If you take a gzipped ntriples file. Unzip it. Then load the ntriples 
> file. You will do worse than parsing the ntriples file. Duh. But 
> that's clearly not what's being proposed.
>
> Oh, and someone in the thread did try it:
>     <http://www.w3.org/mid/C0DBC26F-F7F8-4356-8C74-15C2D46E03E5@twinql.com> 
>
>
> Now, in that case, the gzipping is not faster. If they are being 
> clever (i.e., buffering the data in the right ways) about things then 
> that would be proof that it's no better. Absent more details it's hard 
> to say.
>
> But it does show that it is, at least, not a killer.
>
>>>> Compare loading 1MB in 128K chunks as binary data with loading 1MB 
>>>> of RDF data, or even 1MB of gzipped RDF data.  What's the multiple?
>>>
>>> I feel very confident that I can trivially make this come out 
>>> however you like. For example, If the binary data is something 
>>> encrypted and compressed due to a very complex algorithm and 
>>> requires regenerating very large and complex structures it'll be 
>>> crushed by parsing RDF to a list of triples.
>> And I can prove that you can load an arbitrarily large blob of binary 
>> data in an architecturally neutral format,
>
> Like, oh, a database.
"Database" is a slippery word for varying collections of a large range 
of concepts.  People draw the line between data and database in 
different places all the time.  And other people, perhaps more junior 
programmers, seem to think that certain things (sort? indexing?) should 
only be done via a database (i.e. Oracle, across the wire) rather than 
in memory and code local to a process.  And they think that this will 
improve efficiency.  Still other people, often much more sophisticated, 
have a strong aversion to any feature in a library and data format that 
seems like a "database" function (indexes, update, deltas).

To me, everything is a spectrum.  I can imagine the usefulness and 
efficiency of certain features beyond byte-at-a-time text parsing and 
still not be implying that I expect to have to link with oracleserver.a 
(ha!) or mysqlserver.a (doable).
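
As a trivial illustration of the "in memory and local to the process" 
point (field names made up, nothing database-specific implied):

records = [{"id": 3, "label": "gamma"},
           {"id": 1, "label": "alpha"},
           {"id": 2, "label": "beta"}]

records.sort(key=lambda r: r["id"])      # local sort, no wire round trip
index = {r["id"]: r for r in records}    # local hash index, a plain dict

assert index[2]["label"] == "beta"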
>> then retrieve and update only fields you are interested in with 
>> little or no parsing and an optional index.  Where "n" is application 
>> operations and "m" is parse steps: Besides data movement, order 
>> O(n).  While with standard-use XML or RDF, you will have to parse the 
>> whole document, with zillions of memory allocations, copies, data 
>> binding steps and then have to reverse the process to write it back 
>> out.  Order O(n*m).  In that sense, at least, the latter is worst case.
>
> I think I made it clear earlier that in a particular application, a 
> custom format can be much better. I believe I referenced databases, 
> swi prolog, and Smalltalk images. I'm very skeptical that there will 
> be any push for this as a generic interchange format. Which I presume 
> is what we are talking about.
Those, and many others, are instructive but not generally useful.
I think that it IS possible to have a very generic solution that has a 
significantly useful footprint.
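
To show the kind of access pattern I have in mind for the O(n) case I 
described above, here is a toy sketch under my own assumptions, not a 
description of any real format: fixed-width records in a byte buffer, so 
one field can be read or updated in place without parsing the whole blob.

import struct

REC = struct.Struct("<I16s")             # 4-byte id + 16-byte name per record

def make_blob(rows):
    return b"".join(REC.pack(i, name.ljust(16, b"\0")) for i, name in rows)

def read_name(blob, n):
    _, name = REC.unpack_from(blob, n * REC.size)
    return name.rstrip(b"\0")

def update_name(blob, n, new_name):
    buf = bytearray(blob)
    rec_id, _ = REC.unpack_from(buf, n * REC.size)
    REC.pack_into(buf, n * REC.size, rec_id, new_name.ljust(16, b"\0"))
    return bytes(buf)

blob = make_blob([(10, b"alpha"), (11, b"beta"), (12, b"gamma")])
assert read_name(blob, 1) == b"beta"
blob = update_name(blob, 1, b"BETA")     # touches one record, not the whole file
assert read_name(blob, 1) == b"BETA"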
> [snip]
>>>> You may argue that it's a whole different thing.  I would argue 
>>>> that A) that's not necessarily true and B) loading the binary data 
>>>> is the theoretical maximum to determine efficiency rating
>>>
>>> It would be nice if you argued this point instead of asserting it. 
>>> Preferably with empirical data. And, of course, we don't typically 
>>> need a maximally efficient one (unless we also throw in human effort 
>>> as a component).
>> The EXI work has good empirical data for a certain range for the XML 
>> case.  I'm working to provide more for that and for the ERI 
>> analogue.  Many things "need" to be maximally efficient, they just 
>> don't get it.  As EXI has shown, you can get to or essentially to 
>> hand-optimized data structure efficiency with the right approach.  
>> And EXI only goes so far, there is a potential layer that can do even 
>> better.
>
> Pointers? That sounds interesting.
>
> It's still unclear that it's worth the effort here. I don't know if Olivier 
> has a specific application in mind or is just generally curious.
>
>>>> and C) parsing and serializing data as it is currently done for 
>>>> both XML and RDF is worst case in some important ways.
>>> Like what? Data is better than vague claims.
>> See above.
>
> It would be helpful to me to have a bit more to go on than the "EXI 
> work". Hmm. Is this what you mean:
>     http://www.w3.org/TR/2007/WD-exi-measurements-20070725/
There is a lot in there.  And a lot more in the XBC Properties and 
Measurements Methodology docs about useful properties and how to measure 
them.
http://www.w3.org/XML/Binary/

> (That seems respectable and interesting. I'm surprised that given:
>     http://www.w3.org/TR/2007/WD-exi-measurements-20070725/#methodology-fidelity 
>
> you would write the stuff about RDF you wrote above. Interesting.)
Which part surprises you?
> It'll take me a while to assimilate that.
>
> [snip]
>>>> In my early binary/efficient XML work, I imagined a number of ways 
>>>> to make XML faster while staying text-based.  There are too many 
>>>> good ways to improve it without that constraint, as we've shown in 
>>>> W3C XBC and EXI working groups.  Text-formatted data, outside of a 
>>>> few special cases, is much slower to process than some good 
>>>> alternatives.
>>>
>>> Again, we're not being especially grounded here. At the moment, I 
>>> don't think that RDF loading from text formats is a key issue 
>>> (although I have been bit by multi-day loads :( But that was really 
>>> a matter of a bad implementation.)
>> Many years ago, I replaced an assembly bubble sort with a Quicksort 
>> that I wrote in MBasic in DOS 3.3.  My interpreted implementation ran 
>> many times faster, obviously.  At some point, it has to be admitted 
>> that only implementations that cheat in some way are going to be as 
>> fast as a better way of doing something.  If you are going to cheat, 
>> at least think it through, do it in the best area of the 
>> architecture, and solve the problem generally and reusably.
>
> Let me rephrase things: If Olivier is facing a special problem in 
> loading stuff at the moment, and is loading from RDF/XML files, and 
> can't do a preload into e.g., a database and be done with it, then the 
> first move is to try n-triples if only because that's very little 
> work. The next move (imho) is to try some sort of compression (which, 
> to be interesting, has to have some intervention in the parser). 
> That's more work, but much less than what is represented by EXI.
>
> So if he has an application problem right now, ntriples+gzip seems a 
> sensible way to go. If he is interested in studying things, then some 
> comparable effort to EXI is needed.
>
>>>>   One big point however with having "efficient XML" and "efficient 
>>>> RDF" is the ability to express the same data in text or binary 
>>>> form, including the ability to losslessly "round trip".  Some 
>>>> more-purists want to "defend" XML by insisting that any other 
>>>> encoding is "not XML" and would pollute / confuse the market too 
>>>> much, detracting from the success of XML.  Some of us think that 
>>>> having an encoding that is more widely usable in more application 
>>>> situations while also improving many existing application uses and 
>>>> being equivalent to the text-based standard only improves the value 
>>>> of XML.  I feel the same would hold true, more so, for RDF.
>>>
>>> Without specifics, this isn't at all clear.
>> There are plenty of specifics for the EXI case over at: 
>> http://www.w3.org/XML/EXI/  (And be sure to dig into the earlier XBC 
>> work for a deeply considered analysis of the whole problem.)
>
> Thanks for the pointers!
>
>>>> Which is not to say, by me anyway, that new insights to desirable 
>>>> features might not come up in the process that weren't apparent in 
>>>> a text-only context.
>>>>>> So I came to another question:
>>>>>> Is there a computer-optimized format for RDF?
>>>>>> Something that would make it load much faster.
>>>>>
>>>>> For small numbers of triples you may be right, but (as Bijan says) 
>>>>> gzipped n-triples are probably adequate.
>>>> Gzipping, as with XML, only increases the CPU / memory required.
>>>
>>> Perhaps CPU, but not necessarily memory. And this isn't typically 
>>> CPU bound.
>> We have fast CPUs now, however this will always increasingly be a 
>> problem until we have a better architecture.
>> If you feed data to a gzip layer, or pull from it, without an 
>> intermediate buffer, then you are not necessarily using much more 
>> memory, true.
>
> Even with a buffer it can be better (though you still have to be 
> careful). Imagine 50% compression (unrealistically low). Now, give 
> yourself a 1 meg buffer (heck, that's still l2). Now It's still about 
> half the disk access. Things are going to have to suck hard before 
> that's slower. (Though, of course, it may not be optimal.)
> [snip]
>>> I'd be surprised, very surprised if CPU or memory *bandwidth* was at 
>>> all an issue or an issue that wasn't greatly improved by gzipping. 
>>> And this ignores that time might be dominated by building not 
>>> particularly portable structures (e.g., indexes).
>> There's no reason that indexes or anything else in a data format 
>> can't be portable.
>
> Of course. But they *aren't* right now. Thus you need a lot more 
> effort to get agreement and deployment.
Sure.  It is easy to spend a lot of time even getting something useful 
for yourself, not to mention generalizing it properly.  That is why it 
is important to do this only if you can get the critical mass for 
success and all of the right ideas into one design.  Few people wanted 
to talk about efficient XML processing and size back when it was bugging 
me in 1998.
>>   Many formats are already byte swapped, for instance, and many more 
>> are bit-packed, full of not-byte-sized integers.
>>>>   Note that a properly designed format may get the benefits of 
>>>> gzip-like compression without actually incurring nearly as much 
>>>> cost as such a generic, search-based algorithm, while possibly 
>>>> incurring much less decode effort.
>>>
>>> But requires considerably more design and implementation effort.
>> Absolutely.  But less work at the application level to try to speed 
>> things up potentially.
>>>> Sandro Hawke wrote:
>>> I actually wrote most of the text here :)
>> Sorry, just worked with the message I replied to I think.
>>>>>>> For large numbers of triples, in my limited experience, the 
>>>>>>> things that affect RDF load speed
>>>>>> Ooo, I got a bit side tracked by the parsing bit.
>>>>>>>
>>>>>>> are: The speed of your disk. The size of your memory. Building 
>>>>>>> indexes. Duplicate suppression (triple, node, whatever). BNode 
>>>>>>> handling. IRI and datatype checks (if you do them). Parsing. Now 
>>>>>>> parsing is a factor, but it's fairly minor compared with the 
>>>>>>> basic business of storing the triples.
>>>>>> Indeed.
>>>> Storing them in memory is not nearly as expensive as parsing or 
>>>> serialization.
>>>
>>> This is counter to voiced experience. Sometimes it is; sometimes it 
>>> isn't.
>> Can be...  Certainly expensive data binding can eat a lot.
>>>>   Both of those steps are expensive and adding gzip only increases 
>>>> the expense.
>>> [snip]
>>>
>>> Clearly false. I mean, c'mon. It's just the case that there are 
>>> situations that compressing increases raw parsing/serialization 
>>> performance. If you compress a file 90% and decompress it as part of 
>>> the parsing process so that your io cost shrink by 90% it's clear 
>>> that you're going to do better. (And the more you can get in e.g., 
>>> L2 the better as well.)
>> It decreases the data movement cost but not the parsing/serialization 
>> itself.
>
> Which is why I added n triples to the mix. N-triples is a line 
> oriented, trivial to parse format.
It's trivial to read a bit or byte at a time, but that isn't necessarily 
efficient.  It is better, yes, than a heavy parser.
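
For example, here is a small sketch of buffered, line-at-a-time reading 
of gzipped N-Triples in Python; the decompressor streams through a 
buffer rather than materializing a full unzipped copy, and the splitting 
is deliberately naive (it ignores escaping inside literals):

import gzip

def iter_triples(path):
    # buffered text stream over the compressed file; no full unzipped copy
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            s, p, o = line.rstrip(" .").split(None, 2)  # naive: ignores escapes
            yield s, p, o

# usage (hypothetical file and store):
# for s, p, o in iter_triples("data.nt.gz"):
#     store.add(s, p, o)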
>> Clearly it is true that adding another CPU intensive task without 
>> changing the existing ones increases CPU cost.  That is different 
>> from decreasing latency by requiring less data to be read or written, 
>> and that may even decrease a significant CPU cost if I/O is expensive 
>> for some reason.  That total decreased latency is offset by the 
>> increased CPU load, and may make things worse if I/O isn't a 
>> significant issue, such as memory to memory processing.  The latter 
>> is and/or is likely to become common in situations where many modules 
>> are communicating in the most standard and loosely coupled way.  Now, 
>> they either have to exchange text XML/RDF or some not-standard memory 
>> format.
>
> Sure, but I generally understand "load" to mean something like "I've a 
> big ass file on disk, perhaps that someone sent me, and i want to 
> process/query it so need to load it into my triple store". Other 
> scenarios will have different characteristics.
>
>>> (A quick google found:
>>>     
>>> http://www.lst.inf.ethz.ch/research/publications/publications/USENIX_2005/USENIX_2005.pdf 
>>>
>> Refers to compressing pages at the OS level to avoid paging.  Not a 
>> solution here.
>>>     http://support.citrix.com/article/CTX112407
>> Refers to compressing session stream with small buffers (memory only) 
>> vs. large buffers (disk cache of past cases), not applicable.
>>>     http://www.cs.utexas.edu/users/oops/papers.html esp. paper 5)
>> Refers to compressing disk cache blocks, also not applicable.
>
> I think you miss my point with these references. I was trying to show 
> that compression can be a performance enhancer in relevantly similar 
> situations.
>
>>> Now, when it's the case and when it isn't is an important question 
>>> and dependent on the application. The advantage of gzipping is that 
>>> it's easy to get going. For many cases it more than likely is 
>>> sufficient.
>> Until there is a better alternative.  There's nothing wrong with 
>> using gzip compression, except when there is a more direct 
>> alternative available.
>
> Ok, now we're back to the perspective I was coming from.
>
>>   I'm working on it.
>>
>> In other words, it is only wrong in the sense that it is not the most 
>> overall optimal way to solve the entire data representation / 
>> exchange / binding problem.
>
> I think we agree there.
>
> Well, I certainly got an interesting read out of this, so I'm happy :)

Great!
>
> Cheers,
> Bijan.

sdw
