Re: binaryXML, marshalling, and trust boundaries from Robin Berjon on 2002-12-11 (www-tag@w3.org from December 2002)

From: Robin Berjon <robin.berjon@expway.fr>
Date: Wed, 11 Dec 2002 19:22:35 +0100
To: Tim Bray <tbray@textuality.com>
CC: www-tag@w3.org
Message-ID: <3DF7826B.8070405@expway.fr>

Tim Bray wrote:
> Michael Mealling wrote:
>> I for one would appreciate it. There are several protocols I've been
>> working with that, due to their particular nature, would benefit from an
>> efficient serialization that was very specifically _not_ 'just gzip'.
>> The model we're working with requires the impact to the server to be
>> very low as well since the cost to recover is higher than the cost to
>> requery. If gzip is used then that relationship flipflops and the impact
>> to the entire system is extremely significant. Thus the reason why we
>> keep coming back to WBXML as the solution.
> 
> This is a tough problem.  If the tag density is very high relative to 
> running text, you can try to binary-encode markup with a dictionary 
> (what WBXML does IIRC); of course if you wanted to retain XML's virtue 
> of being self-contained you'd want to include the dictionary in the 
> message, which would blow off most of the benefit in the case the 
> messages are short.  Another approach would be simply to be rigorously 
> minimal in choosing tag names, e.g.
> 
>    <m a="33.34.44.55" from="foo@bar.org"><a u="3" h="ajfoeiw"/></m>
> 
> at which point the savings from compression are less significant.

WBXML[1] is token-based, so indeed it won't gain you much over that. Other XML 
compression methods can, however, but they require a priori knowledge of the 
schema (schema-based binary infoset encoders can usually also encode information 
without the schema, but the gains in size are then similar to gzip, while the 
gains in speed remain significant). If you know that element m must contain 
element a and must have attributes a and from, you can encode that information 
in a tiny amount of space. Also, if your schema provides typing and pattern 
metadata, you may compress even further. For instance, if 33.34.44.55 is defined 
as four integers between 0 and 255 separated by dots, you can encode it in four 
bytes instead of the 11 it takes as text in this instance.
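
To make that concrete, here is a rough Python sketch of the idea, not the 
format of any actual encoder. It assumes a hypothetical schema for the example 
message above in which the structure is fixed (m always has attributes a and 
from and contains one a element with attributes u and h), each dotted integer 
fits in one unsigned byte (0 to 255), u is a small unsigned integer, and from 
and h are strings of at most 255 bytes. All function names and the byte layout 
are invented for illustration.

```python
import struct

# The example message from the quoted text, kept for size comparison.
XML_TEXT = '<m a="33.34.44.55" from="foo@bar.org"><a u="3" h="ajfoeiw"/></m>'


def pack_quad(text):
    """Encode 'n.n.n.n' as four bytes (assumes each n is in 0..255)."""
    octets = [int(part) for part in text.split(".")]
    if len(octets) != 4:
        raise ValueError("expected four dot-separated integers")
    return bytes(octets)  # bytes() raises ValueError outside 0..255


def unpack_quad(data):
    """Decode four bytes back into dotted text."""
    return ".".join(str(b) for b in data)


def pack_str(text):
    """Encode a string as a one-byte length followed by UTF-8 bytes."""
    raw = text.encode("utf-8")
    if len(raw) > 255:
        raise ValueError("string too long for a one-byte length prefix")
    return bytes([len(raw)]) + raw


def encode_message(a, sender, u, h):
    """Emit only the values; names and nesting are implied by the schema."""
    return pack_quad(a) + pack_str(sender) + struct.pack("B", u) + pack_str(h)


def decode_message(data):
    """Rebuild the values in schema order: a, from, u, h."""
    a = unpack_quad(data[:4])
    pos = 4
    n = data[pos]
    sender = data[pos + 1:pos + 1 + n].decode("utf-8")
    pos += 1 + n
    u = data[pos]  # already an integer, no string parsing needed
    pos += 1
    n = data[pos]
    h = data[pos + 1:pos + 1 + n].decode("utf-8")
    return a, sender, u, h


encoded = encode_message("33.34.44.55", "foo@bar.org", 3, "ajfoeiw")
print(len(XML_TEXT.encode("utf-8")), "bytes as XML text")
print(len(encoded), "bytes with the assumed schema")
```

Since both ends are assumed to share the schema, no element or attribute name 
is transmitted at all, and the decoder hands back u as an integer directly. 
The trade-off is the one Tim describes: the message is no longer 
self-contained.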

So it all depends on the approach, but the compression and speed gains can be 
substantial. The optional typing can also benefit the folks who work only at 
the infoset level (which is what binary infosets are really made for, to be 
honest), sparing them the cost of the stringification/microparse.

[1] http://www.w3.org/TR/wbxml/

-- 
Robin Berjon <robin.berjon@expway.fr>
Research Engineer, Expway
7FC0 6F5F D864 EFB8 08CE  8E74 58E6 D5DB 4889 2488

Received on Wednesday, 11 December 2002 13:23:08 UTC