Re: [xml-dev] binary XML API and scientific use cases [Re: [xml-dev] [ANN] nux-1.0beta2 release

Alex, see comments inline below...

On Nov 22, 2004, at 1:01 PM, Aleksander Slominski wrote:

> Wolfgang Hoschek wrote:
>
>> This is to announce the nux-1.0beta2 release 
>> (http://dsd.lbl.gov/nux/).
>>
>> Nux is a small, straightforward, and surprisingly effective 
>> open-source extension of the  XOM XML library.
>>
> hi Wolfgang,
>
> the natural question is: how does it compare to XBIS?

Among other things, we also benchmarked with the test XML files that 
come with XBIS (thanks to Dennis Sosnoski for the great work - much 
appreciated). It would be interesting to directly compare performance 
with XBIS, but so far we did not do so, for two reasons:

- XBIS currently does not work with XOM (misses some XMLReader 
features/properties that XOM requires)
- XBIS measures performance from and to SAX event streams. bnux measures 
performance from XOM documents to byte arrays, and back. bnux includes 
XOM tree walking, tree building, and the inherent XOM XML 
well-formedness checks, which is significantly more expensive (and also 
more useful, since it measures delivering data from/to a large number of 
real-world applications, rather than low-level SAX apps). In other 
words, the benchmarking methodology is different. It would not be an 
apples-to-apples comparison. Might still be interesting, though.

>
> can it be divorced from XOM?

The concept is applicable to any DOM-like tree model and probably any 
infoset based model. The implementation is specific to XOM.

>
>
>> Features include:
>>     •     Seamless W3C XQuery and XPath support for XOM, through 
>> Saxon.
>>     •     Efficient and flexible pools and factories for XQueries, 
>> XSL Transforms, as well as Builders that validate against various 
>> schema languages, including W3C XML Schemas, DTDs, RELAX NG, 
>> Schematron, etc.
>>     •     Serialization and deserialization of XOM XML documents to 
>> and from  an efficient and compact custom binary XML data format 
>> (bnux format), without loss or change of any information.
>>     •     For simple and complex continuous queries and/or 
>> transformations over very large or infinitely long XML input, a 
>> convenient streaming path filter API combines full XQuery support 
>> with straightforward filtering.
>>     •     Glue for integration with JAXB and for queries over 
>> ill-formed HTML.
>>     •     Well documented API. Ships in a jar file that weighs just 
>> 60 KB.
>>
>> Changelog:
>>
>> XOM serialization and deserialization performance is more than good 
>> enough for most purposes. However, for particularly stringent 
>> performance requirements this release adds "bnux", an option for 
>> lightning-fast binary XML serialization and deserialization.
>
> did you compare BNUX and XBIS performance?

see above.

>
>> Contrasting bnux with XOM:
>>
>>     •     Serialization speedup: 2-7 (10-35 MB/s vs. 5 MB/s)
>>     •     Deserialization speedup: 4-10 (20-50 MB/s vs. 5 MB/s)
>>     •     XML data compression factor: 1.5 - 4
>>
>> For a detailed discussion and background see 
>> http://dsd.lbl.gov/nux/api/nux/xom/binary/BinaryXMLCodec.html
>>
> XOM is tree model so how do you do streaming - it by streaming partial 
> XOM tree construction/deconstruction when you access data (overriding 
> |endElement()| in |NodeFactory|) and manually keep detach-ing() nodes 
> or just letting them to be GCed?

Currently we do not do streaming.

The bnux serialization algorithm is a three-pass batch algorithm, hence 
buffer-oriented, not stream-oriented. It has a throughput profile with 
short critical paths, rather than a low latency profile with long 
critical paths, rendering it ideal for large volumes of small to 
medium-sized XML documents, and impractical for individual documents 
that do not fit into main memory. The bnux deserialization algorithm 
is a single pass algorithm, and could in theory be streamed through a 
NodeFactory, but the current implementation does not do so.

The serialization algorithm could be restructured to be a single pass 
algorithm at the expense of compression; performance would probably be 
roughly the same. Turning the single pass algorithm into a chunked 
streaming algorithm using "pages" would be possible but complicated, 
probably reducing performance. We have not tried it, though.
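
To make the batch vs. single pass trade-off concrete, here is a generic 
sketch. It is NOT the bnux format or the nux API; the class 
SymbolTableSketch and all its methods are hypothetical, and the use of a 
symbol table is an assumption made purely for illustration. The batch 
variant has to see all names before it can write the table header, so it 
cannot start emitting output early; the single pass variant can emit 
immediately but repeats every name in full.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the bnux wire format.
public class SymbolTableSketch {

    /** Batch: pass 1 collects distinct names, pass 2 writes table + indexes. */
    static byte[] encodeBatch(List<String> names) {
        Map<String, Integer> table = new LinkedHashMap<>();
        for (String name : names) {
            table.putIfAbsent(name, table.size());
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeShort(out, table.size());          // sketch limit: < 65536 names
        for (String name : table.keySet()) {
            writeString(out, name);
        }
        writeShort(out, names.size());          // sketch limit: < 65536 entries
        for (String name : names) {
            writeShort(out, table.get(name));   // 2-byte index instead of text
        }
        return out.toByteArray();
    }

    /** Single pass: every name is written in full as soon as it is seen. */
    static byte[] encodeSinglePass(List<String> names) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String name : names) {
            writeString(out, name);
        }
        return out.toByteArray();
    }

    /** Decodes the output of encodeBatch in one pass over the buffer. */
    static List<String> decodeBatch(byte[] data) {
        ByteBuffer in = ByteBuffer.wrap(data);  // big-endian by default
        String[] table = new String[in.getShort() & 0xFFFF];
        for (int i = 0; i < table.length; i++) {
            table[i] = readString(in);
        }
        int count = in.getShort() & 0xFFFF;
        List<String> names = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            names.add(table[in.getShort() & 0xFFFF]);
        }
        return names;
    }

    private static void writeShort(ByteArrayOutputStream out, int value) {
        out.write(value >>> 8);                 // high byte first (big-endian)
        out.write(value);
    }

    private static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeShort(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    private static String readString(ByteBuffer in) {
        byte[] utf8 = new byte[in.getShort() & 0xFFFF];
        in.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

With many repeated element names the batch output is much smaller, 
which is the kind of compression one would give up by going single pass.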

>
> what are use cases for nux: what do you plan to use it for?

The algorithm is primarily intended for tightly coupled 
high-performance systems exchanging large volumes of XML data over 
networks, as well as for compact main memory caches and for short-term 
storage as BLOBs in backend databases or files (e.g. "session" data 
with limited duration).

>
> are use cases related to XML Binary Characterization 
> <http://www.w3.org/TR/xbc-use-cases/>?

They might fit into that "diverse" bag-of-things as well...

>
> i am a bit disappointed that scientific requirements are completely 
> omitted form XBC use cases - the closest i could find is 
> http://www.w3.org/TR/xbc-use-cases/#FPenergy but it skips over whole 
> issue how to transfer array of doubles without changing endianess ...

I may be wrong, but conversion of doubles to strings and back seems the 
main CPU drain here, rather than byte swapping. Try doing this for 
billions of floats, gulp. Hence one would need to ship arrays of 
doubles in IEEE floating point representation or native format to avoid 
string conversions, perhaps most appropriately as an "attachment" 
according to the various related standards out there. When working with 
a binary representation, one could also extend DOM-like APIs in 
somewhat counter-intuitive manners, with subclasses like 
DoubleArrayText, converting from double to IEEE floating point and 
back, or similar.
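
As a rough sketch of the difference between the two paths, using only 
the JDK: the class DoubleArrayCodec below is hypothetical (it is not 
part of nux or XOM, and it is not the DoubleArrayText idea itself). It 
packs a double[] as raw 8-byte IEEE 754 values with an explicit byte 
order, so both ends agree on endianness, and shows the text path next 
to it for comparison.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration; not part of nux or XOM.
public class DoubleArrayCodec {

    /** Packs doubles as raw 8-byte IEEE 754 values in the given byte order. */
    static byte[] toBytes(double[] values, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Double.BYTES).order(order);
        buf.asDoubleBuffer().put(values);
        return buf.array();
    }

    /** Unpacks doubles written by toBytes; the byte order must match. */
    static double[] fromBytes(byte[] bytes, ByteOrder order) {
        double[] values = new double[bytes.length / Double.BYTES];
        ByteBuffer.wrap(bytes).order(order).asDoubleBuffer().get(values);
        return values;
    }

    /** Text path: one decimal string per value, separated by spaces. */
    static String toText(double[] values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(Double.toString(values[i]));
        }
        return sb.toString();
    }

    /** Parses the output of toText back into doubles. */
    static double[] fromText(String text) {
        if (text.isEmpty()) {
            return new double[0];
        }
        String[] tokens = text.split(" ");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        double[] data = {0.1, -2.5e300, Math.PI};
        byte[] raw = toBytes(data, ByteOrder.BIG_ENDIAN);
        System.out.println(raw.length + " bytes as binary, "
                + toText(data).length() + " chars as text");
    }
}
```

The binary path is a bulk copy with at most a byte swap per value, 
whereas the text path formats and parses every number individually.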

>
> we did lot of work in past related to XML performance (in Indiana 
> University and Binghamton) and are very concerned that whatever binary 
> XML will be characterized/standardized in W3C will be of no much use 
> for scientific computing and grids ...


You would need strong advocates/evangelists, it seems.

Regards,
Wolfgang.

Received on Tuesday, 23 November 2004 04:04:05 UTC