RE: Question about baseline textual parser performance (wrt binary ones)

Hello Tatu,

Many thanks for taking an interest in the work of the EXI working group.

You wrote: 
> Apologies if this has been asked earlier, but after reading the
> published draft, I noticed that comparisons seemed to only include
> parsers expected to be faster than the commonly used one. I can
> understand the desire to keep number of implementations measure
> limited, but I was hoping that in addition to "best of the best",
> couple of most commonly used parsers (like, Xerces-J) could also be
> included.

Our measurements include both the JAXP parser (i.e. the standard JDK
parser, whatever that happens to be) and an optimized parser. The
rationale for using the optimized parser as reference in the
measurements note is that, as EXI will introduce a new and optimized XML
serialization format, we need to prove that the performance improvement
actually derives from the new format rather than merely from improved
implementation techniques. One way to achieve this is to compare the
respective best of breed. A slightly different way of putting it is: A
performance oriented developer would almost certainly use an optimized
implementation before considering a change in the underlying format. Our
paper answers the question of how much more such a developer could
expect by changing to the EXI format instead.

> My own selfish motivation is that this would also allow me to compare
> relative performance of the java xml parser I am mostly working on
> (Woodstox), even  if I couldn't get access to (or have time to get
> ones written in other languages) the fastest ones included in exi
> experiments. For example, observing that the performance difference
> between Xerces and Woodstox appears to be somewhere between 20 - 40%
> would allow me to infer approximate ratios to faster parsers.

The actual performance differential varies greatly between test
documents and use cases.

From a slightly older test run: the (harmonic) mean over a large range
of documents is 8.6 tps for JAXP and 10.2 tps for XALS.
(tps: transactions per second. The test suite measures and reports
results as the number of document parses over a given time.) However, those
results are skewed by a number of pathological test cases which consist
almost entirely of character data with very little actual XML in them.
When browsing over the individual test cases I regularly see a factor of
1.5x, and there are a number of test cases where we see a factor of 2x.
(Over real world test data, representing various use cases.)
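
To illustrate how a few slow documents can dominate a harmonic mean, here
is a minimal sketch in Python. The throughput figures are made up for
illustration only and are not taken from the EXI test runs; the names
"baseline" and "optimized" are hypothetical stand-ins for a general-purpose
and an optimized parser.

```python
from statistics import harmonic_mean

# Hypothetical per-document throughput in parses per second.
# These numbers are invented for illustration, not measured results.
# The last two entries mimic large, character-data-heavy documents where
# both parsers are slow and nearly equal.
baseline = [40.0, 30.0, 20.0, 2.0, 2.0]
optimized = [80.0, 45.0, 30.0, 2.1, 2.1]

# Speedup of the optimized parser on each individual document.
per_doc_speedup = [o / b for o, b in zip(optimized, baseline)]

# Speedup as seen through the aggregate (harmonic mean) figures.
aggregate_speedup = harmonic_mean(optimized) / harmonic_mean(baseline)

for doc, speedup in enumerate(per_doc_speedup, start=1):
    print(f"document {doc}: {speedup:.2f}x")
print(f"ratio of harmonic means: {aggregate_speedup:.2f}x")
```

With these assumed numbers, three of the five documents show a speedup of
1.5x or more, yet the ratio of the harmonic means stays below 1.1x, because
the harmonic mean is dominated by the lowest throughput values.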

> So, is there a chance that one or two of the most commonly (if not
> fastest) used compliant xml parsers could also be included, for
> baselining purposes?

I suspect we won't be incorporating additional parsers into the test
suite at this stage, but I would hope that by including JAXP parsing we
already meet your requirement.

I will leave it to the paper's editors to decide how exactly the data
will be presented. I assume that we will publish more detailed test data
when we have more complete and stable measurement results, so that you
could do your own analysis. Unfortunately I do not yet know when, how
much, and in which form.


Tatu, I hope this helps. Please let me (and the list) know if you have
additional questions.


Sincerely,
Daniel Vogelheim

Received on Monday, 16 October 2006 01:20:05 UTC