RE: Andrew Layman and Don Box Analysis of XML Optimization Techniques

Len Bullard writes:

> HTTP needed no formal analysis nor test cases. 
> HTML needed no formal analysis nor test cases.
> SOAP needed no formal analysis nor test cases.
> The proof was the use and the rapid deployment 
> with the exception of the third item which is 
> so far, unproven but the market is patient.

With respect, I don't think the measure of success for HTTP, HTML or SOAP 
was primarily performance.  If it had been, I would have expected the 
community to want quite a bit of shared experience with benchmarks and 
performance models before agreeing to standardization.

> The FastInfoset approach has been privately
> benchmarked and proven to be workable in much the
> same way as the cases given above.  Since faster
> performance is a customer requirement and not a
> theoretical issue, customers can go to the
> innovators who provide the necessary technology.

> That would be, in this case, Sun.  They are of
> course, possibly willing to license that
> technology to their partner in Redmond which has
> slower and late to market technology to assist
> them in coming to market.

I am aware that Sun has done FastInfoset benchmarks.  Having spent nearly 
four years leading teams doing high-performance XML implementations, I can 
tell you that any such benchmarks have to be run with great care.  You 
need to do things like lay out your buffers to match your likely usage 
patterns, because layout affects processor cache hit ratios, and yes, that 
can make a very noticeable difference.  You also need to choose the 
appropriate text-based parsers against which to compare.  For example, 
Xerces has many wonderful characteristics that make it the right choice 
for many purposes, but it is nowhere near the fastest parser you can write 
for many important high-performance applications.  I'm not implying that 
Sun has or hasn't done a good job on these things, but as with many 
things, it's healthy to have publicly available tests that can be 
reproduced and studied.
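
To make that concrete, here is the shape of a minimal harness.  This is a 
sketch of my own, not anything Sun or IBM has published: it keeps the 
document in memory, reuses the parser, and runs warm-up iterations so that 
JIT compilation and cache effects settle before the clock starts.

    import java.io.*;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.helpers.DefaultHandler;

    public class ParseBench {
        public static void main(String[] args) throws Exception {
            byte[] doc = readAll(new FileInputStream(args[0])); // keep I/O out of the timed loop
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            DefaultHandler sink = new DefaultHandler();         // no-op handler: times the parser alone
            for (int i = 0; i < 500; i++)                       // warm-up: JIT compilation, caches
                parser.parse(new ByteArrayInputStream(doc), sink);
            int runs = 5000;
            long t0 = System.currentTimeMillis();
            for (int i = 0; i < runs; i++)
                parser.parse(new ByteArrayInputStream(doc), sink);
            double sec = (System.currentTimeMillis() - t0) / 1000.0;
            System.out.println(((long) runs * doc.length) / sec + " bytes/sec");
        }

        static byte[] readAll(InputStream in) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) > 0; ) out.write(buf, 0, n);
            in.close();
            return out.toByteArray();
        }
    }

Vary the handler, the document mix, or the buffer sizes and the relative 
numbers can shift substantially; that is exactly why the methodology needs 
to be published along with the results.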

In the particular case of FastXML, my understanding is that there were two 
flavors.  One was a schema-dependent implementation that relied on 
agreement between sender and receiver as to the format of the document. 
Tag information was sent only in cases like <choice>, where sender and 
receiver could not presume what was to be inferred.  That's an interesting 
design point, but it loses many of XML's appealing characteristics of 
self-description.  I suspect it will prove more problematic as we start to 
do more work on versioning and extensibility, and as we see more 
applications exchanging information for which there is only partial 
agreement on the layout.  I understand there was another embodiment of 
FastXML that sent a full infoset, though I'm still unclear on whether it 
depended on type information: whether, for example, it could distinguish 
the following two instances:

        <e xsi:type="xsd:integer">123</e>
        <e xsi:type="xsd:integer">00123</e>

To be a true Infoset implementation usable in SOAP, for example, you must 
be able to distinguish the above.  Note that the usual digital signatures 
on these will be different.
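
The point is easy to demonstrate: the two instances above carry the same 
typed value but different octet streams, so any digest computed over the 
octets differs.  A quick sketch of my own, using SHA-1 merely as a 
stand-in for whatever digest a given signature uses:

    import java.security.MessageDigest;

    public class LexicalForms {
        public static void main(String[] args) throws Exception {
            String a = "<e xsi:type=\"xsd:integer\">123</e>";
            String b = "<e xsi:type=\"xsd:integer\">00123</e>";
            MessageDigest sha = MessageDigest.getInstance("SHA-1");
            System.out.println(hex(sha.digest(a.getBytes("UTF-8")))); // the two digests differ,
            System.out.println(hex(sha.digest(b.getBytes("UTF-8")))); // though the typed values are equal
        }

        static String hex(byte[] d) {
            StringBuffer s = new StringBuffer();
            for (int i = 0; i < d.length; i++)
                s.append(Integer.toHexString((d[i] & 0xff) | 0x100).substring(1));
            return s.toString();
        }
    }

An encoding that stores only the typed integer and regenerates a canonical 
lexical form on output would therefore invalidate a signature computed 
over the original text.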

Are there published benchmarks of both of the above?  Running in which 
sorts of applications?  Throwing SAX events?  Deserializing to JAX-RPC? 
All of these things make a difference.  That's why we need public 
discussion and debate, based on benchmarks that not only yield good 
numbers, but that can be evaluated by the community to ensure that they 
accurately reflect what are likely to be realistic usage patterns.  Are 
both of the FastXML approaches deemed to be of much higher performance 
than text, or only the schema-dependent one?
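
To illustrate why the consumer matters: the receiving side of those SAX 
events can be nearly free or quite expensive.  A counting handler like the 
following sketch (mine, not from any published benchmark) costs almost 
nothing per event; a JAX-RPC deserializer that builds Strings and converts 
lexical forms into ints and dates can make the very same parser look 
several times slower.

    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // Nearly the cheapest possible consumer: counts events,
    // never materializes a String from the character data.
    public class CountingHandler extends DefaultHandler {
        public long elements, chars;

        public void startElement(String uri, String local, String qname, Attributes atts) {
            elements++;
        }

        public void characters(char[] ch, int start, int length) {
            chars += length;  // reads only the length; no copy, no allocation
        }
    }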

Also, while I introduced the mention somewhat jokingly in my intro to 
Andrew's and Don's work, with enough expertise you can actually do some 
semi-formal performance models of these things.  Doing so depends on 
knowing a lot about how your systems and languages run, but in my 
experience people who build high-performance implementations over a number 
of years develop fairly good intuitions about where the time is going. 
For example, knowing the performance characteristics of your UTF-8 to 
UTF-16 conversion routines can be a really useful predictor of lower 
bounds on the performance of certain implementations.  It's usually quite 
easy to add up on a whiteboard how many such conversions, and of what 
length, will be done in various situations.  Likewise for hashtable 
lookups, string pool accesses, etc.  I'd feel better if I saw more such 
things discussed quantitatively in the community that's recommending a 
Binary XML standard.
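
For what it's worth, here is the kind of measurement that anchors such 
whiteboard arithmetic, again a sketch of my own: time the UTF-8 to UTF-16 
conversion by itself, and you have a ceiling that no parser obliged to 
perform that conversion can exceed.

    public class TranscodeBound {
        public static void main(String[] args) throws Exception {
            byte[] utf8 = new byte[1 << 20];          // 1 MB; all-ASCII is the fast path,
            java.util.Arrays.fill(utf8, (byte) 'a');  // multi-byte sequences would be slower
            long sink = 0;
            for (int i = 0; i < 100; i++)             // warm-up
                sink += new String(utf8, "UTF-8").length();
            int runs = 1000;
            long t0 = System.currentTimeMillis();
            for (int i = 0; i < runs; i++)            // each iteration converts 1 MB to UTF-16
                sink += new String(utf8, "UTF-8").length();
            double sec = (System.currentTimeMillis() - t0) / 1000.0;
            System.out.println(runs / sec + " MB/sec (sink=" + sink + ")");
        }
    }

If that prints, say, 200 MB/sec on your machine, then no text-XML parser 
that converts every byte can sustain more than 200 MB/sec there, whatever 
else it optimizes.  The same style of bound applies to hashtable lookups 
and string pool accesses.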

In summary, I think it is important to have a public debate about 
quantitative performance issues, preferably based on carefully run and 
reproducible benchmarks.

--------------------------------------
Noah Mendelsohn 
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------