
Re: Binary XML (was: Re: Draft minutes of 15 March 2005 Telcon)

From: <noah_mendelsohn@us.ibm.com>
Date: Mon, 21 Mar 2005 11:23:43 -0500
To: Chris Lilley <chris@w3.org>
Cc: Robin Berjon <robin.berjon@expway.fr>, www-tag@w3.org
Message-ID: <OF8489B06A.ACB04CD1-ON85256FCB.00589D88@lotus.com>

Chris Lilley writes:

> It wasn't provided as one. It was provided as a
> caution; when benchmarking the improvement from a
> binary XML standard, be sure you are measuring the
> effect on total system throughput rather than the
> effect on some small part of it, because it might
> affect other parts.

Sorry, my misunderstanding.   I agree completely.  Speaking as one who for 
3 years led a project on high performance XML processing, that's only one 
of the many details you have to consider.  For example, if you're 
concerned about the parser in isolation (not the higher level stuff), you 
still need to use a realistic model of memory:  will you be repeatedly 
validating large numbers of documents?  If so, are you going to put them 
in the same memory locations, or are they likely to be scattered?  If 
scattered (which is typical in many systems), then it's a mistake to 
benchmark a processor by validating the same document 1000 times in place 
and dividing the total time by 1000.  A more realistic approach is to lay 
out 1001 copies in separate memory locations, validate the first one 
untimed (to warm up the code paths), and then time the next 1000.  Why?  Many processors 
have first level caches large enough to fit a few copies of reasonable XML 
documents.  If in your real use the documents are likely to be scattered, 
then you won't hit the first level cache on the typical access.  Yes, this 
can make a very big difference.   In modern machines, RAM is much slower 
than cache.  I've also assumed in this example that it's realistic for the 
code to be cached, but whether that's true also depends on your intended 
use.
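
To make the methodology concrete, here is a minimal sketch in Python (the 
document, copy count, and parser are illustrative; the cache effects I'm 
describing are far more pronounced in lower-level code, but the shape of a 
careful measurement is the same):

```python
import time
import xml.etree.ElementTree as ET

DOC = b"<order><item sku='A1' qty='3'/><item sku='B2' qty='1'/></order>"
N = 1000

# Lay out N+1 independent copies so each parse touches different memory,
# approximating documents scattered across a real system's address space.
copies = [bytes(bytearray(DOC)) for _ in range(N + 1)]

# Parse the first copy without timing it, so the parser code is warm.
ET.fromstring(copies[0])

# Time the remaining N parses over the scattered copies.
start = time.perf_counter()
for buf in copies[1:]:
    ET.fromstring(buf)
scattered = time.perf_counter() - start

# Contrast: parse the *same* buffer in place N times.  This keeps the
# data hot in cache and tends to flatter the parser.
start = time.perf_counter()
for _ in range(N):
    ET.fromstring(DOC)
in_place = time.perf_counter() - start

print(f"scattered: {scattered:.4f}s  in-place: {in_place:.4f}s")
```

Note that neither loop touches the filesystem, which keeps file access time 
out of the parsing numbers.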

Of course, that's before you get to the fact that I've seen people quote 
XML benchmarks in which the file access time to read in the document is 
included in the published parsing times! 

I don't think we want a big thread here on details of doing careful XML 
benchmarks; I hope the Binary Characterization WG is doing that, but I 
appreciate the clarification.  I think we'd all agree that doing careful 
benchmarks depends in part on understanding how your parser will be used, 
and which aspects of performance you care about.

Then again, I think we're at risk of missing a bigger point.  If a binary 
XML is going to have the kind of impact that would justify incompatibility, 
it had better offer speed and size improvements that are dramatic even with 
relatively casual benchmarks.  I completely agree that sufficiently sloppy 
benchmarks can hide almost anything, but I do think the gains in speed 
and/or size have to be pretty dramatic to make the case for Binary XML as 
a standard.

> nuic> I think you need to be very careful heading
> nuic> down this path, depending on your use
> nuic> case. The term PSVI in particular relates to
> nuic> schema validation. In many cases the reason
> nuic> you are doing schema validation is because
> nuic> you don't entirely trust the source of the
> nuic> data. Once you're doing other aspects of
> nuic> validation to check the data, I would claim
> nuic> (having built such systems) that type
> nuic> assignment is nearly free in many cases.
> 
> I would claim that whether this is 'nearly free'
> needs to be measured under controlled conditions,
> not merely asserted.

I agree, measurement is generally essential.  Then again, I can tell you 
that I have one implementation in which the additional overhead is 
sufficiently close to zero lines of code that I would be very surprised if 
the measurements didn't bear out my claim.  Unfortunately, I am not in a 
position to share this code at the moment, but I can say we have been 
experimenting with compilation-oriented techniques.  I don't think it's a 
great leap to see how in such a system, once you know which element you've 
got so you can validate it, you know the assigned type statically.
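
As a purely hypothetical sketch of what I mean (this is not the 
implementation I mentioned, and the element names and types are invented 
for illustration): a compiled validator is essentially a dispatch table of 
per-element functions, and each function already knows its schema type 
statically, so recording the type adds no work beyond the dispatch that 
validation needs anyway.

```python
# Hypothetical compiled validator: one function per element, generated
# ahead of time from the schema.  The schema type is a constant baked
# into each code path, so "type assignment" is free at validation time.

def validate_quantity(text):
    value = int(text)                 # structural check: must be an integer
    if value < 0:
        raise ValueError("quantity must be non-negative")
    return ("xs:nonNegativeInteger", value)   # type known statically

def validate_sku(text):
    if not text:
        raise ValueError("sku must be non-empty")
    return ("xs:string", text)                # type known statically

COMPILED = {"quantity": validate_quantity, "sku": validate_sku}

def validate(element_name, text):
    # A single dispatch performs both validation and type assignment.
    return COMPILED[element_name](text)

print(validate("quantity", "3"))   # ('xs:nonNegativeInteger', 3)
```

Once you have dispatched on the element name in order to validate it at 
all, the assigned type comes along for nothing.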

Noah

--------------------------------------
Noah Mendelsohn 
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------
Received on Monday, 21 March 2005 16:24:22 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Thursday, 26 April 2012 12:47:33 GMT