- From: <noah_mendelsohn@us.ibm.com>
- Date: Mon, 21 Mar 2005 11:23:43 -0500
- To: Chris Lilley <chris@w3.org>
- Cc: Robin Berjon <robin.berjon@expway.fr>, www-tag@w3.org
Chris Lilley writes:

> It wasn't provided as one. It was provided as a caution; when
> benchmarking the improvement from a binary XML standard, be sure you
> are measuring the effect on total system throughput rather than the
> effect on some small part of it, because it might affect other parts.

Sorry, my misunderstanding. I agree completely. Speaking as one who for 3 years led a project on high-performance XML processing, that's only one of the many details you have to consider.

For example, if you're concerned about the parser in isolation (not the higher-level stuff), you still need a realistic model of memory: will you be repeatedly validating large numbers of documents? If so, are you going to put them in the same memory locations, or are they likely to be scattered? If scattered (which is typical in many systems), then it's a mistake to benchmark a processor by validating the same document 1000 times in place and dividing the total time by 1000. More realistic is to lay out 1001 copies in separate memory locations, validate the first one and ignore its timing, and then time the next 1000.

Why? Many processors have first-level caches large enough to hold a few copies of a reasonably sized XML document. If in your real use the documents are likely to be scattered, then you won't hit the first-level cache on the typical access. Yes, this can make a very big difference: in modern machines, RAM is much slower than cache. I've also assumed in this example that it's realistic for the code to be cached, but whether that's true also depends on your intended use.

Of course, that's before you get to the fact that I've seen people quote XML benchmarks in which the file access time to read in the document is included in the published parsing times! I don't think we want a big thread here on the details of doing careful XML benchmarks; I hope the Binary Characterization WG is doing that, but I appreciate the clarification.
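For concreteness, the scattered-copies methodology above can be sketched roughly as follows. This is only an illustration (the document, copy count, and use of Python's ElementTree parser are my own choices, not anything from the project mentioned), and Python gives no real control over memory placement, so a C harness would model cache behavior far more faithfully:

```python
import time
import xml.etree.ElementTree as ET

# Illustrative test document; real benchmarks should use representative data.
doc = b"<root>" + b"<item>data</item>" * 200 + b"</root>"

# Lay out 1001 distinct copies rather than reusing one buffer in place,
# so successive parses touch different memory rather than a cache-warm copy.
copies = [bytes(bytearray(doc)) for _ in range(1001)]

ET.fromstring(copies[0])              # warm-up parse; ignore its timing

start = time.perf_counter()
for buf in copies[1:]:                # time the next 1000 parses
    ET.fromstring(buf)
elapsed = time.perf_counter() - start

print(f"mean parse time over 1000 scattered copies: {elapsed / 1000:.2e} s")
```

Comparing this figure against 1000 in-place parses of a single copy gives a rough sense of how much the cache-resident setup flatters the result.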
I think we'd all agree that doing careful benchmarks depends in part on understanding how your parser will be used, and which aspects of performance you care about. Then again, I think we're at risk of missing a bigger point: if a binary XML is going to have the kind of impact that would justify incompatibility, it had better offer speed and size improvements that are dramatic even with relatively casual benchmarks. I completely agree that sufficiently sloppy benchmarks can hide almost anything, but I do think the gains in speed and/or size have to be pretty dramatic to make the case for binary XML as a standard.

> nuic> I think you need to be very careful heading down this path,
> nuic> depending on your use case. The term PSVI in particular relates
> nuic> to schema validation. In many cases the reason you are doing
> nuic> schema validation is because you don't entirely trust the source
> nuic> of the data. Once you're doing other aspects of validation to
> nuic> check the data, I would claim (having built such systems) that
> nuic> type assignment is nearly free in many cases.
>
> I would claim that whether this is 'nearly free' needs to be measured
> under controlled conditions, not merely asserted.

I agree; measurement is generally essential. Then again, I can tell you that I have one implementation in which the additional overhead is sufficiently close to zero lines of code that I would be very surprised if the measurements didn't bear out my claim. Unfortunately, I am not in a position to share this code at the moment, but I can say we have been experimenting with compilation-oriented techniques. I don't think it's a great leap to see how, in such a system, once you know which element you've got so you can validate it, you know the assigned type statically.

Noah

--------------------------------------
Noah Mendelsohn
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------
Received on Monday, 21 March 2005 16:24:22 UTC