Numeric comparison benchmarks (C++)

Some background:

I've been lobbying the XML Schema working group to reestablish the "real" datatype that had been in Datatypes until the 17 Dec 1999 draft.  Prior to the 17 Dec draft, there had been a "real" datatype
that was an arbitrary range and precision floating point value.  There was a minAbsoluteValue facet that I didn't like (for details
see http://lists.w3.org/Archives/Public/www-xml-schema-comments/1999OctDec/0024.html).

In the 17 Dec draft, the minAbsoluteValue facet disappeared, the double and float datatypes were added as primitive datatypes (corresponding to IEEE), and the "real" datatype was removed.  I don't have
any problems with the first two, but I still think that there is substantial justification for an arbitrary range and precision floating point type.

One of the points I tried to make (for others, see http://lists.w3.org/Archives/Public/www-xml-schema-comments/2000JanMar/0133.html) was that a lexical-based comparison for evaluation of min and max
constraints could be substantially faster than conversion to double followed by an IEEE floating point comparison, and that unless you really want to mimic the rounding effects of IEEE, you should not
have to pay that penalty.  Basically, a lexical-based comparison would see 1.00000000000000000000000001 as greater than one and would accept it if you had a <minExclusive value="1.0"/>.  An IEEE-based
comparison would have to realize that 1.00000000000000000000000001 rounds to 1.0 and that 1.0 is not greater than 1.0.
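To make the idea concrete, here is a minimal sketch of that kind of lexical comparison.  This is not the code I benchmarked; it ignores signs, exponents and error handling:

    #include <string>

    // Minimal sketch: compare two non-negative decimal lexicals without
    // converting them to double.  Signs, exponents and error handling are
    // omitted for brevity.  Returns <0, 0 or >0 as a < b, a == b, a > b.
    int lexicalCompare(const std::string& a, const std::string& b)
    {
        // Split into integer and fraction parts.
        std::string::size_type dotA = a.find('.');
        std::string::size_type dotB = b.find('.');
        std::string intA  = (dotA == std::string::npos) ? a : a.substr(0, dotA);
        std::string intB  = (dotB == std::string::npos) ? b : b.substr(0, dotB);
        std::string fracA = (dotA == std::string::npos) ? std::string() : a.substr(dotA + 1);
        std::string fracB = (dotB == std::string::npos) ? std::string() : b.substr(dotB + 1);

        // Drop leading zeros from the integer parts.
        intA.erase(0, intA.find_first_not_of('0'));
        intB.erase(0, intB.find_first_not_of('0'));

        // A longer (normalized) integer part means a larger value.
        if (intA.size() != intB.size())
            return intA.size() < intB.size() ? -1 : 1;
        if (int c = intA.compare(intB))
            return c;

        // Integer parts are equal: compare fraction digits, padding the
        // shorter one with trailing zeros so the lengths match.
        std::string::size_type n =
            fracA.size() > fracB.size() ? fracA.size() : fracB.size();
        fracA.resize(n, '0');
        fracB.resize(n, '0');
        return fracA.compare(fracB);
    }

With something like this, "1.00000000000000000000000001" compares greater than "1.0" without ever touching the FPU.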

Unfortunately, I could not provide benchmarks to quantify the difference.  So today, I finally did some benchmarks.  Hopefully, these can be useful independent of the schema debate.  All times are
reported in GetTickCount() values for a Pentium II 400 running Windows 2000 Professional, with code compiled using VC6 with the default Release settings.

First was a timing of 600000 floating point comparisons against 0 (the 0 was converted or parsed outside the loop).  All values were within the range and precision of double, and no NaNs, infinities or
illegal lexicals were in the test set.

a) 600000 Unicode strings compared using a home-grown lexical comparison: 300 ticks
b) Equivalent ASCII strings converted using atof() and compared: 5308 ticks
c) Equivalent ASCII strings converted using sscanf() and compared: 6379 ticks
d) Unicode strings converted using VarR8FromStr (COM Automation support routine) and compared: 1552 ticks

In the time I had allotted, I wasn't successful in benchmarking conversion using std::basic_istream<XMLCh>.
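For reference, a harness roughly along these lines would produce comparable GetTickCount() figures.  This is an illustrative reconstruction rather than the actual benchmark code; it reuses the lexicalCompare sketch above, and for simplicity it feeds the same narrow strings to both loops instead of the Unicode/ASCII split used in the real test:

    #include <windows.h>   // GetTickCount()
    #include <cstdio>
    #include <cstdlib>     // atof()
    #include <string>

    int lexicalCompare(const std::string& a, const std::string& b);  // sketched above

    // Illustrative harness: time bound checks against a preconverted 0,
    // once via atof() conversion and once via the lexical routine.
    // "values"/"count" stand in for the real test data, which is assumed
    // non-negative to match the sketch's limitations.
    void timeBoundChecks(const char* const* values, int count)
    {
        const double bound = 0.0;      // converted once, outside the loop
        volatile int below = 0;        // keep the loops from being optimized away

        DWORD start = GetTickCount();
        for (int i = 0; i < count; ++i)
            if (atof(values[i]) < bound)
                ++below;
        printf("atof-based:    %lu ticks\n", (unsigned long)(GetTickCount() - start));

        start = GetTickCount();
        for (int i = 0; i < count; ++i)
            if (lexicalCompare(values[i], "0") < 0)
                ++below;
        printf("lexical-based: %lu ticks\n", (unsigned long)(GetTickCount() - start));
    }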

So conversion was between 5 and 20 times slower than a lexically-based comparison.  VarR8FromStr is much more efficient than the C RTL's atof() function, but at the cost of platform independence.
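For anyone unfamiliar with it, VarR8FromStr lives in the OLE Automation library (oleaut32).  A conversion wrapper might look something like the following sketch; the fixed English locale is my own illustration, chosen so the decimal point is parsed consistently regardless of user settings:

    #include <windows.h>
    #include <oleauto.h>   // VarR8FromStr(); link with oleaut32.lib

    // Sketch of converting a Unicode lexical to double via COM Automation
    // rather than the C RTL.
    bool convertToDouble(OLECHAR* lexical, double& result)
    {
        LCID lcid = MAKELCID(MAKELANGID(LANG_ENGLISH, SUBLANG_ENGLISH_US),
                             SORT_DEFAULT);
        return SUCCEEDED(VarR8FromStr(lexical, lcid, 0, &result));
    }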

atof() or sscanf() will also not give you consistent results if C++'s double is not IEEE on a particular platform, whereas the lexical comparison should give identical results on all platforms.

To put this in perspective relative to parsing time, I benchmarked reading a data file for one of our programs (262000 numbers compared against a preconverted 0) in various modes.

Expat, no numeric comparison: 2610
Expat, lexical comparison: 2804
Expat, VarR8FromStr: 3525
Expat, atof comparison: 5384

Xerces (1_1_0_d05), non-validating, no comparison: 6679
Xerces, validating, no comparison: 6789
Xerces, non-validating, lexical range: 8800
Xerces, validating, lexical: 9000
Xerces, validating, VarR8FromStr: 9900
Xerces, non-validating, VarR8FromStr: 9850

So, using conversion for bounds checking can almost double the parse time for a numerically intensive file.

I'd like to see:

1) Put a real type back in, and make float, double and decimal derived from real.
2) Use lexical validation for real.
3) Use lexical validation as a first (and typically only) phase for validation of the float and double types.  From the lexical validation, you can detect whether you are potentially in an area where rounding
could give you different answers.  Since rounding should affect only a tiny fraction of comparisons, a conversion-based comparison would rarely ever be needed (see the sketch after this list).  Alternatively, min and max
constraints could be explicitly stated in the schema draft to be performed lexically.
4) Don't depend on the C RTL for conversion of double for type-aware DOMs, etc.
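To illustrate point 3, here is a rough sketch of what the two-phase check could look like.  The digit-count test is a crude stand-in for a real "could rounding matter here?" check, and plain decimal lexicals without exponents, within the range of double, are assumed:

    #include <string>
    #include <cstdlib>   // atof()

    int lexicalCompare(const std::string& a, const std::string& b);  // sketched earlier

    // Crude stand-in: count significant digits in a plain decimal lexical.
    // Two distinct decimals of 15 or fewer significant digits convert to
    // distinct doubles in the same order, so the lexical answer is safe.
    static int significantDigits(const std::string& s)
    {
        int count = 0;
        bool seenNonZero = false;
        for (std::string::size_type i = 0; i < s.size(); ++i) {
            char c = s[i];
            if (c < '0' || c > '9') continue;   // skip '.', sign, etc.
            if (c != '0') seenNonZero = true;
            if (seenNonZero) ++count;
        }
        return count;
    }

    // Two-phase minExclusive check: decide lexically on the fast path and
    // fall back to conversion only when rounding could change the answer.
    bool satisfiesMinExclusive(const std::string& value, const std::string& bound)
    {
        if (significantDigits(value) <= 15 && significantDigits(bound) <= 15)
            return lexicalCompare(value, bound) > 0;          // typical case

        // Borderline: mimic IEEE double behaviour by converting both sides.
        return atof(value.c_str()) > atof(bound.c_str());
    }

Since very few real-world lexicals carry more than 15 significant digits, the conversion path would almost never run.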

I'll make my lexical comparison code available to the Apache project, if anyone is interested.

Received on Tuesday, 15 February 2000 19:20:16 UTC