- From: Boris Kolpackov <boris@codesynthesis.com>
- Date: Wed, 18 Oct 2006 14:12:52 +0200
- To: Michael Kay <mike@saxonica.com>
- Cc: "'Boris Kolpackov'" <boris@codesynthesis.com>, xmlschema-dev@w3.org
- Message-ID: <20061018121252.GB27505@karelia>
Hi Michael,

Michael Kay <mike@saxonica.com> writes:

> You say:
>
> "We expect that in most applications the structure validation
> overhead will greatly outweigh that of the content validation."
>
> Why do you expect that? I would have expected exactly the opposite. A
> process that make one decision per element node in the document is surely
> likely to be faster than one that has to examine each character.

I think you would agree that values of datatypes which require
examination of every character in order to be validated (e.g., numbers,
token, name, enum, regex) tend to be rather short. So the ratio of data
that needs to be examined character-by-character to the total XML
document size should generally be rather small.

I compared the results of the test for Xerces-C++ with validation
enabled and disabled (remember, the schema does not use anything except
xsd:string, so it is "pure" structure validation). It turned out that
about 60% of the time is spent on XML parsing and 40% on structure
validation. Now, if we could compare XML parsing to content validation,
we could get an idea of whether structure validation is more expensive.

I would say (proper) XML parsing would be a lot more expensive than
content validation because:

1) The XML parser has to examine the whole document character-by-character,
   which is a lot more than what will be validated (see above).

2) The XML parser needs to convert the XML document encoding to the
   parser's internal encoding (in the case of Xerces-C++, from UTF-8
   to UTF-16).

3) The XML parser needs to allocate memory for element/attribute names
   and their values. Most of content validation can happen without
   allocating any extra memory.

> Of course it may be true that most of the content is xs:string, but who
> knows.

I think most of the content is string, numbers and enums/regexes now
and then. But I agree it is all pure speculation until we run some tests.
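
To make point 3 a bit more concrete, here is a minimal sketch of what
allocation-free content validation could look like. It is not taken from
Xerces-C++ or any other validator; the names is_integer_lexical and
is_one_of are hypothetical, and the checks are simplified (no range
checks, no whitespace normalization). Both functions only read the
character range handed to them by the parser.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper (illustration only): returns true if the range
// [s, s + n) consists of an optional sign followed by one or more
// decimal digits. Range checks and whitespace handling are omitted.
inline bool
is_integer_lexical (const char* s, std::size_t n)
{
  std::size_t i = 0;

  // Optional leading sign.
  if (i < n && (s[i] == '+' || s[i] == '-'))
    ++i;

  // At least one digit must follow.
  if (i == n)
    return false;

  for (; i < n; ++i)
    if (s[i] < '0' || s[i] > '9')
      return false;

  return true;
}

// Hypothetical helper (illustration only): returns true if the range
// [s, s + n) is equal to one of the `count` allowed enumeration values.
inline bool
is_one_of (const char* s, std::size_t n,
           const char* const* values, std::size_t count)
{
  for (std::size_t i = 0; i < count; ++i)
    if (std::strlen (values[i]) == n && std::memcmp (values[i], s, n) == 0)
      return true;

  return false;
}
```

Each call touches only the (typically short) value being validated and
allocates nothing, which is the property the argument above relies on.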

-boris

-- 
Boris Kolpackov
Code Synthesis Tools CC
http://www.codesynthesis.com
tel: +27 76 1672134
fax: +27 21 5526869

Received on Wednesday, 18 October 2006 12:20:24 UTC