- From: <noah_mendelsohn@us.ibm.com>
- Date: Thu, 19 Oct 2006 14:27:12 -0400
- To: Boris Kolpackov <boris@codesynthesis.com>
- Cc: "'Boris Kolpackov'" <boris@codesynthesis.com>, Michael Kay <mike@saxonica.com>, xmlschema-dev@w3.org
- Message-ID: <OFCDC69AF6.669E84E8-ON8525720C.0064A7F9-8525720C.00655E0F@lotus.com>
Boris Kolpackov writes:

> I think you would agree that values of datatypes which require
> examination of every character in order to be validated (e.g., numbers,
> token, name, enum, regex) tend to be rather short. So the ratio of data
> that will need to be examined character-by-character to the total XML
> document size should generally be rather small.

This seems to me a slightly odd way of splitting things. The real
question is, I think: what's your input format? If it's a serialized XML
document, say in UTF-8 or UTF-16, then to make sure it's well formed and
valid, you have to touch each character at least once. Indeed, the whole
point of our earlier-referenced XML Screamer work was to make sure you can
come as close as possible to touching each such character no more than
once.

It is true that on certain CPU architectures there are string compare
instructions that may run somewhat faster than even carefully written
character-by-character loops. With great care, those instructions can
sometimes be brought to bear on, for example, making sure an end tag
matches a start tag. Sometimes you can use scanning instructions to scan
more quickly for the delimiters that terminate a string or CDATA, but
it's rare to find scan instructions that are both flexible enough to match
all needed delimiters and fast enough to be worth using.

Overall, I think it's reasonable to assume that a parser needs to inspect
each character of its input at least once, and preferably no more than
once. That's true in principle whether you're checking element names,
angle brackets, CDATA content, integers, or input that matches a regular
expression pattern. I think our work on XML Screamer shows that, with
care, you can come interestingly close to the ideal of touching each
character just once, but it takes some care.

Of course, the whole analysis changes if what you're validating is not a
real input document, but some form of Infoset. If you've already built a
DOM, for example, then the additional work involved in doing validation
will depend a lot on the ratio of markup to text content, on the simple
types used, and especially on whether you've done something like
"interning" the strings of the element names as you built your DOM. If
you've done that, then you can validate element names by just comparing
their interned keys, which are probably just integers or pointers.
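To make the interning idea concrete, here is a rough, hypothetical sketch
(hand-written for this note; `NameTable` and `intern` are made-up names,
and this is not code from XML Screamer or any real parser):

    #include <cstdio>
    #include <string>
    #include <map>

    // Hypothetical illustration only: a tiny name-interning table.
    class NameTable
    {
    public:
        // Returns a small integer key; the same name always maps
        // to the same key.
        int intern (const std::string& name)
        {
            std::map<std::string, int>::iterator i = ids_.find (name);
            if (i != ids_.end ())
                return i->second;
            int id = static_cast<int> (ids_.size ());
            ids_[name] = id;
            return id;
        }

    private:
        std::map<std::string, int> ids_;
    };

    int main ()
    {
        NameTable names;

        // Intern each name once, while compiling the schema and
        // while building the DOM...
        int expected = names.intern ("purchaseOrder"); // from the schema
        int seen     = names.intern ("purchaseOrder"); // from the document

        // ...and element-name validation becomes a single integer
        // compare, independent of the length of the name.
        std::printf ("names match: %s\n",
                     expected == seen ? "yes" : "no");
        return 0;
    }

The point is only that, once the strings are interned, the per-element
cost of structure validation no longer depends on the length of the names.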
Noah

--------------------------------------
Noah Mendelsohn
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------

Boris Kolpackov <boris@codesynthesis.com>
Sent by: xmlschema-dev-request@w3.org
10/18/2006 08:12 AM

To: Michael Kay <mike@saxonica.com>
cc: "'Boris Kolpackov'" <boris@codesynthesis.com>, xmlschema-dev@w3.org,
(bcc: Noah Mendelsohn/Cambridge/IBM)
Subject: Re: [ANN] XSDBench XML Schema Benchmark 1.0.0 released

Hi Michael,

Michael Kay <mike@saxonica.com> writes:

> You say:
>
> "We expect that in most applications the structure validation
> overhead will greatly outweigh that of the content validation."
>
> Why do you expect that? I would have expected exactly the opposite. A
> process that makes one decision per element node in the document is
> surely likely to be faster than one that has to examine each character.

I think you would agree that values of datatypes which require
examination of every character in order to be validated (e.g., numbers,
token, name, enum, regex) tend to be rather short. So the ratio of data
that will need to be examined character-by-character to the total XML
document size should generally be rather small.

I compared the results of the test for Xerces-C++ with validation enabled
and disabled (remember, the schema does not use anything except xsd:string,
so it is "pure" structure validation). It turned out that about 60% of the
time is spent on XML parsing and 40% on structure validation. Now if we
could compare XML parsing to content validation, we could get an idea of
whether structure validation is more expensive. I would say (proper) XML
parsing would be a lot more expensive than content validation because:

1) The XML parser has to examine the whole document
   character-by-character, which is a lot more than what will be
   validated (see above, and the sketch following this message).

2) The XML parser will need to convert the XML document encoding to the
   parser's internal encoding (in the case of Xerces-C++, from UTF-8 to
   UTF-16).

3) The XML parser will need to allocate memory for element/attribute
   names and their values. Most of content validation can happen without
   allocating any extra memory.

> Of course it may be true that most of the content is xs:string, but who
> knows.

I think most of the content is strings, numbers, and enums/regexes now
and then. But I agree it is all pure speculation until we run some tests.

-boris

--
Boris Kolpackov
Code Synthesis Tools CC
http://www.codesynthesis.com
tel: +27 76 1672134 fax: +27 21 5526869
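For concreteness, here is a minimal, hypothetical sketch of the kind of
per-character check described in point 1 above (`is_integer` is made up
for illustration; this is not code from Xerces-C++ or XSDBench). Each
character of the short lexical value is touched exactly once:

    #include <cstdio>
    #include <cstring>

    // Hypothetical illustration only: per-character validation of an
    // optionally signed decimal integer. Each character is examined once,
    // and no extra memory is allocated.
    bool is_integer (const char* s, std::size_t n)
    {
        std::size_t i = 0;
        if (i < n && (s[i] == '+' || s[i] == '-'))
            ++i;
        if (i == n) // an empty string or a bare sign is not a number
            return false;
        for (; i < n; ++i)
            if (s[i] < '0' || s[i] > '9')
                return false;
        return true;
    }

    int main ()
    {
        std::printf ("%d %d\n",
                     (int) is_integer ("-42", std::strlen ("-42")),
                     (int) is_integer ("4x2", std::strlen ("4x2")));
        return 0;
    }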
Received on Thursday, 19 October 2006 18:27:26 UTC