- From: <noah_mendelsohn@us.ibm.com>
- Date: Thu, 19 Oct 2006 14:27:12 -0400
- To: Boris Kolpackov <boris@codesynthesis.com>
- Cc: "'Boris Kolpackov'" <boris@codesynthesis.com>, Michael Kay <mike@saxonica.com>, xmlschema-dev@w3.org
- Message-ID: <OFCDC69AF6.669E84E8-ON8525720C.0064A7F9-8525720C.00655E0F@lotus.com>
Boris Kolpackov writes:

> I think you would agree that values of datatypes which require
> examination of every character in order to be validated (e.g., numbers,
> token, name, enum, regex) tend to be rather short. So the ratio of data
> that will need to be examined character-by-character to the total XML
> document size should generally be rather small.

This seems to me a slightly odd way of splitting things. The real
question is, I think: what's your input format? If it's a serialized XML
document, say in UTF-8 or UTF-16, then to make sure it's well formed and
valid, you have to touch each character at least once. Indeed, the whole
point of our earlier-referenced XML Screamer work was to make sure you can
come as close as possible to touching each such character no more than
once.

It is true that on certain CPU architectures there are string compare
instructions that may run somewhat faster than even carefully written
character-by-character loops. With great care, those instructions can
sometimes be brought to bear on, for example, making sure an end tag
matches a start tag. Sometimes you can use scanning instructions to scan
more quickly for the delimiters that terminate a string or CDATA, but
it's rare to find scan instructions that are both flexible enough to match
all needed delimiters and fast enough to be worth using.

Overall, I think it's reasonable to assume that a parser needs to inspect
each character of its input at least once, and preferably no more than
once. That's true in principle whether you're checking element names,
angle brackets, CDATA content, integers, or input that matches a regular
expression pattern. I think our work on XML Screamer shows that, with
care, you can come interestingly close to the ideal of touching each
character just once, but it takes some care.

Of course, the whole analysis changes if what you're validating is not a
real input document, but some form of Infoset. If you've already built a
DOM, for example, then the additional work involved in doing validation
will depend a lot on the ratio of markup to text content, on the simple
types used, and especially on whether you've done something like
"interning" the strings of the element names as you built your DOM. If
you've done that, then you can validate element names by just comparing
their interned keys, which are probably just integers or pointers.
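To make the interning idea concrete, here is a rough, hypothetical sketch
(hand-written for this note; `NameTable` and `intern` are made-up names,
and this is not code from XML Screamer or any real parser):

    #include <cstdio>
    #include <string>
    #include <map>

    // Hypothetical illustration only: a tiny name-interning table.
    class NameTable
    {
    public:
        // Returns a small integer key; the same name always maps
        // to the same key.
        int intern (const std::string& name)
        {
            std::map<std::string, int>::iterator i = ids_.find (name);
            if (i != ids_.end ())
                return i->second;
            int id = static_cast<int> (ids_.size ());
            ids_[name] = id;
            return id;
        }

    private:
        std::map<std::string, int> ids_;
    };

    int main ()
    {
        NameTable names;

        // Intern each name once, while compiling the schema and
        // while building the DOM...
        int expected = names.intern ("purchaseOrder"); // from the schema
        int seen     = names.intern ("purchaseOrder"); // from the document

        // ...and element-name validation becomes a single integer
        // compare, independent of the length of the name.
        std::printf ("names match: %s\n",
                     expected == seen ? "yes" : "no");
        return 0;
    }

The point is only that, once the strings are interned, the per-element
cost of structure validation no longer depends on the length of the names.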
Noah

--------------------------------------
Noah Mendelsohn
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------

Boris Kolpackov <boris@codesynthesis.com>
Sent by: xmlschema-dev-request@w3.org
10/18/2006 08:12 AM

To: Michael Kay <mike@saxonica.com>
cc: "'Boris Kolpackov'" <boris@codesynthesis.com>, xmlschema-dev@w3.org,
(bcc: Noah Mendelsohn/Cambridge/IBM)
Subject: Re: [ANN] XSDBench XML Schema Benchmark 1.0.0 released

Hi Michael,

Michael Kay <mike@saxonica.com> writes:

> You say:
>
> "We expect that in most applications the structure validation
> overhead will greatly outweigh that of the content validation."
>
> Why do you expect that? I would have expected exactly the opposite. A
> process that makes one decision per element node in the document is
> surely likely to be faster than one that has to examine each character.

I think you would agree that values of datatypes which require
examination of every character in order to be validated (e.g., numbers,
token, name, enum, regex) tend to be rather short. So the ratio of data
that will need to be examined character-by-character to the total XML
document size should generally be rather small.

I compared the results of the test for Xerces-C++ with validation enabled
and disabled (remember, the schema does not use anything except xsd:string,
so it is "pure" structure validation). It turned out that about 60% of the
time is spent on XML parsing and 40% on structure validation. Now if we
could compare XML parsing to content validation, we could get an idea of
whether structure validation is more expensive. I would say (proper) XML
parsing would be a lot more expensive than content validation because:

1) The XML parser has to examine the whole document
   character-by-character, which is a lot more than what will be
   validated (see above, and the sketch following this message).

2) The XML parser will need to convert the XML document encoding to the
   parser's internal encoding (in the case of Xerces-C++, from UTF-8 to
   UTF-16).

3) The XML parser will need to allocate memory for element/attribute
   names and their values. Most of content validation can happen without
   allocating any extra memory.

> Of course it may be true that most of the content is xs:string, but who
> knows.

I think most of the content is strings, numbers, and enums/regexes now
and then. But I agree it is all pure speculation until we run some tests.

-boris

--
Boris Kolpackov
Code Synthesis Tools CC
http://www.codesynthesis.com
tel: +27 76 1672134 fax: +27 21 5526869
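For concreteness, here is a minimal, hypothetical sketch of the kind of
per-character check described in point 1 above (`is_integer` is made up
for illustration; this is not code from Xerces-C++ or XSDBench). Each
character of the short lexical value is touched exactly once:

    #include <cstdio>
    #include <cstring>

    // Hypothetical illustration only: per-character validation of an
    // optionally signed decimal integer. Each character is examined once,
    // and no extra memory is allocated.
    bool is_integer (const char* s, std::size_t n)
    {
        std::size_t i = 0;
        if (i < n && (s[i] == '+' || s[i] == '-'))
            ++i;
        if (i == n) // an empty string or a bare sign is not a number
            return false;
        for (; i < n; ++i)
            if (s[i] < '0' || s[i] > '9')
                return false;
        return true;
    }

    int main ()
    {
        std::printf ("%d %d\n",
                     (int) is_integer ("-42", std::strlen ("-42")),
                     (int) is_integer ("4x2", std::strlen ("4x2")));
        return 0;
    }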
Received on Thursday, 19 October 2006 18:27:26 UTC