Re: [ANN] XSDBench XML Schema Benchmark 1.0.0 released from Boris Kolpackov on 2006-10-18 (xmlschema-dev@w3.org from October 2006)

From: Boris Kolpackov <boris@codesynthesis.com>
Date: Wed, 18 Oct 2006 14:12:52 +0200
To: Michael Kay <mike@saxonica.com>
Cc: "'Boris Kolpackov'" <boris@codesynthesis.com>, xmlschema-dev@w3.org
Message-ID: <20061018121252.GB27505@karelia>

Hi Michael,

Michael Kay <mike@saxonica.com> writes:

> You say:
>
> "We expect that in most applications the structure validation
>        overhead will greatly outweigh that of the content validation."
>
> Why do you expect that? I would have expected exactly the opposite. A
> process that makes one decision per element node in the document is surely
> likely to be faster than one that has to examine each character.

I think you would agree that values of datatypes which require examination
of every character in order to be validated (e.g., numbers, token, name,
enum, regex) tend to be rather short. So the ratio of data that will need
to be examined character-by-character to the total XML document size should
generally be rather small.

I compared the results of the test for Xerces-C++ with validation enabled
and disabled (remember, the schema does not use anything except xsd:string,
so it is "pure" structure validation). It turned out that about 60% of the
time is spent on XML parsing and 40% on structure validation.
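
As a rough sketch of the arithmetic behind such a split: given one timing
with validation disabled and one with it enabled, the shares follow by
subtraction. The struct, the function name, and the example timings below
are hypothetical and are not taken from the actual test runs; the sketch
also assumes that a validating run costs parsing time plus validation time.

```cpp
#include <cassert>

// Shares of a validating run, derived from two wall-clock timings.
struct Breakdown {
    double parsing;     // fraction of the validating run spent on XML parsing
    double validation;  // fraction of the validating run spent on structure validation
    double overhead;    // validation time relative to the parsing-only time
};

// t_parse_only: time with validation disabled (hypothetical measurement)
// t_validating: time with validation enabled (hypothetical measurement)
Breakdown split(double t_parse_only, double t_validating) {
    assert(t_parse_only > 0.0 && t_validating >= t_parse_only);
    const double t_validation = t_validating - t_parse_only;
    return {t_parse_only / t_validating,
            t_validation / t_validating,
            t_validation / t_parse_only};
}
```

With made-up timings of 6.0 and 10.0 seconds, split(6.0, 10.0) gives a
60%/40% split, and an overhead of about 0.67. In other words, a 60/40 split
means validation adds roughly two thirds on top of the parsing time.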

Now if we could compare XML parsing to content validation, we could get
an idea of whether structure validation is more expensive. I would say
(proper) XML parsing would be a lot more expensive than content
validation because:

 1) The XML parser has to examine the whole document character-by-character,
    which is a lot more than what will be validated (see above).

 2) The XML parser will need to convert the XML document encoding to the
    parser's internal encoding (in the case of Xerces-C++, from UTF-8 to
    UTF-16).

 3) The XML parser will need to allocate memory for element/attribute
    names and their values. Most of content validation can happen
    without allocating any extra memory.
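
To illustrate the last point, here is a minimal sketch (C++17) of a content
check that runs over the caller's buffer without allocating anything. It
tests a value against the xsd:integer lexical form, that is, an optional
sign followed by one or more decimal digits. The function names are
hypothetical, the value is assumed to be whitespace-normalized already, and
this is not how Xerces-C++ or any other particular validator implements it.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Returns true if s matches the xsd:integer lexical form:
// an optional '+' or '-' followed by one or more decimal digits.
// Only reads the caller's buffer; no memory is allocated.
bool is_xsd_integer(std::string_view s) {
    std::size_t i = 0;
    if (i < s.size() && (s[i] == '+' || s[i] == '-')) ++i;
    if (i == s.size()) return false;  // empty value or a sign alone
    for (; i < s.size(); ++i) {
        if (s[i] < '0' || s[i] > '9') return false;
    }
    return true;
}

// Usage example: a few lexical values and their expected verdicts.
void check_examples() {
    assert(is_xsd_integer("42"));
    assert(is_xsd_integer("-7"));
    assert(!is_xsd_integer("+"));
    assert(!is_xsd_integer("12a"));
}
```

The loop touches each character of the value once, so its cost grows with
the length of the value, which is the reason short values keep content
validation cheap.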

> Of course it may be true that most of the content is xs:string, but who
> knows.

I think most of the content is strings, with numbers and enums/regexes now
and then. But I agree it is all pure speculation until we run some tests.

-boris

--
Boris Kolpackov
Code Synthesis Tools CC
http://www.codesynthesis.com
tel: +27 76 1672134
fax: +27 21 5526869

Received on Wednesday, 18 October 2006 12:20:24 UTC