RE: sml schema regular expression performance from Liam R E Quin on 2012-09-14 (xmlschema-dev@w3.org from September 2012)

From: Liam R E Quin <liam@w3.org>
Date: Fri, 14 Sep 2012 15:52:19 -0400
To: "Armishev, Sergey" <sarmishev@idirect.net>
Cc: Michael Kay <mike@saxonica.com>, "xmlschema-dev@w3.org" <xmlschema-dev@w3.org>
Message-ID: <1347652339.560.53.camel@localhost.localdomain>

On Fri, 2012-09-14 at 19:09 +0000, Armishev, Sergey wrote:

> The argument against XML Schema regular expression performance that I
> cited is that such a flavor can't use "first character optimization".
> Can somebody compare this "first character optimization" with
> "efficient text-directed engines"?

To clarify further - this is simply not an issue in practice.

If XSD regular expressions were not implicitly anchored, nearly all of
them in real-life schemas would start with ^ and end with $ anyway.

It's very rare to say, "this value is only valid if it contains a
decimal point" and very common to say "this value must contain exactly
one decimal point".
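A minimal sketch of that difference, using Python's re module as a
stand-in (it is not an XSD regex engine, and the pattern and sample
values are made up for illustration). re.fullmatch behaves like XSD's
implicit anchoring; re.search looks for the pattern anywhere in the
value:

```python
import re

# Hypothetical pattern: digits, exactly one decimal point, digits.
DECIMAL = re.compile(r"[0-9]+\.[0-9]+")

def xsd_style_match(value):
    """Whole-value match, like an implicitly anchored XSD pattern."""
    return DECIMAL.fullmatch(value) is not None

def unanchored_match(value):
    """Match anywhere in the value, as an unanchored engine would."""
    return DECIMAL.search(value) is not None

samples = ["3.14", "price: 3.14 USD", "1.2.3", "42"]
results = {v: (xsd_style_match(v), unanchored_match(v)) for v in samples}
for v, (anchored, unanchored) in results.items():
    print(f"{v!r}: anchored={anchored} unanchored={unanchored}")
```

The unanchored form accepts "price: 3.14 USD" and "1.2.3", which is
rarely what a schema author wants for a typed value.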

So there is no large performance difference in practice, because by the
time the regex engine sees it, it's anchored.

In addition, validation is typically going to involve compiling, say, 20
to 500 regular expressions and running each of them (sometimes tens of
thousands of times, sometimes only once) on strings that are often two,
three, or maybe five to twenty characters long.

The savings might or might not be worth the cost of the extra I/O in
handling the ^ and $ signs :-) although you could go measure.
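One rough way to go measure, again assuming Python's re module as a
stand-in for a real XSD validator. The pattern, the sample values, and
the iteration count are all hypothetical, and the numbers will vary by
engine and machine, so treat this as a sketch rather than a benchmark:

```python
import re
import timeit

# Hypothetical pattern and short values, sized like typical schema data.
IMPLICIT = re.compile(r"[0-9]+\.[0-9]+")    # anchored by calling fullmatch
EXPLICIT = re.compile(r"^[0-9]+\.[0-9]+$")  # anchors written in the pattern
VALUES = ["3.14", "42", "0.5", "12345.678", "1.2.3", "abc"]

def run_implicit():
    return [IMPLICIT.fullmatch(v) is not None for v in VALUES]

def run_explicit():
    return [EXPLICIT.search(v) is not None for v in VALUES]

# Both approaches should accept and reject the same values.
same_answers = run_implicit() == run_explicit()

t_implicit = timeit.timeit(run_implicit, number=20000)
t_explicit = timeit.timeit(run_explicit, number=20000)
print(f"same answers: {same_answers}")
print(f"implicit: {t_implicit:.4f}s, explicit ^...$: {t_explicit:.4f}s")
```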

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Co-perpetrator, 5th edition of "Beginning XML", Wrox, July 2012
http://www.holoweb.net/~liam/ - the barefoot typographer

Received on Friday, 14 September 2012 19:52:55 UTC