Re: sml schema regular expression performance from Michael Kay on 2012-09-14 (xmlschema-dev@w3.org from September 2012)

From: Michael Kay <mike@saxonica.com>
Date: Fri, 14 Sep 2012 18:11:27 +0100
To: xmlschema-dev@w3.org
Message-ID: <5053653F.7000701@saxonica.com>

I don't understand what you are saying.

XML Schema is a specification, not an implementation. The regex language 
it defines has no differences from other popular regex languages that 
would make it less efficient.

The decision to anchor regexes by default has nothing to do with 
efficiency or performance: it's all about usability. A typical regex in 
XSD might be one that is used to validate that postcodes take the form 
XX99 99X. The rule that the user wants to enforce is that the string as 
a whole should match the pattern, not that some substring should match 
the pattern. It's very rare to see a validation rule where anchoring 
isn't appropriate.

Clearly there is no difference in performance between using a regular 
expression [A-Z]{3}[0-9]{3} that is implicitly anchored, and using a 
regular expression ^[A-Z]{3}[0-9]{3}$ that is explicitly anchored. It's 
only a usability difference.

Similarly, in the rare cases where you want an unanchored match, say to 
test that a string contains at least one occurrence of "(...)", a 
processor that chooses to do so can easily recognize the pattern 
".*\(.*\).*" and strip off the leading and trailing ".*" and then do an 
unanchored match if it thinks that the regex library will be able to 
process it more quickly this way. Whether this is the case is of course 
likely to vary from one regex engine to another.

In practice XSD regexes are usually used to validate fairly short 
strings, so it's very rare to experience performance problems. Some 
users do put monstrous regular expressions in their schemas, but 
performance depends much more on the length of the input string than on 
the complexity of the regex. Paradoxically, because problems are rare, 
and regex evaluation isn't usually on the critical path, there's 
actually little incentive for XSD implementors to put a lot of effort 
into regex performance improvement.

Michael Kay
Saxonica

On 11/09/2012 23:03, Armishev, Sergey wrote:
>
> Hi, XML schema experts.
>
> I have question regarding Regular Expression performance. I received 
> the opinion that XML schema using very inefficient engine.
>
> Below is a statement. Is it TRUE or FALSE and WHY
>
> when it comes to regex, XSD has one thing VERY WRONG:
>
> anchoring regexes is NOT a way to make them perform better!
>
> Ever heard of "first character optimization" in regex engines? Pretty 
> much all engines _on earth_ have that nowadays. And they perform 
> _worse_ when the regex is surrounded with .*.
>
> End of citation.
>
> What is your take on this?
>
> -Sergey
>

Received on Friday, 14 September 2012 17:11:51 UTC