W3C home > Mailing lists > Public > xmlschema-dev@w3.org > September 2012

RE: sml schema regular expression performance

From: Armishev, Sergey <sarmishev@idirect.net>
Date: Fri, 14 Sep 2012 19:09:18 +0000
To: Michael Kay <mike@saxonica.com>, "xmlschema-dev@w3.org" <xmlschema-dev@w3.org>
Message-ID: <434022B42343C94187A29511DFDA554E16A2B993@VAUS-MBX03.idirect.net>
Thank you, Michael for very detailed answer. Still would like to know about performance. XSD schema does defines its own flavor http://www.regular-expressions.info/xml.html and the statement is
Compared with other regular expression flavors<http://www.regular-expressions.info/refflavors.html>, the XML schema flavor is quite limited in features. Since it's only used to validate whether an entire element matches a pattern or not, rather than for extracting matches from large blocks of data, you won't really miss the features often found in other flavors. The limitations allow schema validators to be implemented with efficient text-directed engines<http://www.regular-expressions.info/engine.html>.

The arguments against XML schema regular expression performance that I cited is that such flavor can't use "first character optimization" . Somebody can compare this "first character optimization"  versus "efficient text-directed engines"  ?

-Sergey

From: Michael Kay [mailto:mike@saxonica.com]
Sent: Friday, September 14, 2012 1:11 PM
To: xmlschema-dev@w3.org
Subject: Re: sml schema regular expression performance


I don't understand what you are saying.

XML Schema is a specification, not an implementation. The regex language it defines has no differences from other popular regex languages that would make it less efficient.

The decision to anchor regexes by default has nothing to do with efficiency or performance: it's all about usability. A typical regex in XSD might be one that is used to validate that postcodes take the form XX99 99X. The rule that the user wants to enforce is that the string as a whole should match the pattern, not that some substring should match the pattern. It's very rare to see a validation rule where anchoring isn't appropriate.

Clearly there is no difference in performance between using a regular expression [A-Z]{3}[0-9]{3} that is implicitly anchored, and using a regular expression ^[A-Z]{3}[0-9]{3}$ that is explicitly anchored. It's only a usability difference.

Similarly, in the rare cases where you want an unanchored match, say to test that a string contains at least one occurrence of "(...)", a processor that chooses to do so can easily recognize the pattern ".*\(.*\).*" and strip off the leading and trailing ".*" and then do an unanchored match if it thinks that the regex library will be able to process it more quickly this way. Whether this is the case is of course likely to vary from one regex engine to another.

In practice XSD regexes are usually used to validate fairly short strings, so it's very rare to experience performance problems. Some users do put monstrous regular expressions in their schemas, but performance depends much more on the length of the input string than on the complexity of the regex. Paradoxically, because problems are rare, and regex evaluation isn't usually on the critical path, there's actually little incentive for XSD implementors to put a lot of effort into regex performance improvement.

Michael Kay
Saxonica
On 11/09/2012 23:03, Armishev, Sergey wrote:
Hi, XML schema experts.
I have question regarding Regular Expression performance. I received the opinion that XML schema using very inefficient engine.
Below is a statement. Is it TRUE or FALSE and WHY


when it comes to regex, XSD has one thing VERY WRONG:

anchoring regexes is NOT a way to make them perform better!



Ever heard of "first character optimization" in regex engines? Pretty much all engines _on earth_ have that nowadays. And they perform _worse_ when the regex is surrounded with .*.



End of citation.

What is your take on this?



-Sergey
Received on Friday, 14 September 2012 19:11:00 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Friday, 14 September 2012 19:11:00 GMT