Re: sml schema regular expression performance

First-character optimization is all about finding a place in the string 
where it's worth attempting a match. As such, it doesn't apply to XSD 
regular expressions, because there is only one place you are allowed to 
attempt a match - namely, at the beginning of the string.

Michael Kay
Saxonica

On 14/09/2012 20:09, Armishev, Sergey wrote:
>
> Thank you, Michael for very detailed answer. Still would like to know 
> about performance. XSD schema does defines its own flavor 
> http://www.regular-expressions.info/xml.html and the statement is
>
> Compared with other regular expression flavors 
> <http://www.regular-expressions.info/refflavors.html>, the XML schema 
> flavor is quite limited in features. Since it's only used to validate 
> whether an entire element matches a pattern or not, rather than for 
> extracting matches from large blocks of data, you won't really miss 
> the features often found in other flavors. The limitations allow 
> schema validators to be implemented with efficienttext-directed 
> engines <http://www.regular-expressions.info/engine.html>.
>
> The arguments against XML schema regular expression performance that I 
> cited is that such flavor can't use "first character optimization" . 
> Somebody can compare this "first character optimization"  versus 
> "efficient text-directed engines"  ?
>
> -Sergey
>
> *From:*Michael Kay [mailto:mike@saxonica.com]
> *Sent:* Friday, September 14, 2012 1:11 PM
> *To:* xmlschema-dev@w3.org
> *Subject:* Re: sml schema regular expression performance
>
>
> I don't understand what you are saying.
>
> XML Schema is a specification, not an implementation. The regex 
> language it defines has no differences from other popular regex 
> languages that would make it less efficient.
>
> The decision to anchor regexes by default has nothing to do with 
> efficiency or performance: it's all about usability. A typical regex 
> in XSD might be one that is used to validate that postcodes take the 
> form XX99 99X. The rule that the user wants to enforce is that the 
> string as a whole should match the pattern, not that some substring 
> should match the pattern. It's very rare to see a validation rule 
> where anchoring isn't appropriate.
>
> Clearly there is no difference in performance between using a regular 
> expression [A-Z]{3}[0-9]{3} that is implicitly anchored, and using a 
> regular expression ^[A-Z]{3}[0-9]{3}$ that is explicitly anchored. 
> It's only a usability difference.
>
> Similarly, in the rare cases where you want an unanchored match, say 
> to test that a string contains at least one occurrence of "(...)", a 
> processor that chooses to do so can easily recognize the pattern 
> ".*\(.*\).*" and strip off the leading and trailing ".*" and then do 
> an unanchored match if it thinks that the regex library will be able 
> to process it more quickly this way. Whether this is the case is of 
> course likely to vary from one regex engine to another.
>
> In practice XSD regexes are usually used to validate fairly short 
> strings, so it's very rare to experience performance problems. Some 
> users do put monstrous regular expressions in their schemas, but 
> performance depends much more on the length of the input string than 
> on the complexity of the regex. Paradoxically, because problems are 
> rare, and regex evaluation isn't usually on the critical path, there's 
> actually little incentive for XSD implementors to put a lot of effort 
> into regex performance improvement.
>
> Michael Kay
> Saxonica
>
> On 11/09/2012 23:03, Armishev, Sergey wrote:
>
>     Hi, XML schema experts.
>
>     I have question regarding Regular Expression performance. I
>     received the opinion that XML schema using very inefficient engine.
>
>     Below is a statement. Is it TRUE or FALSE and WHY
>
>     when it comes to regex, XSD has one thing VERY WRONG:
>
>     anchoring regexes is NOT a way to make them perform better!
>
>     Ever heard of "first character optimization" in regex engines?
>     Pretty much all engines _on earth_ have that nowadays. And they
>     perform _worse_ when the regex is surrounded with .*.
>
>     End of citation.
>
>     What is your take on this?
>
>     -Sergey
>

Received on Friday, 14 September 2012 19:48:52 UTC