- From: Michael Kay <mike@saxonica.com>
- Date: Fri, 14 Sep 2012 20:48:25 +0100
- To: "Armishev, Sergey" <sarmishev@idirect.net>
- CC: "xmlschema-dev@w3.org" <xmlschema-dev@w3.org>
- Message-ID: <50538A09.4020209@saxonica.com>
First-character optimization is all about finding a place in the string
where it's worth attempting a match. As such, it doesn't apply to XSD
regular expressions, because there is only one place you are allowed to
attempt a match - namely, at the beginning of the string.
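
As a rough illustration of that distinction, here is a minimal Python sketch. It uses Python's `re` module, which is not an XSD regex engine, and the function names `candidate_starts`, `unanchored_search` and `xsd_style_match` are made up for this example. An unanchored search has many possible start positions, so a first-character filter can skip most of them. A whole-string match has exactly one start position, so there is nothing to skip.

```python
import re

# Example pattern borrowed from the quoted message below: three letters, three digits.
PATTERN = re.compile(r"[A-Z]{3}[0-9]{3}")
UPPERCASE = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")


def candidate_starts(text, first_chars):
    """Return the positions whose character could begin a match.

    A crude stand-in for first-character optimization (an assumption
    about how an engine might do it, not a description of any real one).
    """
    return [i for i, ch in enumerate(text) if ch in first_chars]


def unanchored_search(text):
    """Attempt a match only at positions that pass the first-character filter."""
    for start in candidate_starts(text, UPPERCASE):
        m = PATTERN.match(text, start)
        if m:
            return m
    return None


def xsd_style_match(text):
    """Whole-string semantics: the only start position ever tried is 0."""
    return PATTERN.fullmatch(text) is not None
```

With the input `"id: ABC123 ok"`, the unanchored search has three candidate positions to consider and finds the substring, while the whole-string match simply fails because the string as a whole does not fit the pattern.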
Michael Kay
Saxonica
On 14/09/2012 20:09, Armishev, Sergey wrote:
>
> Thank you, Michael, for the very detailed answer. I would still like to
> know about performance. XSD does define its own regex flavor
> http://www.regular-expressions.info/xml.html and the statement there is:
>
> Compared with other regular expression flavors
> <http://www.regular-expressions.info/refflavors.html>, the XML schema
> flavor is quite limited in features. Since it's only used to validate
> whether an entire element matches a pattern or not, rather than for
> extracting matches from large blocks of data, you won't really miss
> the features often found in other flavors. The limitations allow
> schema validators to be implemented with efficient text-directed
> engines <http://www.regular-expressions.info/engine.html>.
>
> The argument against XML Schema regular expression performance that I
> cited is that such a flavor can't use "first character optimization".
> Can somebody compare this "first character optimization" with
> "efficient text-directed engines"?
>
> -Sergey
>
> *From:*Michael Kay [mailto:mike@saxonica.com]
> *Sent:* Friday, September 14, 2012 1:11 PM
> *To:* xmlschema-dev@w3.org
> *Subject:* Re: sml schema regular expression performance
>
>
> I don't understand what you are saying.
>
> XML Schema is a specification, not an implementation. The regex
> language it defines has no differences from other popular regex
> languages that would make it less efficient.
>
> The decision to anchor regexes by default has nothing to do with
> efficiency or performance: it's all about usability. A typical regex
> in XSD might be one that is used to validate that postcodes take the
> form XX99 99X. The rule that the user wants to enforce is that the
> string as a whole should match the pattern, not that some substring
> should match the pattern. It's very rare to see a validation rule
> where anchoring isn't appropriate.
>
> Clearly there is no difference in performance between using a regular
> expression [A-Z]{3}[0-9]{3} that is implicitly anchored, and using a
> regular expression ^[A-Z]{3}[0-9]{3}$ that is explicitly anchored.
> It's only a usability difference.
>
> Similarly, in the rare cases where you want an unanchored match, say
> to test that a string contains at least one occurrence of "(...)", a
> processor that chooses to do so can easily recognize the pattern
> ".*\(.*\).*" and strip off the leading and trailing ".*" and then do
> an unanchored match if it thinks that the regex library will be able
> to process it more quickly this way. Whether this is the case is of
> course likely to vary from one regex engine to another.
>
> In practice XSD regexes are usually used to validate fairly short
> strings, so it's very rare to experience performance problems. Some
> users do put monstrous regular expressions in their schemas, but
> performance depends much more on the length of the input string than
> on the complexity of the regex. Paradoxically, because problems are
> rare, and regex evaluation isn't usually on the critical path, there's
> actually little incentive for XSD implementors to put a lot of effort
> into regex performance improvement.
>
> Michael Kay
> Saxonica
>
> On 11/09/2012 23:03, Armishev, Sergey wrote:
>
> Hi, XML schema experts.
>
> I have a question regarding regular expression performance. I
> received the opinion that XML Schema uses a very inefficient engine.
>
> Below is the statement. Is it TRUE or FALSE, and WHY?
>
> when it comes to regex, XSD has one thing VERY WRONG:
>
> anchoring regexes is NOT a way to make them perform better!
>
> Ever heard of "first character optimization" in regex engines?
> Pretty much all engines _on earth_ have that nowadays. And they
> perform _worse_ when the regex is surrounded with .*.
>
> End of citation.
>
> What is your take on this?
>
> -Sergey
>
Received on Friday, 14 September 2012 19:48:52 UTC