- From: Michael Kay <mike@saxonica.com>
- Date: Fri, 14 Sep 2012 20:48:25 +0100
- To: "Armishev, Sergey" <sarmishev@idirect.net>
- CC: "xmlschema-dev@w3.org" <xmlschema-dev@w3.org>
- Message-ID: <50538A09.4020209@saxonica.com>
First-character optimization is all about finding a place in the string where it's worth attempting a match. As such, it doesn't apply to XSD regular expressions, because there is only one place you are allowed to attempt a match - namely, at the beginning of the string. Michael Kay Saxonica On 14/09/2012 20:09, Armishev, Sergey wrote: > > Thank you, Michael for very detailed answer. Still would like to know > about performance. XSD schema does defines its own flavor > http://www.regular-expressions.info/xml.html and the statement is > > Compared with other regular expression flavors > <http://www.regular-expressions.info/refflavors.html>, the XML schema > flavor is quite limited in features. Since it's only used to validate > whether an entire element matches a pattern or not, rather than for > extracting matches from large blocks of data, you won't really miss > the features often found in other flavors. The limitations allow > schema validators to be implemented with efficienttext-directed > engines <http://www.regular-expressions.info/engine.html>. > > The arguments against XML schema regular expression performance that I > cited is that such flavor can't use "first character optimization" . > Somebody can compare this "first character optimization" versus > "efficient text-directed engines" ? > > -Sergey > > *From:*Michael Kay [mailto:mike@saxonica.com] > *Sent:* Friday, September 14, 2012 1:11 PM > *To:* xmlschema-dev@w3.org > *Subject:* Re: sml schema regular expression performance > > > I don't understand what you are saying. > > XML Schema is a specification, not an implementation. The regex > language it defines has no differences from other popular regex > languages that would make it less efficient. > > The decision to anchor regexes by default has nothing to do with > efficiency or performance: it's all about usability. A typical regex > in XSD might be one that is used to validate that postcodes take the > form XX99 99X. The rule that the user wants to enforce is that the > string as a whole should match the pattern, not that some substring > should match the pattern. It's very rare to see a validation rule > where anchoring isn't appropriate. > > Clearly there is no difference in performance between using a regular > expression [A-Z]{3}[0-9]{3} that is implicitly anchored, and using a > regular expression ^[A-Z]{3}[0-9]{3}$ that is explicitly anchored. > It's only a usability difference. > > Similarly, in the rare cases where you want an unanchored match, say > to test that a string contains at least one occurrence of "(...)", a > processor that chooses to do so can easily recognize the pattern > ".*\(.*\).*" and strip off the leading and trailing ".*" and then do > an unanchored match if it thinks that the regex library will be able > to process it more quickly this way. Whether this is the case is of > course likely to vary from one regex engine to another. > > In practice XSD regexes are usually used to validate fairly short > strings, so it's very rare to experience performance problems. Some > users do put monstrous regular expressions in their schemas, but > performance depends much more on the length of the input string than > on the complexity of the regex. Paradoxically, because problems are > rare, and regex evaluation isn't usually on the critical path, there's > actually little incentive for XSD implementors to put a lot of effort > into regex performance improvement. > > Michael Kay > Saxonica > > On 11/09/2012 23:03, Armishev, Sergey wrote: > > Hi, XML schema experts. > > I have question regarding Regular Expression performance. I > received the opinion that XML schema using very inefficient engine. > > Below is a statement. Is it TRUE or FALSE and WHY > > when it comes to regex, XSD has one thing VERY WRONG: > > anchoring regexes is NOT a way to make them perform better! > > Ever heard of "first character optimization" in regex engines? > Pretty much all engines _on earth_ have that nowadays. And they > perform _worse_ when the regex is surrounded with .*. > > End of citation. > > What is your take on this? > > -Sergey >
Received on Friday, 14 September 2012 19:48:52 UTC