From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>

Date: Mon, 25 Apr 2011 16:20:46 -0600

Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, xmlschema-dev@w3.org

Message-Id: <715398FB-905A-4615-B284-470021F985F6@blackmesatech.com>

To: Michael Kay <mike@saxonica.com>

Date: Mon, 25 Apr 2011 16:20:46 -0600

Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, xmlschema-dev@w3.org

Message-Id: <715398FB-905A-4615-B284-470021F985F6@blackmesatech.com>

To: Michael Kay <mike@saxonica.com>

On Apr 22, 2011, at 11:13 AM, Michael Kay wrote: > >> Since regular languages are closed under intersection, >> the result of a logical 'and' of two regexes is a regular >> language. And it will have an expression using XSD regular >> expressions. > > Interesting. Is there a general algorithm for deriving that expression? In the absence of the "&&" and lookahead operators, it doesn't seem easy. All the treatments of regular languages I've seen tend to handle this kind of problem using automata, not regular expressions, so the easiest cut at a general algorithm is: (1) generate finite-state automata A1, A2, ..., one for each regular expression E1, E2, ... (2) from the old formal-language-and-automata textbook on your shelf, if you have one, look up the algorithm for calculating a finite-state automaton that recognizes the intersection of the languages recognized by some set of finite-state automata. It's been a while since I read the discussion of this topic in a reliable textbook, but working from memory and not trying to do anything clever to optimize anything, I think it works roughly like this: - Make the existing automata complete (i.e. add an error state if they don't already have one), so the transition function is defined for all state, input-symbol pairs. - Take the new set of states from the Cartesian product of the states of the input automata. (If you have two input automata, make a new state for each possible pairing between a state of A1 and a state of A2.) - Make the new transition function from the existing transition functions. If you're in state (X,Y), with input symbol S, then the next state is (A1(X,S), A2(Y,S)). You're now done, but for practical purposes you probably want to minimize the automaton -- the simple algorithm just described produces a lot more states than you need. (3) From the resulting automaton generate a regular expression. Oddly, I haven't seen this described in the textbooks I've looked at, but Henry Thompson and I searched the Web on the topic once and found an algorithm that's easy enough to understand. So, yes, I think there's a general algorithm. There are several reasons you may not want to use it, however: (1) Explicitly calculating the FSA in step 2 may not be fast. For input FSAs with N and M states respectively, you're looking at N*M time and space. (2) Minimizing the FSA in step 2 is often not fast; you're looking, if I recall correctly, at (N*M) ** 2 time. The textbooks tell me that editors that handle regexes usually don't bother determinizing their automata, because it's faster to handle the non-determinism than it would be to determinize the automaton and then do the search. Of course, for a compiled schema to be used a lot, the relative benefits might change. (3) The regex you get out of step 3 is apt to be more verbose than strictly necessary. If I understand correctly, the state of play is that you're guaranteed to be able to get a regex, but getting a 'good' regex, a nice clear one that captures the language cleanly, would be very difficult even if we had a clear way to measure the relative goodness of different regular expressions for the same language, which we don't. The documentation for the old Grail package developed by Derick Wood and Darrell Raymond made a virtue of this: by taking a regex, building an FSA, and then making a regex from that FSA, it's possible to get a much more complicated regex that denotes a known and well understood language. By running through that cycle more than once, it's easy to expand a one-line regex by a factor of forty or a hundred; they reckoned it was a good way to test regex-handling routines. Of course, this only matters if you actually plan on showing the regex to any human beings. I hope this is helpful; I can provide more detail if needed. -- **************************************************************** * C. M. Sperberg-McQueen, Black Mesa Technologies LLC * http://www.blackmesatech.com * http://cmsmcq.com/mib * http://balisage.net ****************************************************************Received on Monday, 25 April 2011 22:21:13 GMT

*
This archive was generated by hypermail 2.2.0+W3C-0.50
: Monday, 25 April 2011 22:21:15 GMT
*