Re: Algorithm for merging the pattern facets in a base simpleType with a subtype? from C. M. Sperberg-McQueen on 2011-04-25 (xmlschema-dev@w3.org from April 2011)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Mon, 25 Apr 2011 16:20:46 -0600
To: Michael Kay <mike@saxonica.com>
Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, xmlschema-dev@w3.org
Message-Id: <715398FB-905A-4615-B284-470021F985F6@blackmesatech.com>

On Apr 22, 2011, at 11:13 AM, Michael Kay wrote:

>
>> Since regular languages are closed under intersection,
>> the result of a logical 'and' of two regexes is a regular
>> language. And it will have an expression using XSD regular
>> expressions.
>
> Interesting. Is there a general algorithm for deriving that expression? In the absence of the "&&" and lookahead operators, it doesn't seem easy.

All the treatments of regular languages I've seen tend to handle
this kind of problem using automata, not regular expressions, so the
easiest cut at a general algorithm is:

(1) generate finite-state automata A1, A2, ..., one for each regular
expression E1, E2, ...

(2) from the old formal-language-and-automata textbook on your
shelf, if you have one, look up the algorithm for calculating a
finite-state automaton that recognizes the intersection of the
languages recognized by some set of finite-state automata.

It's been a while since I read the discussion of this topic in a
reliable textbook, but working from memory and not trying to do
anything clever to optimize anything, I think it works roughly like
this:

- Make the existing automata complete (i.e. add an error state if
they don't already have one), so the transition function is
defined for all state, input-symbol pairs.

- Take the new set of states from the Cartesian product of the
states of the input automata. (If you have two input automata,
make a new state for each possible pairing between a state of A1
and a state of A2.)

- Make the new transition function from the existing transition
functions. If you're in state (X,Y), with input symbol S, then
the next state is (A1(X,S), A2(Y,S)).

You're now done, but for practical purposes you probably want to
minimize the automaton -- the simple algorithm just described
produces a lot more states than you need.

(3) From the resulting automaton generate a regular expression.
Oddly, I haven't seen this described in the textbooks I've looked
at, but Henry Thompson and I searched the Web on the topic once and
found an algorithm that's easy enough to understand.

So, yes, I think there's a general algorithm. There are several
reasons you may not want to use it, however:

(1) Explicitly calculating the FSA in step 2 may not be fast. For
input FSAs with N and M states respectively, you're looking at N*M
time and space.

(2) Minimizing the FSA in step 2 is often not fast; you're looking,
if I recall correctly, at (N*M) ** 2 time. The textbooks tell me
that editors that handle regexes usually don't bother determinizing
their automata, because it's faster to handle the non-determinism
than it would be to determinize the automaton and then do the
search.

Of course, for a compiled schema to be used a lot, the relative
benefits might change.

(3) The regex you get out of step 3 is apt to be more verbose than
strictly necessary.

If I understand correctly, the state of play is that you're
guaranteed to be able to get a regex, but getting a 'good' regex, a
nice clear one that captures the language cleanly, would be very
difficult even if we had a clear way to measure the relative
goodness of different regular expressions for the same language,
which we don't. The documentation for the old Grail package
developed by Derick Wood and Darrell Raymond made a virtue of this:
by taking a regex, building an FSA, and then making a regex from
that FSA, it's possible to get a much more complicated regex that
denotes a known and well understood language. By running through
that cycle more than once, it's easy to expand a one-line regex by a
factor of forty or a hundred; they reckoned it was a good way to
test regex-handling routines.

Of course, this only matters if you actually plan on showing the
regex to any human beings.

I hope this is helpful; I can provide more detail if needed.

--
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com
* http://cmsmcq.com/mib
* http://balisage.net
****************************************************************

Received on Monday, 25 April 2011 22:21:13 UTC