Supporting UPA and Restriction on an Extension of XML Schema

2 Evaluating the Unique Particle Attribute Constraint

We start by having a set of terminals, which we will represent using a - z. These correspond to element names in XML Schema.

A schema regular expression is defined by the following grammar:

particle -> terminal

| particle{m, n} (Where 0 ≤ m ≤ n ≤ unbounded - we will never consider m = unbounded or n = 0)

| (particle₁, ..., particle_n)

| (particle₁| ...| particle_n)

| (particle₁& ...& particle_n)

We will use α - ω to refer to particles generally. This extends traditional regular expressions by adding the &-group and allowing m to take values other than 0 or 1, and n to take values other than 1 and unbounded (exponents {0, 1} corresponds to ?, {0, unbounded} to * and {1, unbounded} to +).

In order to determine if a regular expression follow UPA, we will need the following five properties of a particle:

The particles contained in p, particles(p), where a particle is any terminal or particle recursively defined.
particles(p) => if p = (α₁,...,α_n) then {p} ∪ (∪ⁿ_{i = 1} particles(α_i))
else if p = (α₁&...&α_n) then {p} ∪ (∪ⁿ_{i = 1} particles(α_i))
else if p = (α₁|...|α_n) then {p} ∪ (∪ⁿ_{i = 1} particles(α_i))
else if p = α{m,n} then {p} ∪ particles(α)
else if p is a terminal, {p}
The opacity of a particle, opaque(p). A particle is opaque if it cannot match the empty string. Otherwise it is transparent. This is defined as follows:
opaque(p) => if p = α{0, n}, (α any particle) then false
else if p = (α₁, ..., α_n) or p = (α₁ & ... & α_n) then \/ⁿ_{i = 1} opaque(α_i)
else if p = (α₁| ... | α_n) the /\ⁿ_{i = 1} opaque(α_i)
else if p = α{m, n}, m > 0, then opaque(α)
else if p = a, a a terminal, then true
else if p is the empty sequence, then false.
The first set, first(p), represents all those particles that can match the first terminal in a string matching this particle. We define it as follows:
first(p) => if p = a, a a terminal, then {a}
else if p = α{m, n} then first(α)
else if p = (α₁|...|α_n) or p = (α₁&...&α_n) then ∪ⁿ_{i = 1}first(α_i)
else if p = (α₁,...,α_n) then
if opaque(α₁) then first(α₁)
else first(α₁) ∪ first((α₂,...,α_n))
The follow set, follow(p), represents all those particles that can match a letter in a string immediately following a portion of the string matched by the particle, p. Unlike the other properties defined here, follow(p) can only be defined in terms of the expression containing p. We therefore define follow(Γ, p, s), Γ a schema regular expression and s a set, and define follow(p) = follow(Γ, p, {}), where Γ is the entire regular expression under consideration.
follow(Γ, p, s) => if p = Γ then s
else if Γ = Δ{m, m}, and opaque(Δ) then follow(Δ, p, s)
else if Γ = Δ{m, n}, m < n or not opaque(Δ) then follow(Δ, p, s ∪ first(Δ))
else if Γ = (α₁|...|Δ|...|α_n), where p ∈ particles(Δ) then follow(Δ, p, s)
else if Γ = (α₁&...&Δ&...;&α_n), where p ∈ particles(Δ) then follow(Δ, p, s)
else if Γ = (α₁,...,Δ,α_m...,α_n), where p ∈ particles(Δ) then
if opaque((α_m,...,α_n)) then follow(Δ, p, first((α_m,...,α_n)))
else if not opaque((α_m,...,α_n)) then follow(Δ, p, s ∪ first((α_m,...,α_n)))
The confusion set, confusion(p). This gives the set of particles in p which could be in conflict with follow(p). In other words, α ∈ particles(p) shows up in confusion(p) if there is some particle, β ∈ particles(p), such that follow(β) ⊃ follow(p) and α ∈ follow(β). The set confusion(p) is the union of the follow sets of particle of P having β in their follow sets (i.e., particles that could be confused with a particle in the follow set).
confusion(p) => (∪_{q ∈ particles(p)} follow(p, q, {δ})) - {δ} where not (δ ∈ particles(p)) and δ ∈ follow(p, q, {δ})

As an example, consider the following regular expression:

((a | b),(((c | d) & (e, f{0,1}){2,5}){6,6} | (g & h{0,5})))

In order to better discuss it, let's first subscript every particle in the expression:

((a₀ | b₁)₂,(((c₃ | d₄)₅ & (e₆, f₇{0,1}₈)₉{2,5}₁₀)₁₁{6,6}₁₂ | (g₁₃ & h₁₄{0,5}₁₅)₁₆)₁₇)₁₈

In this we have the entire expression being opaque, but particles 8 and 15 are not. Because the entire expression is a sequence, and the first particle in it (2) is opaque, the first set of the expression is {a, b}. The follow set of 2 - coincidentally the first set of 17 - is the first set of the rest of the sequence. Since it is a choice, it is the union of the first sets of the branches (particles 12 and 16). Both branches are &-groups, so they atain return the union of their first sets, or {c₃, d₄, e₆, g₁₃, h₁₄}. Only some particles have non-empty confusion sets. Particle 9 has {f₇} in its confusion set, as f₇ is in follow(e₆). Because of the optionality, confusion(10) has both e₆ and f₇ in it, as does confusion(12).

2.1 Plan of Attack

Ultimately a regular expression is for matching strings of terminals to determine membership in a language. This is done by matching terminals in the input against successive terminal particles in the expression. A regular expression has the UPA if, for any input string, at any point there is only particle in the regular expression which could be used to match the next item in the input. The regular expression (a, b) has UPA because it will only match an 'a' followed by a 'b'. The regular expression (a{0,1}, a) does not because when the first 'a' shows up, it could match either the first a? particle or the second a particle. Determining if a regular expression has UPA is not difficult if we use standard regular expressions as described above, but numeric exponents complicate the picture. In the first case we need only check that there are no two different particles with the same terminal either in the first set of the expression or in the follow set of any particle in the expression. (As an example of why first sets alone are insufficient, consider (a, a{0,1}, (b | a))). In the latter case, though, we have examples like:

(a₀{4,8},a₁)
(a₀{8,8},a₁)
((a₀,b₁{0,1}){8,8},b₂)
((a₀ & b₁ & (c₂, d₃{0,1})), d₄)
((a₀ & b₁ & c₂), b₃)
(a₀ & b₁ & (c₂, b₃{0,1}))

In the first case, we have a UPA violation because the fourth match of a₀ can be followed by either a fifth repetition or a₁. The follow function rightly puts both in follow(a₀). However, in the second case there is no ambiguity (8 repetitions of a₀ followed by one a₁. But in the third case we again have an ambiguity. On the eighth repetition, a₀ can be followed by either b₁ or b₂ - but if we make example two work, then b₂ is not in follow(a₀).

In the fourth case we have a UPA violation between d₃ and d₄. Therefore we'd want the follow set of the &-group to be in the follow set of the individual branches. However, if we do that, then we'd also need to make sure that the first sets of the other branches are not also included, because otherwise c₂ would have both b₁ and b₃ in its follow set in the fifth example. But that would then miss the UPA violation in the sixth example between b₁ and b₃.

2.2 Sources of UPA violations

Because of the issues with the two problematic constructs of &-groups and numeric exponents of the form {m,m} we need to delve further into aspects of UPA violations. There are two sources of UPA violations involving a particle that are particularly germane to our analysis.

2.2.1 External violations

Given a particle, there are a limited number of reasons that it can cause a UPA violation. Take the particle, P, in the following:

(α, β{0,1}, P, γ) or (α, (β | P), γ) or (α, (β & P), γ)

A violation can only result from a confusion among terminals. Either:

One terminal is before P and one is within P. If so, then the terminal in front of P must be optional (i.e., β). For that to be the case, there must be a member of first(β) that's also in first(P). Note that we only need first(P) to check this.
Both terminals are in P. Then P itself violates UPA regardless of context.
One terminal is in P and one is in follow(P). This means there is an optional terminal, o, somewhere in P, and it is in the follow set of some other terminal, t, whose follow set also includes γ. Terminal t can either be inside P or precede P (such as β). If t is inside P, then o must be in confusion(P). If t precedes P, then P must be optional, and first(P) shares a terminal with follow(P).
Neither terminal is in P. Then P is optional and (in this case) first(β) shares a terminal with first(γ).

Note that if P itself does not violate UPA, then we only need consider its opacity, its first set, and its confusion set.

2.2.2 Internal ambiguity

Suppose we have a particle, P, and P obeys UPA. Under what circumstances does P{m,n} not have UPA? At a minimum, if P{m,n} does not have UPA, then (P, P) does not have it either, although the reverse is not necessarily true (because of repeated particles). But if P obeys UPA, and (P, P) does not, then there must be some particle in P whose follow set would include the first set of the second P, and that follow set already includes a terminal in first(P). By definition, that follow set is included in follow(P). This is invariant with the number of repetitions of P.

Therefore we can test for whether P{m,n} obeys UPA by intersecting first(P) and confusion(P).

Likewise for (α₁ & ... & α_n). Assuming each α_i obeys UPA, either ∪ⁿ_{i = 1}first(α_i) contains the same terminal with different subscripts or there are α_j and α_k, j not = k, where (α_j, α_k) violates UPA. Once again, this means there is some terminal in both confusion(α_j) and first(α_k). However, because a branch in an &-group never follows itself, it is not an error if (α_j, α_j) violates UPA because that would never occur in a single instance of the group - it would only happen if the &-group has an exponent, and then confusion(α_j), being a subset of the confusion set of the group as a whole, would be compared with the first set of the group as a whole, exposing the issue. So we only need compare confusion(α_j) with the first sets of the other branches.

2.3 Final UPA Algorithm

We therefore end up with the following algorithm:

UPA(α) => if α = athen

if b_i ∈ follow(a)and b_j ∈ follow(a) -> i=j then true

else if α = β{m,n} then UPA(β) /\ (first(β) ∩ confusion(β) = {})

else if (α = (β₁,...,β_n) or α = (β₁|...|β_n)) and ∩ⁿ_{i = 1} first(β_i) = {} then /\ⁿ_{i = 1}UPA(β_i)

else if α = (β₁&...& β_n) and ∩ⁿ_{i = 1} first(β_i) = {} then /\ⁿ_i=1(UPA(β_i) /\ (confusion(β_i) ∩ (∪_{j not = i}first(β_j)) = {})

else false.

3 Testing Subsumption

As with UPA testing, new features like numeric exponents and &-groups add to the complexity of testing a subsumption relationship between two regular expressions, where by subsumption we mean all the strings accepted by the subsmed expression are also valid in the subsuming expressing, but not necessarily the reverse.

Without these features, given two regular expressions, α and β, that obey UPA, it is possible to determine whether α subsumes β in a straightforward way. If we consider the set of strings accepted by α L(α) and the set of strings accepted by β, L(β) as subsets of the set of all strings, then L(β) subset L(α) iff L(β) ∩ ~L(α) = {}. In other words, every string in L(β) should also be in L(α). ~L(α) is the set of strings not in L(α), so there should be no string that's both in ~L(α) and L(β). This can be determined by constructing an automaton from each regular expression, inverting the final states of the automaton from α, and constructing the cross-product automaton, an automaton whose states are pairs of states, one from each source automaton. This can be done in time quadratic with the two input automata, as there is only one state which has both start states, and from any state pair and for any terminal, there can be at most one successor state pair. Since we've inverted the accepting states in the one automaton, the subsumption holds if there is no state pair in the resulting automaton where both members of the pair is an accepting state from their respective automata.

We will approach this issue in two stages. We will first extend the above algorithm to handle &-groups and then extend that to additionally handle numeric exponents.

3.1 Subsumption with &-groups

The straightforward subsumption story is complicated by &-groups because, even with UPA, at any point there may be more than one terminal that can match from the other machine. Consider the following pair:

(a{0,1}, (d & c & b), b)
(c, b, d, b)

Clearly, the second expression is subsumed by the first. The a is optional, (b, c, d) is one of the six possible paths through the & group, and b is the final terminal to be matched. However, the c and d branches of the &-group both have two transitions to a b, so the naive algorithm fails. A naive fix would be to break apart the group, listing all the alternatives:

(a{0,1}, ((d , c , b) | (d , b , c) | (c , d , b) | (c , b , d)(b , c , d)(b , d , c)), b)

However, this is exponential (actually O(n!)) in the size of the orignal expression, although the resulting expression still obeys UPA.

Any solution not expanding the content model needs to keep track of which branches of the &-group have been traversed at any point and only move from the &-group when all the required branches have been matched. Such a solution would also need to keep track of nested &-group - unlike choice, &-groups are commutative but not associative; i.e. ((a & b) & (c & d)) is not the same as ((a & c) & (b & d)), as the second allows (a, c, b, c), but the first requires that a be adjacent to b. This would imply the use of a stack.

To summarize, we need a means of indicating all the branches of an &-group such that we can keep track of which branches have been traversed and which have not. Further, this should not require rewriting the content model - either in advance or during execution of the algorithm - as that would return us to exponential behavior.

We will handle each &-group as a bit-vector of the length of the number of branches. We will divide this vector in two parts - the upper part corresponds to the opaque branches and the lower part to the transparent branches. For example, suppose we have the &-group (a? & b & c &d?) and want to see if it subsumes the sequence (b, a, c). We have two opaque branches and two transparent ones. We rewrite this as (b₃ & c₂ & d₁? &a₀?), corresponding to the bit-vector 1111, or 15, for all four branches, or 12, for just the opaque branches. We assign each branch to the corresponding bit. Let us call these the max and min values for the group. On entering the &-group we start with a bit-vector of 0000 - no branches completed. We cannot "leave" the group unless we've matched at least the required branches, so we'd need a bit vector of at least 1100. At each branch check, the bit vector, as a binary value, must increase so we know we've finished a new branch. Transitions corresponding to choices where the appropriate bit is 0 are active. Those wehre the bit is 1 are inactive.

To show the process, we first match b with b₃ and xor the bit-vector with 2³, where the power is the number of the matched branch. This gives a bit vector or 1000. Next we match the a with a₀ and the bit vector becomes 1001. Finally we match the c with c₂, giving a bit-vector of 1101. As 1100 ≤ 1101 ≤ 1111, the subsumption holds.

On the other hand, if we try to check (b, a, d), our final bit-vector is 1011. Since 1011 < 1100, the subsumption doesn't hold. Finally, if we try to check (b, b, d), we'd check the first b and get 1000. Checking the second b, we'd xor 1000 with 1000 with a result of 0000. As 0000 < 1000, we'd know we've tried to match the same branch twice and the subsumption fails.

At this point we need to describe how to build an automaton for an extended regular expression. In addition to the usual transitions, we need to keep track of the following when traversing a terminal:

When entering an &-group, push an empty bit-vector on the stack, or push().
For any state transition from the end of a branch, update the top bit-vector, or update(n), where n is a power of 2. This returns true if the result of an update increases the value of the bit-vector and false otherwise.
For any state transition leaving the group, ensure the bit-vector's value is between the min and the max and then pop it, or test(min, max).

Note that there may be more two transitions from a state with the same terminal. However (due to UPA) one will return to the &-group and the other will leave it. Each has an associated test condition, and only one can succeed.

Each transition, t, in our automaton therefore has properties:

pre(t) is a state, the state the machine must currently be in for this transition to be applied.
post(t) is the state the machine is in after the transition
terminal(p) is the terminal that this transition matches
action(p) is a list of push(), update(n), or test(min, max). It may be empty if no actions are to be performed for that transition.

This automaton does not have provision for numeric exponents - they'll be handled in the sequel.

Consider the following expression: (₀a₁, ((b⁰₂ & c¹₃){0,1}⁰ & ((d₄, e₅) | e₆)¹), e₇), where the subscripts are for identifying the states we'll produce for the automaton and the superscripts are for the branches of the &-groups. By the discussion above, we start at state 0 and produce the following transitions:

(0, 1, a, [])
(1, 2, b, [push(), push(), update(0)])
(1, 3, c, [push(), push(), update(1)])
(1, 4, d, [push()])
(1, 6, e, [push()])
(2, 3, c, [update(1)])
(2, 4, d, [test(3,3), update(0)])
(2, 6, e, [test(3,3), update(0)])
(2, 7, 3 [est(3,3), update(0), test(2,3)])
(3, 2, b, [update(0)])
(3, 4, d, [test(3,3), update(0)])
(3, 6, e, [test(3,3), update(0)])
(3, 7, e, [test(3,3), update(0), test(2,3)])
(4, 5, e, [])
(5, 2, b, [update(1), push()])
(5, 3, c, [update(1), push()])
(5, 7, e, [update(1), test(2, 3)])
(6, 7, e, [update(1), test(2,3)])

The only accepting state is 7.

3.2 Comparing Two Automata

When comparing two automata we are building a new automaton that is a subset of the product of the two input automata. Let us call the automaton for the initial content model B (for base) and the one we are testing as R (for restriction). At each point, there will be some set of possible transitions from a current state in the automaton for R for which corresponding transitions must exist in B. There are two sets of important subcases:

At least one of the transitions is into an &-group
None of the transitions is into an &-group

The latter case corresponds to traditional automata - the former to our special cases. All in all, this creates nine cases. For each, transitions can be one of:

All "ordinary" - there are no push commands associated with the transition, so it does not enter an &-group. We will call these simple transitions.
All lead into &-groups. The number of groups may actually be nested. We will call transitions leading into an &-group "&-group transitions".
A mix of the above, where there was a choice (or transparent particle) in the original content model between an &-group and some other particles.

There is also an important asymmetry between B and R for determining valid subsumptions. In particular, if some state in R is matched to a state in B by the algorithm, and one transition from the state in R goes to an &-group, then one of the following must be true:

All the transitions in B are simple choices. For example, (a & b & c) is subsumed by (a | b | c){1, unbounded}. Likewise, ((a & b & c) | d | e) is subsumed by (a | b | c | d | e){1, unbounded}. Note, additionally, there must be exactly one branch in B for each branch in R - (a & b & c) is not subsumed by (a | (b, c)) because it doesn't allow the ordering (b, a, c)
At least the transitions in B corresponding to an &-group transitions in R to the same &-group are also to an &-group, and no other choices may be in that &-group. For example, ((a & b & c) | d | e) is subsumed by ((a & b & c & f{0,1}) d | e){1, unbounded} but not by ((a & b & c & d) | e){1, unbounded}, as the latter wouldn't allow the d to come after the e. Therefore if the algorithm is matching transitions from an &-group in R and an &-group in B, it can't leave the &-group in R unless it also leaves B. To aid in this, we introduce two predicates on states:
1. &-group-only(q) returns true iff all the unmatched transitions leaving state q enter an &-group. Matched transitions are those that have already been examined by the altgorithm and used in a transition in the output.
2. mixed(q) returns true if some of the unmatched transitions leaving state q enter an &-group, while others do not.

Finally, if the current transition in R is part of an &-group, then clearly there must always be transitions in B corresponding to all the untraversed choices in the &-group in R. Because of the structure of choice groups, this means one of the following must be true:

The choice group in B is one or more groups, each containing all the choices in R - i.e., either (a | ... |b){1,unbounded} or ((a |...| b)...(a |...| b)), where the second is a list as long as the choices in the &-group in R.
The choice group is a tree of all the alternatives, where at each level there are at least those choices left from the previous level, i.e., if the grup in R is (a & b & c), the the group in B would be ((a, ((b,c)|(c,b))) | (b, ((a,c)|(c,a))) | (c, ((b,a)|(a,b)))).

If there is a sequence of choices similar to the first choice above, but not all of them contain all possible choices, then one of them, say the 3^rd, doesn't contain some choice, say c. Then the sequence doesn't support those alternatives with c in the 3^rd position. Therefore every choice group in the sequence must contain all the alternatives.

As mentioned this is very asymmetric - if all transitions from R are simple;, then it doesn't matter whether transitions in B are simple or &-group transitions. If the transition in R matches a simple transition in B, then we proceed as usual. If it matches an &-group transition in B, then the successor states in R must eventually match all the branches in B, but they only need to match one ordering of them. As a result, when traversing states for an &-group it is not necessary to check all sequences.

This provides enough information to generate an algorithm. Because we need to check all alternatives for each state, we need to explicitly manage the &-group stacks for both B and R. As with the traditional algorithm, we invert the states in the base, create a machine from the cross-product of the states of the input machines, and determine that there are no states in the resulting machine that includes accepting states from both input machines.

cross-product(q_R, q_B, R-stack, B-stack, output) is

if (&-group-only(q_R) and mixed(q_B)) then fail.

if &-group-only(q_R) and not &-group-only(q_B then

either

for each transition (q_R,q'_R,α,actions_R) there is a transition (q_R,q'_R,α,actions_R)

for each active transition (q_R,q'_R,α,actions_R) there is a transition (q_R,q'_R,α,actions_R) and not ((q_R,q_B),(q'_R,q'_B)α) ∈ output

or fail.

for each active transition (q_R, α, q'_R, actions_R) do

save top(R-stack) -> hold-R. (As the list of commands may include multiple push() and update() commands, this may involve storing more than one item from the stack. The process is reversed after the recursion.)

find transition an active transition (q_B, α, q'_B, actions_B) or fail.

save top(B-stack) -> hold-B.

if (not ((q_R,q_B), α, (q'_R,q'_B)) ∈ output) then

if the number of test() operations in actions_R ≠ number of test() operations in actions_B then fail.

add ((q_R,q_B), α, (q'_R,q'_B)) to output.

apply actions_R to R-stack.

apply actions_B to B-stack.

cross-product(q'_R, q'_B, R-stack, B-stack, output).

update R-stack from hold-R.

update B-stack from hold-B

We start the algorithm with the start states from both machines, two empty stacks, and an empty set for the output states.

3.3 An example

As a simple example, we test whether the expression above subsumes (a, ((d, e) | e), e). If we show it as above, (₀a₁, ((d₂, e₃) | e₄), e₅) we can see that the transitions are:

(0, 1, a, [])
(1, 2, d, [])
(1, 4, e, [])
(2, 3, e, [])
(3, 5, e, [])
(4, 5, e, [])

The only accepting state is 5.

We start at state 0 for both machines. Since we invert the accepting states of the base machine, its accepting states are 0 - 6, but not 7.

There is only one transition in R for state 0, and only one for B. We add ((0, 0), (1, 1), a) to the output. Neither stack is changed.
We recurse with R state 1 and B state 1. There are two transitions in R from state 1. We start with (1, 2, d, []). This matches (1, 4, d, [push()]) in B. We add ((1,1), (2,4), d) to the output machine and push 0 on the B stack.
We recurse with R state 2 and B state 4. There is one transition in R - (2, 3, e, []). There is likewise only one transition in B - (4, 5, e, []). We add ((2,4), (3, 5), e) to the output.
We recurse with R state 3 and B state 5. The transition in R is (3, 5, e, []). In B we have (5, 7, e, [update(1), test(2,3)]). We need to test the B transition. We xor 10 binary with the top of the B stack, or 00, to get 10. This is in the acceptable range (10 ≤ 10 ≤ 11). As we are transitioning out of the group we pop the B stack (now empty) and add ((3,5), (5,7), e) to the output.
We recurse again with R's state 5 and B's state 7. There are no transitions, so we return to the nearest branch (fixing the stacks as we go up), at 2 above. We take R's other transition, (1, 4, e, []). This matches (1, 6, e, [push()]) in B. We add ((1, 1), (4, 6), e) to the output, and push 0 on the B stack.
We recurse with 4 and 6. In R there is only (4, 5, e, []). In B this matches (6, 7, e, [update(1), test(2, 3)]). As with steps 4 we arrive at the end, adding ((4, 6), (5, 7), e) to the output.
We terminate as there are no choices left.

The states of the output machine are ((0,0), (1,1), (2,4), (3, 5), (5,7), (4,6)). As the only accepting state in R is 5, and the only non-accepting state in B is 7, we conclude that B subsumes R.

3.4 Subsumption with Numeric Exponents

Numeric exponents add significant complexity to subsumption testing. In particular, the nesting of numeric exponents, both in B and R, potentially require testing a large number of alternatives to ensure for each path through R there is at least one path through B. And yet, in many cases this is not necessary because of overlap in the areas covered by the exponents.

We will first give a couple of example of exponents and valid and invalid content models. From that, we will be able to extract some important aspects of the problem.

For example, consider the following: (a{4,5}{2,3}). Here we have a sequence of either 4 or 5 repetitions of a, repeated 2 or 3 times. This means the following are legal:

aaaaaaaa
aaaaaaaaa
aaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaaa
aaaaaaaaaaaaaa
aaaaaaaaaaaaaaa

In other words, sequences of 8 - 10 or 12 - 15 a's. However, if we have, instead, (a{4,5}{6,7}), the we have a series of 24-30 (from 6 X 4 to 6 X 5) and a series of 28 - 35. But since these overlap, we can just consider this a series of 24 to 35. I actually takes a certain amount of effort to construct example that do not have this overlap.

More generally, regarding subsumption, consider two content models, (ρ{m_0,,n₀}...{m_i,n_i}) and (β{m'₀,n'₀}...{m'_j,n'_j}). We wish to establish if the latter subsumes the former. Clearly we must first establish if β subsumes ρ. Then we must determine that all possible ranges of ρ are also possible ranges of β. We can enumerate the ranges by choosing all the different m and n for each subscript - from (m₀*m₁*...*m_i, n₀*m₁*...*m_i) to (m₀*n₁*...*n_i, n₀*n₁*...*n_i). The maximum number of distinct ranges of ρ is 2ⁱ and of β is 2^j. If β subsumes ρ, then we only need to determine that each range of ρ is contained within a range of β. We will use ranges(n) to refer to the ranges of the first n values in a list of exponents.

The more complex case is when the scoping is not so clear. For example, if B is (a | b){4,12} and R is (a{2,3},b{5,7}), then the iterations of a are insuficient without the additional iterations of b. Also consider B as (((a | b){3,5},c{0,1}){6,9},d) and R as ((a,b){20,25},c,d). Here, the repetitions of (a,b) in R must account for both the inner and outer exponents in B.

We can consider each terminal and each transition in B as having a nesting depth, depending on how many exponents are crossed from source to sink. Before a transition may be taken, all the "obligations" at higher levels must be fulfilled. For example, in B above, the d is at level 0 (not in the scope of any exponent) while a, b, and c are at level two. Therefore the transition from a to d is at level 0, and the exponent at levels 1 and 2 must have been satisfied before that transition could be crossed. At the other end, there is a transition from a to b at level 2, so that is counted against the exponent at that level ({3,5}), as well as a transition at level 1, so the number of iterations of a's and b's is either between 3 and 5, or over 18.

As with automata for &-group, we will need a stack for each exponent and some extra commands to attach to each transition. Each entry on the stack has an exponent and a counter to track the number of iterations completed. The counter has a pair to keep track of min and max iterations from the other automaton and a single number to track the number of matches from the innermost particle. In this case there are three kinds of command:

push({m,n}) pushes an entry on the stack, initially ({m,n}, 0). These commands are put on transitions entering the scope of an exponent.
increment increases the innermost counter. These commands are put on transitions that complete the particle contained in an exponent.
check(n) tests whether the counter satisfies the innermost n exponents.
upto(m) gives the maximum nesting level that a transition can affect.

It is possible for a single transition to have commands of all three types.

For example, the expression (₀((a₁ | b₂){10,11}²,c₃{0,1}²){6,9}¹,d₄) translates to the following transitions:

(0, 1, a, [push({6,9}), push({10,11}), increment])
(0, 1, b, [push({6,9}), push({10,11}), increment])
(1, 1, a, [increment])
(1, 2, b, [increment])
(1, 3, c, [check(2), upto(1), increment])
(1, 4, d, [check(1)])
(2, 1, a, [increment])
(2, 2, b, [increment])
(2, 3, c, [check(2), upto(1), increment])
(2, 4, d, [check(1)])
(3, 1, a, [push({10,11}), increment])
(3, 2, b, [push({10,11}), increment])
(3, 4, d, [check(1)])

The algorithm follows:

cross-product(q_R, q_B, stack_R, stack_B, output) is

for each transition (q_R,q'_R, α, actions_R) do

for each transition (q_B,q'_B, α, actions_B) where

((q_R,q_B),(q'_R,q'_B), α) not ∈ output

({m_R,n_R}, iter_R) = top(stack_R);

({m_B,n_B}, iter_B) = top(stack_B);

if check(i_R) = actions_R[0] and check(i_B) = actions_B[0] then do

for {k_R,l_R} in ranges_R(i)

for {k_B,l_B} in ranges_B(i_B)

if l_R * iter_B > k_B * iter_R and upto(j_B) = actions_B[1] then

newk = max(0, k_R * iter_B - k_B * iter_R)

newl = l_R * iter_B - k_B * iter_R

for {newm, newn} in range_B(j_B + 1)

pop stack_B to j_B

({m_B,n_B},iter_B) = top(stack_B);

set top of stack_B to ({m_B - (iter_B + newk div newm),n_B - (iter_B + newl div newm)}, 0)

add ((q_R, q_B), (q'_R, q'_B), α) to output.

pop stack_R

apply remaining actions to stacks.

cross-product(q'_R, q'_B, stack_R, stack_B, output)

restore top of stacks.

else if iter_R * k_B > iter_B * k_R

add ((q_R, q_B), (q'_R, q'_B), α) to output.

pop stack_R

set top of stack_B to ({(iter_R * k_B - iter_B * k_R) div iter_R, iter_R * l_B - (iter_B * l_R) div iter_R}, 0)

if (iter_R * l_B - iter_B * l_R) - (iter_R * k_B - iter_B * k_R) ≥ 0 then

cross-product(q'_R, q'_B, stack_R, stack_B, output)

else if iter_R * k_B ≤ iter_B * k_R ≤ iter_B * l_R ≤ iter_R * l_B then

pop stack_B

pop stack_R

apply remaining actions to stacks.

cross-product(q'_R, q'_B, stack_R, stack_B, output)

restore top of stacks.

else if check(i) = actions_R[0] and q'_B ≤ q_B (iteration) then

if iter_B * n_R < n_B do

save top of both stacks.

apply actions_R to stack_R.

set top of stack_B to ({m_B - iter_B* m_R,n_B - iter_B * m_R}, 0)

apply actions_B after the initial increment action to stack_B.

cross-product(q'_R, q'_B, stack_R, stack_B, output)

restore top of stack.

else if check(i) = actions_R(0) and

m_R ≤ iter_R ≤ n_R then

save top of both stacks.

apply actions to stacks.

cross-product(q'_R, q'_B, stack_R, stack_B, output)

restore both stacks.

else if check(j) = actions_B[0] then do

save top of both stacks.

apply actions to stacks.

if m_B ≤ iter_B ≤ n_B then

cross-product(q'_R, q'_B, stack_R, stack_B, output)

restore both stacks.

else

save top of both stacks.

apply actions to stacks.

cross-product(q'_R, q'_B, stack_R, stack_B, output)

restore both stacks.

3.5 An example

We'll now give an example of this algorithm in action. We'll use the expression ((a, b){40,43}, c, d), which translates to the following automaton:

(0, 1, a, [push({40, 43})]
(1, 2, b, [increment])
(2, 1, a, [])
(2, 3, c, [check(1)])
(3, 4, d, [])

First, we flip accepting states in B. As the only accepting state was 4, we now have 4 as the only non-accepting state. We then start with state 0 from both machines. For each, there is one transition across a. We push ({40, 43}, {0, 0}) on the R stack and both ({6, 9}, {0, 0}) and ({10, 11}, {0, 0}) on the B stack, then increment B, so the B stack is [({10, 11}, {1, 1}),({6, 9}, {0, 0})]. We are in states 1_R and 1_B. We add ((0, 0), (1, 1), a) to the output. Now the only transitions are across b in both. We execute (1, 2, b, [increment]) in R and then B. The R stack is now [({40, 43}, {1, 1})] and the B stack is [({10, 11}, {2, 2}),({6, 9}, {0, 0})]. We add ((1, 1), (2, 2), b) to the output.

There are now two possibilities in R, either transition 2 or 3 above. If we traverse the a path, we bring ourselves back to state 1 in both machines, but we add ((2, 2), (1, 1), a) to the output.

If we traverse the c, we find ourselves in the most complex part of the algorithm. The R transition is 4, above. The B transition is 5. Top of the R stack is ({40, 43}, {1, 1}), top of the B stack is ({10, 11}, {2, 2}). As 40 * 2 > 10 * 1, and 1 < 2 in check(2, 1), we take the first branch. We set newk to 70 and newl to 76. The only values for newm and newn are 10 and 11. As 6 ≤ 7 ≤ 7 ≤ 9, we set the B stack to [({0, 2}, {0, 0})]. We add ((2, 2), (3, 3) c) to the output and continue. We perform the last command from the B transition, and increment, and we pop the top of the R stack. The R stack is now empty and the B stack is [({0, 2}, {1, 1})]. From here there is one transition in R, across d, and one corresponding one in B. In R we end up in state 4 with an empty stack. In B, we check the first entry on the stack. As 0 ≤ 1 ≤ 1 ≤ 2, we pass and pop the stack. So we add ((3, 3), (4, 4), d) to the output. Checking the output, we see there is no state with an accepting state from both input machines, so we conclude that B subsumes R.

4 Extension to XML Schema

The algorithms as described work on terminals. However, XSD content models generally use non-terminals, such as references to substitution groups and namespace wildcards, which resolve to sets of qualified names. It is the qualified names that are the actual terminals. In this section we will outline how the algorithms given above can be extended to cover the XSD case.

We can see how XSD "terminals" reference sets in the following examples.

The declaration <element ref="foobar"/> matches not just the element named foobar, but any element in its substitution group.
The particle <any namespace="http://example.com"> matches any element in the given namespace.
An element in the substitution group of foobar can also be in the namespace "http://example.com".

Clearly, in determining UPA or subsumption we will need to consider these sets and their intersections.

While namespaces are flat, substitution groups form a tree - if R is in the substitution group of B, then the substitution group of R is a subset of the substitution group of B. Each node of this tree is a set; the structure is determined by declarations in the schema. From our perspective, we needn't worry about the particulars.

For UPA, the issue is no longer whether the same terminal appears twice, but whether any two sets named contain a common value. Since every non-empty set has a non-empty intersection with itself, we can update the UPA algorithm so that whenever two set names appear in a first or follow set together, we check that they have empty intersection. That way there can never be an occasion where the same qualified name appears in two transitions from the same state.

For subsumption testing, when we compare two transitions, such as (q_R,q'_R,ρ,[...]) from R and (q_B,q'_B,β,[...]) from B, it is not necessary that ρ and β be the same terminal, but only that ρ be a subset of β. That ensures that any qualified name matched by the transition in R is also matched by the transition in R.

It is an unfortunate side effect of this approach that UPA testing, in particular, needs to be done for every combination of element declarations a content model appears in, rather than just once. This limits the reusability of schema constructs. It also allows some very unexpected constructs; for example, a wildcard can be a restriction of a substitution group if it just so happens that all the elements defined in the namespace are also in the substitution group. There are ways to address this issue, but they are beyond the scope of this paper.

5 Conclusion

We've shown algorithms for UPA and subsumption testing for an extension to regular expression covering numerice exponents and general all-groups. Numeric exponents are a new feature added by XSD and all-groups were a feature of SGML dropped by XML 1.0 to simplify development. Both of these are important features where previously published algorithms were very expensive. While we've concentrated on the traditional notion of terminals, we have shown how XSD's wildcards and substitution groups can be handled.

The run time performance of these algorithms is very reasonable compared to alternatives. For example, naive algorithms that would unroll the numeric exponents and then determinise them in some for to test UPA would run in doubly exponential time - firstly, unrolling the exponents is O(n * e^d), where n is the size of a group, e is the size of the exponents, and d is their nesting depth. Given minimal and maximal values, the resulting automaton is highly non-deterministic, even where the original had UPA (for example (a{0,10}{0, 10}) expands to 100 (10²) optional a's). Any attempt to determinise this using the traditional subset construction method AHU is again exponential in the size of the input (as there may be an exponential blowup in states), giving O(2^{n * e^d}), which is fairly large. Any subset determination on this result is huge compared to the input.

In our case, the performance of the UPA algorithm remains quadratic with the input - only slightly worse than the algorithm on just kleene algorithms. The performance of the subsumption algorithm is only exponential in the depth of the nesting of exponents, or O(n*2^d). This is comparatively reasonable.

Bibliography

[MUR95] Murata Makoto, "Forest-Regular Languages and Tree-Regular Languages". Technical Report. Fuji Xerox, Japan, 1995.

[MUR99] Murata Makoto, "Hedge Automata: A Formal Model for XML Schemata", 1999.

[CHID00] Boris Chidlovskii, "Using Regular Tree Automata as XML Schemas", Proceedings of IEEE Advances in Digital Libraries 2000 Washington, USA, 2000, eds., J. Hoppenbrouwers, et al.

[TATA] Hubert Comon, et al, Tree Automata Techniques and Applications, 1998

[WXS] Henry Thompson, et al., XML Schema Part One: Structures, World Wide Web Consortium, May 2001, http://www.w3.org/TR/xmlschema-1/

[FD] Allen Brown, et al., XML Schema: Formal Description, World Wide Web Consortium, Sept. 2001, http://www.w3.org/TR/xmlschema-formal/

[USCM] Andreas Neumann, Unambiguity of SGML Content Models - Pushdown Automata Revisited 3rd Int. Conf. on the Developments in Language Theory (DLT'97), Juli 1997, Thessaloniki, Griechenland, in: Symeon Bozapalidis, Hrsg., Proceedings of the 3rd International Conference Developments in Language Theory, S. 507-518, Aristotle University of Thessaloniki, 1997

[AHU] Alfred Aho et al., The Design and Analysis of Computer Algorithms, Addison-Wesley Pub., 1974

Supporting UPA and Restriction on an Extension of XML Schema

Abstract

1 Introduction

2 Evaluating the Unique Particle Attribute Constraint

2.1 Plan of Attack

2.2 Sources of UPA violations

2.2.1 External violations

2.2.2 Internal ambiguity

2.3 Final UPA Algorithm

3 Testing Subsumption

3.1 Subsumption with &-groups

3.2 Comparing Two Automata

3.3 An example

3.4 Subsumption with Numeric Exponents

3.5 An example

4 Extension to XML Schema

5 Conclusion

Bibliography