RE: Ambiguous content models -- allowed or disallowed by XSDL? from Lehmann, Steen on 2002-06-13 (xmlschema-dev@w3.org from June 2002)

From: Lehmann, Steen <slehmann@silverstream.com>
Date: Thu, 13 Jun 2002 14:00:16 +0200
To: "'Ian Stokes-Rees'" <ijs@decisionsoft.com>
Cc: xmlschema-dev@w3.org
Message-ID: <FD8A740186525243807FBCDB7E739D0585F3A6@EXBE1.silverstream.be>

>> Now, to turn this NFA into a corresponding DFA we "collapse" the empty
>> transitions from each state and copy transitions from states we can reach
by
>> following empty transitions back to the original state (known as
evaluating
>> lambda closure). The resulting DFA has no empty transitions, so at each
>> state we know immediately if a parsed symbol is legal (there is a
transition
>> somewhere for that symbol) without a need for backtracking. 
 
> Without working it out in detail, my guess is that you get exponential
> growth in the number of required states if you have multiple nested
> ambiguous content models.  In practical terms, while it would be
> theoretically possible, it is my guess that either the 
> implementation or
> the operation (execution) of a parser which allowed this would be
> difficult.  Is that a fair assessment?

The worst case scenario for a complex expression is indeed an exponential
number of states, the text book example being the expression

(a|b)* a (a|b)(a|b)(a|b) ... (a|b) 

where (a|b) is repeated N times - this requires 2^N states. Fortunately
there are ways around this, lazily constructing the DFA being one of them. A
good reference to parsing algorithms and optimization is: "Compilers:
Principles, Techniques and Tools" by Aho, Sethi and Ullman (referred to as
the "dragon" book).

Whether or not this is difficult is a matter of taste, I suppose. If you
have previous experience with parsing you may find it easy because it uses
well understood algorithms, on the other hand I would not want to use this
particular approach if I had to write a validator using a scripting
language.

-- Steen

Received on Thursday, 13 June 2002 08:00:49 UTC