- From: Burak Emir <Burak.Emir@epfl.ch>
- Date: Mon, 11 Oct 2004 17:50:33 +0200
- To: "Henry S. Thompson" <ht@inf.ed.ac.uk>
- CC: xmlschema-dev@w3.org, xml-dev@lists.xml.org
Sorry to play the spoiler. I don't want to say that the improvement is
not significant, however...

Henry S. Thompson wrote:

>Jeff Rafter <lists@jeffrafter.com> writes:
>
>>Does this mean that you have modified the approach set forth in
>>http://www.ltg.ed.ac.uk/~ht/XML_Europe_2003.html?
>
>Yes. That approach unfolds numeric exponents, producing a number of
>states which grows linearly with maxOccurs for elements and
>non-nesting groups, and grows exponentially for nesting groups; that
>is, if you have (a, b{2,5}, c){1,100} the number of states is
>Order(500).

That is not exponential, but linear in the product of all the
"maxOccurs" bounds.

>The new approach is linear in the number of particles, because it uses
>counters, not unfolding.

Strictly speaking, it was also linear before: since all numeric bounds
are constants, there is no change in asymptotic complexity (which is
blind to constant-factor improvements). In practice, these constants
are very large, and most probably much larger than the number of
particles of the regular expression. You don't need to argue about
computational complexity here to underline the improvement.

>>If so, do you have any metrics on changes to the time complexity?
>
>Should not change the _runtime_ complexity at all, which was always
>the normal FSM complexity. Changes the _compile-time_ complexity
>significantly, from exponential to linear.

Am I missing something here? Actually, the best compile-time complexity
is quadratic in the number of particles, whether you unfold or not.
That is a direct consequence of the fact that all XML Schema content
models are 1-unambiguous, so the Berry-Sethi construction [1] yields a
deterministic automaton directly. (If they were not 1-unambiguous, you
would need to apply this construction and then make the automaton
deterministic afterwards, which is worst-case exponential time because
of state-space explosion.)

The worst-case size of the automaton the Berry-Sethi construction
produces for a regular expression is quadratic, where size means the
number of transitions. If you define size as the number of states, it
is linear, but that is not very useful for assessing time complexity,
because it is the transitions, not the states, that are actually
stored somewhere.

If your compile-time complexity really was exponential, it just means
you used a construction that was not optimal. In particular, using
epsilon transitions is a waste of time (though they make things much
easier): they give you a nondeterministic automaton, even though you
could have constructed a deterministic one right away!

Hope you find this useful.

cheers,
Burak
http://lamp.epfl.ch/~buraq

[1] For Berry-Sethi, rather than looking up the original paper (Gerard
Berry and Ravi Sethi, "From regular expressions to deterministic
automata", Theoretical Computer Science), get a more recent perspective
from <shameless_plug/> Burak Emir, "Compiling Regular Patterns to
Sequential Machines",
http://icwww.epfl.ch/publications/documents/IC_TECH_REPORT_200472.pdf
or search for papers by Anne Brueggemann-Klein.
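
For concreteness, a back-of-the-envelope illustration of the unfolding
arithmetic discussed above: the unfolded state count grows with the
product of the nested maxOccurs bounds, not exponentially in the size
of the expression. This is a sketch in Python, not code from either
implementation, and the function is hypothetical.

    # Rough model of unfolding numeric exponents: each repetition level
    # duplicates the body maxOccurs times, so the unfolded state count
    # is (inner particles) * (product of the maxOccurs bounds).
    def unfolded_states(inner_particles, max_occurs_bounds):
        states = inner_particles
        for bound in max_occurs_bounds:
            states *= bound  # one copy of the body per repetition
        return states

    # (a, b{2,5}, c){1,100}: unfolding b{2,5} leaves about 7 particles
    # inside the group, and the outer {1,100} multiplies that by 100 --
    # the same product-of-bounds growth behind the "Order(500)" figure.
    print(unfolded_states(7, [100]))  # 700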
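The counter approach mentioned above keeps one state per particle and
tracks repetitions in an integer instead of duplicating states. A
minimal sketch for a single bounded repetition sym{lo,hi}; the function
and its shape are hypothetical, not Henry's implementation.

    # Match a sequence of `sym` repeated between lo and hi times using
    # one counter instead of hi unfolded states.
    def match_bounded(symbols, sym, lo, hi):
        count = 0
        for s in symbols:
            if s != sym or count == hi:  # wrong symbol, or maxOccurs hit
                return False
            count += 1                   # same state, counter increments
        return lo <= count <= hi         # minOccurs checked at the end

    print(match_bounded("bbb", "b", 2, 5))      # True
    print(match_bounded("b", "b", 2, 5))        # False: below minOccurs
    print(match_bounded("b" * 6, "b", 2, 5))    # False: above maxOccurs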
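Finally, a toy version of the Berry-Sethi (Glushkov) position
construction discussed above, assuming the usual
nullable/first/last/follow formulation; class and function names are
invented for illustration, and the real treatment is in the report
cited as [1]. States correspond to symbol occurrences, so they are
linear in the expression, while the follow sets -- the transitions --
can be quadratic in the worst case (e.g. (a1|...|an)*). For a
1-unambiguous expression the resulting automaton is deterministic as
built, with no subset construction needed.

    # Toy Berry-Sethi construction over an AST with symbols,
    # concatenation, alternation and star. Each symbol occurrence gets
    # a position; the transitions are the `follow` sets.
    class Sym:
        def __init__(self, ch):
            self.ch = ch

    class Cat:
        def __init__(self, left, right):
            self.left, self.right = left, right

    class Alt:
        def __init__(self, left, right):
            self.left, self.right = left, right

    class Star:
        def __init__(self, expr):
            self.expr = expr

    def analyze(e, positions, follow):
        """Return (nullable, first, last); fill in follow per position."""
        if isinstance(e, Sym):
            pos = len(positions)
            positions.append(e.ch)
            follow.append(set())
            return False, {pos}, {pos}
        if isinstance(e, Alt):
            n1, f1, l1 = analyze(e.left, positions, follow)
            n2, f2, l2 = analyze(e.right, positions, follow)
            return n1 or n2, f1 | f2, l1 | l2
        if isinstance(e, Cat):
            n1, f1, l1 = analyze(e.left, positions, follow)
            n2, f2, l2 = analyze(e.right, positions, follow)
            for p in l1:
                follow[p] |= f2      # last of left links to first of right
            return (n1 and n2,
                    (f1 | f2) if n1 else f1,
                    (l2 | l1) if n2 else l2)
        if isinstance(e, Star):
            _, f, l = analyze(e.expr, positions, follow)
            for p in l:
                follow[p] |= f       # loop back from last to first
            return True, f, l

    # a(b|c)*d is 1-unambiguous: each follow set reaches every symbol
    # at most once, so the position automaton is deterministic as built.
    expr = Cat(Cat(Sym('a'), Star(Alt(Sym('b'), Sym('c')))), Sym('d'))
    positions, follow = [], []
    nullable, first, last = analyze(expr, positions, follow)
    for p, targets in enumerate(follow):
        print(positions[p], '->', sorted(positions[q] for q in targets))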
Received on Monday, 11 October 2004 15:50:47 UTC