- From: Burak Emir <Burak.Emir@epfl.ch>
- Date: Mon, 11 Oct 2004 17:50:33 +0200
- To: "Henry S. Thompson" <ht@inf.ed.ac.uk>
- CC: xmlschema-dev@w3.org, xml-dev@lists.xml.org
Sorry to play the spoiler. I don't want to say that the improvement is
not significant, however...

Henry S. Thompson wrote:

>Jeff Rafter <lists@jeffrafter.com> writes:
>
>>Does this mean that you have modified the approach set forth in
>>http://www.ltg.ed.ac.uk/~ht/XML_Europe_2003.html?
>
>Yes. That approach unfolds numeric exponents, producing a number of
>states which grows linearly with maxOccurs for elements and
>non-nesting groups, and grows exponentially for nesting groups; that
>is, if you have (a, b{2,5}, c){1,100} the number of states is
>Order(500).

That is not exponential, but linear in the product of all the
"maxOccurs" bounds.

>The new approach is linear in the number of particles, because it uses
>counters, not unfolding.

Strictly speaking, it was also linear before: since all numeric bounds
are constants, there is no change in asymptotic complexity (which is
blind to constant-factor improvements). In practice, these constants
are very large, and most probably much larger than the number of
particles of the regular expression. You don't need to argue about
computational complexity here to underline the improvement.

>>If so, do you have any metrics on changes to the time complexity?
>
>Should not change the _runtime_ complexity at all, which was always
>the normal FSM complexity. Changes the _compile-time_ complexity
>significantly, from exponential to linear.

Am I missing something here? Actually, the best compile-time complexity
is quadratic in the number of particles, whether you unfold or not.
That is a direct consequence of the fact that all XML Schema content
models are 1-unambiguous, so the Berry-Sethi construction [1] yields a
deterministic automaton directly. (If they were not 1-unambiguous, you
would need to apply this construction and then make the automaton
deterministic afterwards, which is worst-case exponential time because
of state-space explosion.)

The worst-case size of the automaton the Berry-Sethi construction
produces for a regular expression is quadratic, where size means the
number of transitions. If you define size as the number of states, it
is linear, but that is not very useful for assessing time complexity,
because it is the transitions, not the states, that are actually
stored somewhere.

If your compile-time complexity really was exponential, it just means
you used a construction that was not optimal. In particular, using
epsilon transitions is a waste of time (though they make things much
easier): they give you a nondeterministic automaton, even though you
could have constructed a deterministic one right away!

Hope you find this useful.

cheers,
Burak
http://lamp.epfl.ch/~buraq

[1] For Berry-Sethi, rather than looking up the original paper (Gerard
Berry and Ravi Sethi, "From regular expressions to deterministic
automata", Theoretical Computer Science), get a more recent perspective
from <shameless_plug/> Burak Emir, "Compiling Regular Patterns to
Sequential Machines",
http://icwww.epfl.ch/publications/documents/IC_TECH_REPORT_200472.pdf
or search for papers by Anne Brueggemann-Klein.
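
For concreteness, a back-of-the-envelope illustration of the unfolding
arithmetic discussed above: the unfolded state count grows with the
product of the nested maxOccurs bounds, not exponentially in the size
of the expression. This is a sketch in Python, not code from either
implementation, and the function is hypothetical.

    # Rough model of unfolding numeric exponents: each repetition level
    # duplicates the body maxOccurs times, so the unfolded state count
    # is (inner particles) * (product of the maxOccurs bounds).
    def unfolded_states(inner_particles, max_occurs_bounds):
        states = inner_particles
        for bound in max_occurs_bounds:
            states *= bound  # one copy of the body per repetition
        return states

    # (a, b{2,5}, c){1,100}: unfolding b{2,5} leaves about 7 particles
    # inside the group, and the outer {1,100} multiplies that by 100 --
    # the same product-of-bounds growth behind the "Order(500)" figure.
    print(unfolded_states(7, [100]))  # 700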
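The counter approach mentioned above keeps one state per particle and
tracks repetitions in an integer instead of duplicating states. A
minimal sketch for a single bounded repetition sym{lo,hi}; the function
and its shape are hypothetical, not Henry's implementation.

    # Match a sequence of `sym` repeated between lo and hi times using
    # one counter instead of hi unfolded states.
    def match_bounded(symbols, sym, lo, hi):
        count = 0
        for s in symbols:
            if s != sym or count == hi:  # wrong symbol, or maxOccurs hit
                return False
            count += 1                   # same state, counter increments
        return lo <= count <= hi         # minOccurs checked at the end

    print(match_bounded("bbb", "b", 2, 5))      # True
    print(match_bounded("b", "b", 2, 5))        # False: below minOccurs
    print(match_bounded("b" * 6, "b", 2, 5))    # False: above maxOccurs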
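Finally, a toy version of the Berry-Sethi (Glushkov) position
construction discussed above, assuming the usual
nullable/first/last/follow formulation; class and function names are
invented for illustration, and the real treatment is in the report
cited as [1]. States correspond to symbol occurrences, so they are
linear in the expression, while the follow sets -- the transitions --
can be quadratic in the worst case (e.g. (a1|...|an)*). For a
1-unambiguous expression the resulting automaton is deterministic as
built, with no subset construction needed.

    # Toy Berry-Sethi construction over an AST with symbols,
    # concatenation, alternation and star. Each symbol occurrence gets
    # a position; the transitions are the `follow` sets.
    class Sym:
        def __init__(self, ch):
            self.ch = ch

    class Cat:
        def __init__(self, left, right):
            self.left, self.right = left, right

    class Alt:
        def __init__(self, left, right):
            self.left, self.right = left, right

    class Star:
        def __init__(self, expr):
            self.expr = expr

    def analyze(e, positions, follow):
        """Return (nullable, first, last); fill in follow per position."""
        if isinstance(e, Sym):
            pos = len(positions)
            positions.append(e.ch)
            follow.append(set())
            return False, {pos}, {pos}
        if isinstance(e, Alt):
            n1, f1, l1 = analyze(e.left, positions, follow)
            n2, f2, l2 = analyze(e.right, positions, follow)
            return n1 or n2, f1 | f2, l1 | l2
        if isinstance(e, Cat):
            n1, f1, l1 = analyze(e.left, positions, follow)
            n2, f2, l2 = analyze(e.right, positions, follow)
            for p in l1:
                follow[p] |= f2      # last of left links to first of right
            return (n1 and n2,
                    (f1 | f2) if n1 else f1,
                    (l2 | l1) if n2 else l2)
        if isinstance(e, Star):
            _, f, l = analyze(e.expr, positions, follow)
            for p in l:
                follow[p] |= f       # loop back from last to first
            return True, f, l

    # a(b|c)*d is 1-unambiguous: each follow set reaches every symbol
    # at most once, so the position automaton is deterministic as built.
    expr = Cat(Cat(Sym('a'), Star(Alt(Sym('b'), Sym('c')))), Sym('d'))
    positions, follow = [], []
    nullable, first, last = analyze(expr, positions, follow)
    for p, targets in enumerate(follow):
        print(positions[p], '->', sorted(positions[q] for q in targets))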
Received on Monday, 11 October 2004 15:50:47 UTC