- From: LdBeth <andpuke@foxmail.com>
- Date: Tue, 28 Jan 2025 17:01:55 -0600
- To: "Liam R. E. Quin" <liam@fromoldbooks.org>
- Cc: David Birnbaum <djbpitt@gmail.com>, ixml <public-ixml@w3.org>
- Message-ID: <tencent_EBFC69BAF175FB39BEF5AC60A0E9EC59A90A@qq.com>
>>>>> In <a23cd9c3353804f332788536e1ca7b1344056414.camel@fromoldbooks.org> >>>>> "Liam R. E. Quin" <liam@fromoldbooks.org> wrote: >> >> Regular expression *with* conditional text replacement is very >> powerful, even if the regex library been used itself is limited to >> regular languages (without backtracking/lookup ahead etc), with >> control flow it can still be used to simulate these advanced PCRE >> features with ease. > Ironically i had filed an issue to add lookahead assertions to qt4, and > it was turned down today. Maybe if i¢d been able to get to the meeting > and argue for how useful they are... :( > ... > Regular expressions do of course allow backtracking (unlike SGML DTD > content models), and back-references probably make them not be regular > but i¢m not certain. Consider a.*b given abcdefbghi... The Unix/Plan 9 regexp library which compiles to NFA, by what I mean a very limited version of regexp engine, does handle this without backtracting. The trick is doing a bfs style thing that progresses all the possible states at same time, until been narrowed done to one path or reach the end of input. The rationale is plan for the worst case, "In an NFA with n nodes, there can only be n reachable states at any step, but there might be 2n paths through the NFA." So for "a.*b" the NFA is +----------------+ | | (a) ---@---> (.) ---+ +---> [b] ^ | +------------+ if the input is "abcdefbghi", when the second char "b" is accepted the NFA tracks both the state (.) and the end state [b] with the accepted length. After the second "b" is reached the previous state [b] is invalidated and updated with a new length and the (.) state is still progressed until the end of string, leaving the longest match "abcdefbg". Of course back reference cannot be supported by this way and they are also not "regular", not even non-greedy matches are supported.[1] That aside, sed(1) is still Turing complete. [1]: https://swtch.com/~rsc/regexp/regexp1.html I'm not in favor of Thompson NFA nor backtracking based implementation though, and I think XQuery/XSLT can largely benefit from a powerful regexp engine. > In a way the nearest thing to iXML might actually be Perl 6 regular > expressions, that have an explicit grammar notation available. So i > wonder if there¢s debugging experience there. https://www.txl.ca would be the closest I can think, which already has a tokenizer applied before parsing so it handles many whitespace insensitive programming languages, a transform language that I think strikingly close resembles XSLT, allows ambiguity in grammar, and resolves that by using type directed pattern matching. > For something small, or for picking small pieces of data out of longer > text, regular expressions are a win, because they are terse and because > of .* (and .+ and .*? and friends) which means in effect you can write > a partial grammar and rely on the regexp engine doing backtracking. Agree. When I develop ixml markup, I find my self more often exploit the ability to do backtracking. Best, ldbeth
Received on Tuesday, 28 January 2025 23:07:12 UTC