Re: Ambiguity (what else!?) question from LdBeth on 2025-01-28 (public-ixml@w3.org from January 2025)

From: LdBeth <andpuke@foxmail.com>
Date: Tue, 28 Jan 2025 17:01:55 -0600
To: "Liam R. E. Quin" <liam@fromoldbooks.org>
Cc: David Birnbaum <djbpitt@gmail.com>, ixml <public-ixml@w3.org>
Message-ID: <tencent_EBFC69BAF175FB39BEF5AC60A0E9EC59A90A@qq.com>

>>>>> In <a23cd9c3353804f332788536e1ca7b1344056414.camel@fromoldbooks.org> 
>>>>>	"Liam R. E. Quin" <liam@fromoldbooks.org> wrote:
>> 
>> Regular expression *with* conditional text replacement is very
>> powerful, even if the regex library been used itself is limited to
>> regular languages (without backtracking/lookup ahead etc), with
>> control flow it can still be used to simulate these advanced PCRE
>> features with ease.

> Ironically i had filed an issue to add lookahead assertions to qt4, and
> it was turned down today. Maybe if iĸd been able to get to the meeting
> and argue for how useful they are... :(
> ...
> Regular expressions do of course allow backtracking (unlike SGML DTD
> content models), and back-references probably make them not be regular
> but iĸm not certain. Consider a.*b given abcdefbghi... 

The Unix/Plan 9 regexp library which compiles to NFA, by what I mean a
very limited version of regexp engine, does handle this without
backtracting. The trick is doing a bfs style thing that progresses
all the possible states at same time, until been narrowed done to one path
or reach the end of input.

The rationale is plan for the worst case, "In an NFA with n nodes,
there can only be n reachable states at any step, but there might be
2n paths through the NFA."

So for "a.*b" the NFA is

       +----------------+
       |                |
(a) ---@---> (.) ---+   +---> [b]
       ^            |
       +------------+

if the input is "abcdefbghi", when the second char "b" is accepted the
NFA tracks both the state (.) and the end state [b] with the accepted length.
After the second "b" is reached the previous state [b] is invalidated
and updated with a new length and the (.) state is still progressed until the
end of string, leaving the longest match "abcdefbg".

Of course back reference cannot be supported by this way and they are
also not "regular", not even non-greedy matches are supported.[1]

That aside, sed(1) is still Turing complete.

[1]: https://swtch.com/~rsc/regexp/regexp1.html

I'm not in favor of Thompson NFA nor backtracking based implementation
though, and I think XQuery/XSLT can largely benefit from a powerful
regexp engine.

> In a way the nearest thing to iXML might actually be Perl 6 regular
> expressions, that have an explicit grammar notation available. So i
> wonder if thereĸs debugging experience there.

https://www.txl.ca would be the closest I can think, which already has
a tokenizer applied before parsing so it handles many whitespace
insensitive programming languages, a transform language that I think
strikingly close resembles XSLT, allows ambiguity in grammar, and
resolves that by using type directed pattern matching.

> For something small, or for picking small pieces of data out of longer
> text, regular expressions are a win, because they are terse and because
> of .* (and .+ and .*? and friends) which means in effect you can write
> a partial grammar and rely on the regexp engine doing backtracking.

Agree. When I develop ixml markup, I find my self more often exploit the
ability to do backtracking.

Best,
ldbeth

Received on Tuesday, 28 January 2025 23:07:12 UTC