Re: Line/string triage? from Fredrik Öhrström on 2025-03-21 (public-ixml@w3.org from March 2025)

From: Fredrik Öhrström <oehrstroem@gmail.com>
Date: Fri, 21 Mar 2025 11:18:42 +0100
To: Steven Pemberton <steven.pemberton@cwi.nl>
Cc: public-ixml@w3.org
Message-ID: <CALZT+jB=LakG=AHgDURoaAQnCwPhWcXMrT30T50yBECCQLjzug@mail.gmail.com>

I see another nice use case for a not construct:

identifier: [L]+, ![L].

This would be useful for terminating an identifier only at the end of the
run of letters. The ![L] does not consume any input and does not generate
any DOM/XML; it merely checks the next character.
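
As a rough analogy only (this is Python, not ixml, and it assumes the
proposed construct would behave like a zero-width negative lookahead), the
same check can be sketched with a regular expression. ASCII letters stand in
for the Unicode class L, and the function name is made up for illustration:

```python
import re

# Analogy only: (?![A-Za-z]) is a zero-width negative lookahead standing in
# for the proposed ![L]. The non-greedy +? would stop after one letter on its
# own; the lookahead forces the match to run to the end of the letters.
WITHOUT_NOT = re.compile(r"[A-Za-z]+?")
WITH_NOT = re.compile(r"[A-Za-z]+?(?![A-Za-z])")


def leading_identifier(pattern, text):
    """Return (matched text, end offset) for a match at the start of text."""
    m = pattern.match(text)
    return (m.group(0), m.end()) if m else None


print(leading_identifier(WITHOUT_NOT, "floor, x"))
print(leading_identifier(WITH_NOT, "floor, x"))
```

The second call stops at offset 5, so the comma is checked but not consumed.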

When parsing programming languages, an identifier can be followed by a
space, comma, minus, brace, etc., so you cannot explicitly end the
identifier within its own rule. This causes an ambiguity problem (or at
least an overburdened Earley parser), since a single identifier "floor"
would generate many possible splits into identifiers: "f" "loor",
"fl" "oor", "flo" "or", etc., down to "f" "l" "o" "o" "r".

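To put a number on that, here is a small Python sketch (not ixml). It
assumes a grammar in which identifiers may follow each other with no
mandatory separator, which is my reading of the example above, and the
function name count_splits is made up for illustration. It counts the ways a
run of letters can be read as one or more identifiers, with and without the
"next character is not a letter" check:

```python
from functools import lru_cache


def count_splits(word, lookahead=False):
    """Count the ways to read word as a sequence of one or more identifiers."""
    n = len(word)

    @lru_cache(maxsize=None)
    def ways(start):
        if start == n:
            return 1  # reached the end of the input: one complete reading
        total = 0
        for end in range(start + 1, n + 1):
            # With the lookahead, an identifier may only stop where the next
            # character is not a letter (or at the end of the input).
            if lookahead and end < n and word[end].isalpha():
                continue
            total += ways(end)
        return total

    return ways(0)


print(count_splits("floor"))                  # 16
print(count_splits("floor", lookahead=True))  # 1
```

Without the check the count doubles with every extra letter; with it there
is exactly one reading.
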
//Fredrik

On Fri, 21 Mar 2025 at 09:22, Steven Pemberton <
steven.pemberton@cwi.nl> wrote:

>
> Clearly a use case for the proposed 'not' construction...
>
>         chapterline: "Chapter ", ~[#a]+.
>         nonchapter: !"Chapter ",  ~[#a]+.
>
> Steven
>
> On Friday 21 March 2025 04:05:45 (+01:00), David Birnbaum wrote:
>
> > Dear ixml list,
> >
> > Is there an ixml idiom for distinguishing lines of input according to
> > whether they do vs. do not begin with a specific multicharacter pattern?
> > For example, given consecutive lines, some of which begin with the string
> > “Chapter ”, I’d like to recognize <chapterLine> and <nonChapterLine>
> > elements. The former are easy, but I struggle to define a pattern that
> > says “sequence of characters only when the first eight are not the string
> > ‘Chapter ’”. That is, I don’t know how to match a nonChapterLine without
> > also possibly matching a chapterLine. I can use ~["C"] to say “any
> > *single* character that isn’t ‘C’”, but a nonChapterLine could begin with
> > “C” (or, for that matter, with “Ch” or “Cha”, etc.). Assuming I can be
> > confident that a nonChapterLine cannot begin with the eight-character
> > sequence “Chapter ”, is it possible to construct an unambiguous grammar
> > that will distinguish the two types of lines?
> >
> > I can think of workarounds, such as tagging only chapter lines and then
> > managing the untagged stuff between them in a separate pipeline step. But
> > is it possible to tag all lines of either type unambiguously with just a
> > single ixml grammar? Thank you for any suggestions.
> >
> > Sincerely,
> >
> > David
> >
>
>

Received on Friday, 21 March 2025 10:19:14 UTC