Line/string triage?

Dear ixml list,

Is there an ixml idiom for distinguishing lines of input according to whether they do or do not begin with a specific multicharacter pattern? For example, given consecutive lines, some of which begin with the string “Chapter ” (with a trailing space), I’d like to recognize <chapterLine> and <nonChapterLine> elements. The former are easy, but I struggle to define a pattern that says “a sequence of characters, but only when the first eight are not the string ‘Chapter ’”. That is, I don’t know how to match a nonChapterLine without also possibly matching a chapterLine. I can use ~["C"] to say “any *single* character that isn’t ‘C’”, but a nonChapterLine could begin with “C” (or, for that matter, with “Ch” or “Cha”, etc.). Assuming I can be confident that a nonChapterLine cannot begin with the eight-character sequence “Chapter ”, is it possible to construct an unambiguous grammar that will distinguish the two types of lines?
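
To make the difficulty concrete, here is a minimal sketch of the naive grammar I mean. The rule names "document" and "line" are placeholders chosen for illustration, and I am assuming every line ends with a single linefeed (#a). As far as I can tell, a line such as “Chapter 1” satisfies both alternatives, so the parse comes out ambiguous:

```
{ Naive sketch: rule names and the #a line terminator are assumptions. }
document: line+.
-line: chapterLine; nonChapterLine.

{ Matches only lines that begin with the literal string "Chapter ". }
chapterLine: "Chapter ", ~[#a]*, -#a.

{ Matches ANY line, including those that begin with "Chapter ",
  which is exactly the overlap I would like to eliminate. }
nonChapterLine: ~[#a]*, -#a.
```

Replacing the first character of nonChapterLine with ~["C"] removes the overlap, but it also wrongly rejects lines such as “Cats are nice”.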

I can think of workarounds, such as tagging only chapter lines and then managing the untagged stuff between them in a separate pipeline step. But is it possible to tag all lines of either type unambiguously with just a single ixml grammar? Thank you for any suggestions.

Sincerely,

David

Received on Friday, 21 March 2025 03:06:03 UTC