Re: generators (was:) Re: QT4CG meeting 148 draft agenda, 13 January 2026

Hi Liam, hi Dimitre,

Thanks for your comprehensive work on the generator use cases.

Do you think it would be possible to provide an example implementation of one of the simpler use cases, based on the existing XQuery generator implementation? It might help us get a better idea of how the proposed concept compares with the existing functionality (folds, the while functions, or recursive functions) and with the recently added proposal for a single fn:generate function [1].

Best,
Christian

[1] https://qt4cg.org/pr/2350/xpath-functions-40/Overview.html#func-generate


________________________________
From: Dimitre Novatchev <dnovatchev@gmail.com>
Sent: Tuesday, 13 January 2026 02:27
To: Liam R. E. Quin <liam@fromoldbooks.org>
Cc: public-xslt-40@w3.org <public-xslt-40@w3.org>
Subject: Re: generators (was:) Re: QT4CG meeting 148 draft agenda, 13 January 2026

In addition to the nice work by Liam, here is a complete use-case, as was asked by the group.
The three cases described by Liam are also part of the processing workflow in the use case below.

This was posted here:

https://github.com/qt4cg/qtspecs/issues/2380


________________________________________________

In response to:
QT4CG-147-02: NW to chase up DN and LQ about follow-up to the generator discussion

________________________________
Use Case: News Feeds Aggregation Using Generators
Contents
Use Case: News Feeds Aggregation Using Generators

  *   Actors
  *   Goals
  *   Functional Requirements
  *   Constraints / Assumptions / Preconditions
  *   Proposed High-Level Solution
  *   Known Approaches that are Problematic
  *   Benefits of the Generators Approach
  *   End-to-End Flow
     *   Brief Description of the Core Processes in the Pipeline
     *   Notes on the Process Pipeline
  *   Why This Fits the Generator Datatype Extremely Well
  *   Alternative Flows
     *   Alternative Flow-1: A Feed Temporarily Stops Producing New Items
     *   Alternative Flow-2: Partial Consumption of the Pipeline
     *   Alternative Flow-3: Editor Inserts or Reorders Items
  *   Exception Flows
     *   Exception Flow-1: Feed Unreachable or Network Failure
     *   Exception Flow-2: Malformed Feed Data
     *   Exception Flow-3: Resource Exhaustion Risk
  *   Postconditions
  *   References

________________________________
The Problem

Modern RSS/JSON aggregators must process hundreds of continuously updating feeds without excessive memory usage or latency, while supporting filtering, merging, and prioritization in real time.

________________________________
Actors

  *   End-User
  *   Editor
  *   Administrator
  *   System components (internal processes acting as secondary actors)
  *   External services (RSS providers, APIs, social signals)

________________________________
Goals

  *   End-User
“As a user, I want to get the latest, up-to-the-minute news from many important sources. I want each brief news item to be presented with a link to more detailed information from the original source.”

  *   Editor
“As an editor, I want to be alerted to any change in the aggregated news-stream, as it happens continuously, and to have powerful ways of inserting, reordering, appending, prepending or deleting one or more news-items.”

  *   Administrator
“As an administrator, I want to start, stop, or restart the system, manage the configured feeds, and monitor operational health and error conditions.”

________________________________
Functional Requirements

  *   Consume RSS / Atom / JSON-LD feeds incrementally
  *   Filter items by topic or sensitivity
  *   Merge multiple feeds chronologically
  *   Produce continuously updated summaries

________________________________
Constraints / Assumptions / Preconditions
Assumptions

  *   Feeds may be large or unbounded
  *   Items arrive over time

Constraint

  *   Memory usage must remain bounded

Preconditions

  *   At least one news feed is configured
  *   Feeds are RSS, Atom, or JSON-LD and are timestamped
  *   Items within a feed are presented in reverse-chronological order
  *   Each item contains a content link or, optionally, inline content
  *   Items may belong to multiple categories

________________________________
Proposed High-Level Solution

Each feed is modeled as a generator producing yield values lazily.
The ordered set of values produced by successive, demand-driven calls to move-next() is called the yield of the generator.

A generator’s yield may be finite or infinite, and may be empty for a given generator instance without implying exhaustion of the underlying data source.
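
The model above can be sketched in Python, purely as an illustration. Python generators stand in for the proposed generator datatype, the built-in next() plays the role of move-next(), and the names feed_generator and poll are hypothetical, not part of any proposal.

```python
from typing import Callable, Iterator, List

def feed_generator(poll: Callable[[], List[str]], cycles: int) -> Iterator[str]:
    """Lazily yield items from a feed, one polling cycle at a time.

    `poll` is a hypothetical data provider that returns the items that are
    new since the previous call; it may return an empty list.
    """
    for _ in range(cycles):
        # An empty batch contributes nothing to the yield, but the
        # generator is not exhausted: a later cycle may produce items.
        for item in poll():
            yield item

# A fake provider whose second polling cycle has no new items.
batches = iter([["a", "b"], [], ["c"]])
gen = feed_generator(lambda: next(batches), cycles=3)

first = next(gen)   # demand-driven: only the first batch has been polled
rest = list(gen)    # drain the remaining yield
```

Here the empty second batch simply adds nothing to the yield; the third batch is still delivered.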

________________________________
Known Approaches That Are Problematic

These approaches require full materialization in memory:

  *   Eager sequences (XPath)
  *   DOM-style loading
  *   Materialized feeds

________________________________
Benefits of the Generators Approach

  *   Bounded memory usage
  *   Low latency
  *   Composability
  *   Deterministic control of evaluation

________________________________
End-to-End Flow

+-------------------------------+
| 1. Feed Fetching              |
| Input:  external providers    |
| Output: G_rawItems            |
+---------------+---------------+
                |
+---------------v---------------+
| 2. Normalization              |
| Input:  G_rawItems            |
| Output: G_normalizedItems     |
+---------------+---------------+
                |
+---------------v---------------+
| 3. Filtering                  |  <-- unwanted content removed
| Input:  G_normalizedItems     |
| Output: G_filteredItems       |
+---------------+---------------+
                |
+---------------v---------------+
| 4. Topic Classification       |
| Input:  G_filteredItems       |
| Output: G_classifiedItems     |
+---------------+---------------+
                |
+---------------v---------------+
| 5. Clustering                 |
| Input:  G_classifiedItems     |
| Output: G_clusteredItems      |
+---------------+---------------+
                |
+---------------v---------------+
| 6. Ranking                    |
| Input:  G_clusteredItems      |
| Output: G_rankedItems         |
+---------------+---------------+
                |
+---------------v---------------+
| 7. Summary Page Generation    |
| Input:  G_rankedItems         |
| Output: G_summaryPageItems,   |
|         HTML                  |
+---------------+---------------+
                |
+---------------v---------------+
| 8. Detail Page Generation     |
| Input:  G_summaryPageItems    |
| Output: HTML Detail Pages     |
+-------------------------------+


Remarks

  1.  The participating generator instances are named using the convention G_{name}.
  2.  Every stage except the final one produces a new generator.
  3.  Every stage except the very first uses a generator as its input.
  4.  Arrow semantics: the output generator of one stage is the input for the next stage.

________________________________
Brief Description of the Core Processes in the Pipeline
Process 1 — Feed Fetching & Acquisition

Goal:
Continuously pull RSS / Atom / JSON-LD feeds from CNN, Fox, NBC, BBC, etc.

Includes:

  *   Periodic polling (e.g., every 5 minutes)
  *   Detection of new items (GUID, URL hash, published timestamps)
  *   N-way merging to ensure the resulting yield is sorted in reverse-chronological order
  *   Basic sanity validation (e.g., XML schema validity)

Output:
A generator whose yield values are raw feed items (XML / JSON documents) → input to Process 2.
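
The N-way merge step might look like the following Python sketch. It assumes each feed is already sorted newest-first (as stated in the preconditions) and that an item is a hypothetical (timestamp, title) pair; heapq.merge from the standard library does the lazy merging.

```python
import heapq
from typing import Iterable, Iterator, Tuple

Item = Tuple[int, str]  # (published timestamp, title): a hypothetical item shape

def merge_feeds(*feeds: Iterable[Item]) -> Iterator[Item]:
    """Lazily merge feeds that are each sorted newest-first."""
    # heapq.merge keeps only one pending item per feed, so memory is
    # proportional to the number of feeds, not the number of items.
    return heapq.merge(*feeds, key=lambda item: item[0], reverse=True)

cnn = iter([(50, "cnn-2"), (10, "cnn-1")])
bbc = iter([(40, "bbc-2"), (20, "bbc-1")])
merged = [title for _, title in merge_feeds(cnn, bbc)]
```

The merged result is itself an iterator, so it can be handed to the next stage without being materialized.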

________________________________
Process 2 — Parsing & Normalization

Goal:
Convert heterogeneous raw feed items into a uniform internal format.

Normalized fields include:

  *   Title
  *   Description / Summary
  *   Full text (if available)
  *   URL
  *   Publication time (converted to UTC)
  *   Source
  *   Images, categories, tags
  *   Named entities (optional NLP-based enrichment)

Output:
A generator yielding clean, normalized NewsItem documents → input to Process 3.
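
As one small piece of normalization, the conversion of publication times to UTC could be sketched in Python as below. The function name to_utc and the rule that a timestamp without an offset is treated as UTC are assumptions made for this illustration only.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def to_utc(published: str) -> datetime:
    """Parse an RFC 822 style date (RSS) or an ISO 8601 date into UTC."""
    try:
        parsed = parsedate_to_datetime(published)
    except (TypeError, ValueError):
        # Not an RFC 822 style date: fall back to ISO 8601.
        parsed = datetime.fromisoformat(published)
    if parsed.tzinfo is None:
        # Assumption for this sketch: a missing offset means UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

rss_time = to_utc("Tue, 13 Jan 2026 02:27:00 +0100")
iso_time = to_utc("2026-01-13T02:27:00+01:00")
```

Both inputs denote the same instant, so both calls return the same UTC value.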

________________________________
Process 3 — Content Filtering & Exclusion Rules

Goal:
Remove unwanted items early using configurable rule sets.

Examples:

  *   Blocked topics: politics, celebrity gossip, violence, etc.
  *   Blocked entities: Donald Trump, Joe Biden, Kanye West, etc.
  *   Blocked publishers (optional)
  *   Expiration rules:
     *   Tech news stale after 48 hours
     *   Breaking news stale after 6 hours

Techniques:

  *   Keyword filtering
  *   Named Entity Recognition (NER)
  *   Sensitive-topic classifiers (ML-based)
  *   Freshness scoring

Output:
A generator yielding allowed, filtered NewsItem documents → input to Process 4.
Rejected items are stored separately for auditing.
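
A minimal Python sketch of this stage follows. The rule tables, the field names topic and age_hours, and the in-memory audit list are all hypothetical; a real system would more likely write rejected items to external storage.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical rule set; the staleness limits mirror the examples above.
BLOCKED_TOPICS = {"politics", "celebrity gossip"}
MAX_AGE_HOURS = {"tech": 48, "breaking": 6}

def filter_items(items: Iterable[Dict], rejected: List[Dict]) -> Iterator[Dict]:
    """Yield allowed items lazily; append rejected ones to `rejected`."""
    for item in items:
        limit = MAX_AGE_HOURS.get(item["topic"])
        stale = limit is not None and item["age_hours"] > limit
        if item["topic"] in BLOCKED_TOPICS or stale:
            rejected.append(item)  # kept separately for auditing
        else:
            yield item

audit: List[Dict] = []
stream = iter([
    {"title": "New chip", "topic": "tech", "age_hours": 3},
    {"title": "Old chip", "topic": "tech", "age_hours": 72},
    {"title": "Election", "topic": "politics", "age_hours": 1},
])
allowed = [item["title"] for item in filter_items(stream, audit)]
```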

________________________________
Process 4 — Topic Classification

Goal:
Assign each item to one or more topics.

Example topics:

  *   Politics
  *   World
  *   Tech
  *   Health
  *   Sports
  *   Business
  *   Disasters / Urgent events
  *   Crime / Safety
  *   Entertainment

Approaches:

  *   Fine-tuned BERT classifier (preferred)
  *   TF-IDF + SVM (simpler)
  *   Feed-provided category tags (fallback)

Output:
A generator yielding categorized NewsItem documents → input to Process 5.

________________________________
Process 5 — Similarity Analysis & Clustering

Goal:
Group news items from different sources describing the same event.

Techniques:

  *   Semantic vector embeddings (e.g., SBERT, Ada embeddings)
  *   Cosine similarity
  *   Hierarchical clustering or DBSCAN

Produces:

  *   Clusters of highly similar articles
  *   A primary (best) representative per cluster

Output:
A generator yielding clusters of related articles → input to Process 6.

Note:
To better match streaming behavior, clustering may operate within bounded windows (e.g., sliding windows) while still consuming the input generator.
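
The bounded-window idea in this note can be sketched in Python as follows. Only the windowing is shown; the clustering applied to each window is left out, and the name sliding_windows is hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

def sliding_windows(items: Iterable[str], size: int) -> Iterator[Tuple[str, ...]]:
    """Yield every full window of `size` consecutive items.

    At most `size` items are held in memory at any time, even if the
    input generator is unbounded.
    """
    window: Deque[str] = deque(maxlen=size)
    for item in items:
        window.append(item)  # the oldest item drops out automatically
        if len(window) == size:
            yield tuple(window)

windows = list(sliding_windows(iter("abcd"), 3))
```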

________________________________
Process 6 — Ranking, Urgency, and Freshness Scoring

Goal:
Prioritize which news appears on the Summary Page.

Computed scores:

  *   Freshness score (more recent → higher)
  *   Urgency score (disasters, crises, violence)
  *   Coverage score (number of sources reporting)
  *   Engagement score (optional: social signals)

Weighted formula:

FinalScore = a*Urgency + b*Freshness + c*Coverage + d*EditorRules

Items with the highest scores per topic are selected.

This stage does not require a full total ordering; instead a partial ordering (e.g., top-K per topic) preserves bounded memory.

Editor-driven operations (insert, remove, reorder) are modeled as generator transformations applied downstream of ranking.

Output:
A generator yielding ranked clusters → input to Process 7.
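
The weighted formula and the top-K selection per topic might be combined as in the Python sketch below. The weight values, the tuple shape of a cluster, and the function names are assumptions for illustration. The sketch consumes a finite input (for example, one bounded window of the stream) and keeps at most k entries per topic.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical weights standing in for a, b, c, d in the formula above.
WEIGHTS = {"urgency": 0.4, "freshness": 0.3, "coverage": 0.2, "editor": 0.1}

Cluster = Tuple[str, str, Dict[str, float]]  # (topic, title, component scores)

def final_score(scores: Dict[str, float]) -> float:
    """Weighted sum of the component scores."""
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())

def top_k_per_topic(clusters: Iterable[Cluster], k: int) -> Dict[str, List[str]]:
    """Keep only the k best titles per topic while consuming `clusters` once."""
    heaps: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for topic, title, scores in clusters:
        entry = (final_score(scores), title)
        heap = heaps[topic]
        if len(heap) < k:
            heapq.heappush(heap, entry)
        else:
            # Push the new entry, then drop the lowest-scoring one.
            heapq.heappushpop(heap, entry)
    return {topic: [title for _, title in sorted(heap, reverse=True)]
            for topic, heap in heaps.items()}

clusters = iter([
    ("tech", "A", {"urgency": 0, "freshness": 1, "coverage": 0, "editor": 0}),
    ("tech", "B", {"urgency": 1, "freshness": 1, "coverage": 0, "editor": 0}),
    ("tech", "C", {"urgency": 0, "freshness": 0, "coverage": 1, "editor": 0}),
    ("world", "D", {"urgency": 1, "freshness": 0, "coverage": 0, "editor": 0}),
])
best = top_k_per_topic(clusters, k=2)
```

With these made-up scores, cluster C is the lowest-scoring tech entry and is dropped once the tech heap holds two entries.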

________________________________
Process 7 — Summary Page Generation

This stage consumes the input generator and produces finite views intended for presentation.

Goal:
Build a continuously updated Summary Page (“Front Page”) containing:

  *   Top events per topic
  *   Short summaries
  *   Links to primary articles
  *   “Read similar news” (cluster siblings)
  *   Source icons
  *   Timestamp of most recent update

The page auto-refreshes and always reflects the newest items.

________________________________
Process 8 — Detailed Pages & Cross-Links

This stage consumes its input generator and produces finite presentation views.

For each cluster:

  *   Canonical article (primary representative)
  *   Related articles across sources
  *   Timeline of developments
  *   Additional metadata (images, entities, tags)

Cross-links include:

  *   “More like this…”
  *   “Earlier developments…”
  *   “Follow-up stories…”

________________________________
Notes on the Process Pipeline

  *   Feed Fetching typically wraps one or more data providers
→ produces G_rawItems lazily (RSS, JSON APIs, DB cursors, web services)
  *   Every stage is expressible as:
     *   for-each, filter, append, prepend, insert-at, remove-where, concat, fold, and similar operations, each producing a new generator derived from the previous one
  *   No stage requires full materialization unless explicitly demanded
(e.g., to-array, bounded sort, pagination)
  *   Infinite generators are valid until stage 6; stages 7–8 typically consume finite prefixes (take(n))

________________________________
Why This Fits the Generator Datatype Extremely Well

  *   The pipeline is a composition of generator transformers
  *   Each box maps almost 1-to-1 to generator operations
  *   External data providers integrate naturally at Stage 1
  *   Sorting can be introduced in different ways:
     *   External merge-sort over generators
     *   Bounded-window ranking
     *   Top-K lazy ranking (e.g., using heaps)

________________________________
Alternative Flows
Alternative Flow 1 — Feed Temporarily Stops Producing New Items

Condition:
A feed is reachable but has no new items since the last polling cycle.

Flow:

  1.  The feed generator advances (move-next()).
  2.  The data provider returns no new items.
  3.  The feed-generator instance yields no items during this interval.
  4.  Downstream generators remain operational.
  5.  If all feeds are empty, no new items are added downstream.

Result:
The pipeline continues uninterrupted; no special handling is required.

________________________________
Alternative Flow 2 — Partial Consumption of the Pipeline

Condition:
Only a finite prefix of the stream is required (e.g., top N items).

Flow:

  1.  Downstream consumers apply take(N).
  2.  Upstream generators are evaluated only as needed.
  3.  Remaining potential yield values are never materialized.

Result:
Latency and memory usage remain bounded. The pipeline supports early termination naturally.
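
The effect of take(N) can be illustrated in Python with itertools.islice, which plays the same role here. The generator raw_items and the list pulled are hypothetical and exist only to record how much of the unbounded upstream was actually demanded.

```python
from itertools import count, islice

pulled = []

def raw_items():
    """An unbounded upstream generator that records what was demanded."""
    for n in count(1):
        pulled.append(n)
        yield n

def normalize(items):
    """A downstream stage, itself lazy."""
    return (f"item-{n}" for n in items)

# Take a finite prefix of an infinite pipeline.
top3 = list(islice(normalize(raw_items()), 3))
```

Only three upstream items are ever produced, even though raw_items never terminates on its own.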

________________________________
Alternative Flow 3 — Editor Inserts or Reorders Items

Condition:
An editor manually modifies the aggregated stream.

Flow:

  1.  Editor operations are applied as generator transformations
(append, prepend, insert-at, remove-at, remove-where).
  2.  A new generator with the modified yield is produced.
  3.  Downstream stages consume it transparently.

Result:
Editorial control integrates seamlessly without breaking the pipeline.
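
A few of these editor operations could be sketched as Python generator transformations, as below. The function names follow the operation names used above, but their signatures are assumptions, and positions are zero-based here as is usual in Python.

```python
from itertools import chain
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")

def prepend(items: Iterable[T], new: T) -> Iterator[T]:
    """Yield `new` first, then everything from `items`."""
    return chain([new], items)

def insert_at(items: Iterable[T], index: int, new: T) -> Iterator[T]:
    """Insert `new` before the item at zero-based `index`."""
    position = -1
    for position, item in enumerate(items):
        if position == index:
            yield new
        yield item
    if index > position:
        yield new  # index beyond the end: append instead

def remove_where(items: Iterable[T], predicate: Callable[[T], bool]) -> Iterator[T]:
    """Drop every item for which `predicate` is true."""
    return (item for item in items if not predicate(item))

ranked = iter(["storm", "gossip", "election"])
edited = prepend(
    insert_at(remove_where(ranked, lambda title: title == "gossip"), 1, "editor pick"),
    "breaking",
)
result = list(edited)
```

Each operation returns a new lazy iterator, so downstream stages consume the edited stream exactly as they would the original.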

________________________________
Exception Flows
Exception Flow 1 — Feed Unreachable or Network Failure

Condition:
A feed cannot be reached during polling.

Flow:

  1.  The data provider reports an error or timeout.
  2.  The next instance of the feed generator yields no items during this polling interval.
  3.  The error is logged for monitoring.
  4.  A retry policy (e.g., exponential backoff) is applied.

Result:
The system continues operating with remaining feeds.
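
This flow can be sketched in Python as follows. The names resilient_feed and flaky_poll, the use of OSError to signal a network failure, and the delay values are assumptions for illustration; the sleep function is passed in so that the example can record the waits instead of actually pausing.

```python
import logging
import time
from typing import Callable, Iterator, List

log = logging.getLogger("feeds")

def resilient_feed(poll: Callable[[], List[str]], cycles: int,
                   sleep: Callable[[float], None] = time.sleep,
                   base_delay: float = 1.0,
                   max_delay: float = 60.0) -> Iterator[str]:
    """Yield feed items; a failed polling cycle yields nothing."""
    delay = base_delay
    for _ in range(cycles):
        try:
            items = poll()
        except OSError as err:
            log.warning("feed unreachable: %s; waiting %.0fs", err, delay)
            sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff
            continue  # this cycle contributes no items
        delay = base_delay  # success resets the backoff
        yield from items

responses = iter([["a"], OSError("timeout"), OSError("timeout"), ["b"]])

def flaky_poll() -> List[str]:
    response = next(responses)
    if isinstance(response, Exception):
        raise response
    return response

waits: List[float] = []
items = list(resilient_feed(flaky_poll, cycles=4, sleep=waits.append))
```

The two failed cycles produce no items and double the wait each time; the pipeline resumes as soon as the feed responds again.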

________________________________
Exception Flow 2 — Malformed Feed Data

Condition:
A feed item is malformed (invalid XML/JSON or schema validation problems, e.g. missing required fields).

Flow:

  1.  The normalization stage detects the issue.
  2.  The item is discarded or quarantined.
  3.  Processing continues with subsequent items.

Result:
Malformed data does not propagate downstream.
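
For JSON feed items, the discard-or-quarantine step might look like this Python sketch. The list of required fields and the in-memory quarantine list are hypothetical; XML items would need an analogous check with an XML parser.

```python
import json
from typing import Dict, Iterable, Iterator, List

REQUIRED = ("title", "url", "published")  # hypothetical required fields

def normalize(raw_items: Iterable[str], quarantine: List[str]) -> Iterator[Dict]:
    """Yield well-formed items; move malformed ones to `quarantine`."""
    for raw in raw_items:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            quarantine.append(raw)  # not valid JSON at all
            continue
        if not isinstance(doc, dict) or any(field not in doc for field in REQUIRED):
            quarantine.append(raw)  # valid JSON, but fields are missing
            continue
        yield {field: doc[field] for field in REQUIRED}

raw = [
    '{"title": "T", "url": "https://example.org/t", "published": "2026-01-13T02:27:00+01:00"}',
    '{"title": "no url"}',
    "<not json>",
]
bad: List[str] = []
good = list(normalize(iter(raw), bad))
```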

________________________________
Exception Flow 3 — Resource Exhaustion Risk

Condition:
A downstream operation risks exceeding memory limits.

Flow:

  1.  Bounded strategies (windowing, top-K selection) are applied.
  2.  Full materialization is avoided.
  3.  If needed, the operation degrades gracefully (e.g., reduced clustering depth).

Result:
System stability is preserved under load.

________________________________
Postconditions

Upon successful execution:

Functional Outcomes

  *   End users see an up-to-date Summary Page.
  *   Each summary item links to a Detailed Page.
  *   Editors can intervene using generator operations.
  *   Administrators retain full system control.

Technical Guarantees

  *   Memory usage remains bounded.
  *   Latency is minimized through lazy evaluation.
  *   Full materialization occurs only when explicitly requested.

System State

  *   All generators remain composable.
  *   Generator composition remains valid after alternative and exceptional flows.
  *   Empty generators correctly represent exhaustion of the generator instance, not necessarily of the underlying data source.
  *   Infinite yields are supported up to stages that require finiteness.

________________________________
References

  1.  RSS 2.0 Specification
https://www.rssboard.org/rss-specification


  2.  Atom Publishing Protocol (RFC 5023)
https://www.rfc-editor.org/rfc/rfc5023


  3.  JSON-LD Specification
https://json-ld.org/spec/


  4.  TF-IDF, “Understanding TF-IDF (Term Frequency-Inverse Document Frequency)”,
https://www.geeksforgeeks.org/machine-learning/understanding-tf-idf-term-frequency-inverse-document-frequency/


  5.  TF-IDF + SVM, “Strengthening Fake News Detection: Leveraging SVM and Sophisticated Text Vectorization Techniques. Defying BERT?”,
https://arxiv.org/html/2411.12703v1


  6.  Sentence-BERT (SBERT)
Reimers, N. & Gurevych, I., 2019
https://arxiv.org/abs/1908.10084


  7.  Fine-tuned BERT, “Fine-tuning a BERT model”,
https://www.tensorflow.org/tfmodels/nlp/fine_tune_bert


  8.  Ada Embeddings (OpenAI)
Radford et al., 2021
https://arxiv.org/abs/2103.00020


  9.  Cosine Similarity
https://en.wikipedia.org/wiki/Cosine_similarity


  10. Hierarchical Clustering
https://en.wikipedia.org/wiki/Hierarchical_clustering


  11. DBSCAN
Ester et al., 1996
https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf



On Mon, Jan 12, 2026 at 5:11 PM Liam R. E. Quin <liam@fromoldbooks.org> wrote:
On Mon, 12 Jan 2026 11:21:03 +0000
Norm Tovey-Walsh <norm@saxonica.com> wrote:

>
> 2.1. PR #2350: 708 An alternative proposal for generators
>
>    See PR [37]#2350.
>

Thanks, i’ll try to be there, and i also enclose the beginnings of some
use cases (in case anyone has time to read them, sorry for short notice
here)

For me, it’s not that there is anything that can’t be done without
generators, since Dimitre has implemented them in qt4 for BaseX
already. And it might be that a much smaller, core, proposal is enough.

You can also implement regular expressions yourself in XPath, as an
extreme example, using codepoint and string manipulation, but you
seriously don’t want to.

Here, generators are small, and are a common paradigm. I don’t want to
overstate the case for them - an alternative might be to consider an
equivalent to xsl:iterate for XPath (and hence XQuery): a function form
that’s guaranteed to be optimized into a loop. But that’s lower level.

If we had a standard paradigm for writing generators, they would be
used for the random number generator generator. It's exactly the same
idea: you call random-number-generator($seed?) and you get back a
generator. You then use that generator repeatedly to get random numbers.

Generators are useful to abstract that idea; some examples:

1. Mutual Recursion

Mutual recursion is common e.g. in recursive descent parsers, where
multiple functions need to operate on the next token in the input,
possibly hiding things like macro expansion (e.g. parameter entities)
from the parser. So you make a get-next-token generator whose next()
function can go off into another file when needed.

2. Avoiding computation

Example: Take two documents with roughly the same content but not in
the same order, and use text similarity to match paragraphs.

Here, the similarity function is expensive to compute, but a simple
approximation is enough to classify paragraphs into obviously
dissimilar and possibly similar. So you want a next-most-likely
function, but you don't want to compute similarity for the whole
sequence in advance.

3. N-way merge

Example: make a sequence of up to _k_ sections from multiple documents
based on section title or keyword relevance; each document can contain
multiple sections, and each section must be included only once, based
on the topic sequence in the original.

The usual way to do this in “QT” is with a recursive function or
template that chooses a single element at each level, but having a
generator that can return several elements when appropriate can
considerably simplify the logic.

Of course, you can write this with helper functions (and i do, today).
None of this is about things you can’t do today, but only about things
that are tricky to get right, and bringing in familiar ideas from other
languages.

A fold-left-while, by analogy with take-while, might be a useful
addition for solving complex problems, but that goes back to
xsl:iterate too.

hope this helps,

liam

--
Liam Quin: Delightful Computing - Training and Consultancy in
XSLT / XML Markup / Typography / CSS / Accessibility / and more...
Outreach for the GNU Image Manipulation Program
Vintage art digital files - fromoldbooks.org



--
Cheers,
Dimitre Novatchev

Received on Tuesday, 13 January 2026 09:14:25 UTC