Re: generators (was:) Re: QT4CG meeting 148 draft agenda, 13 January 2026

In addition to the nice work by Liam, here is a complete use case, as
requested by the group.
The three cases described by Liam are also part of the processing
workflow in the use case below.

This was posted here:

https://github.com/qt4cg/qtspecs/issues/2380

________________________________________________

In response to:
QT4CG-147-02: NW to chase up DN and LQ about follow-up to the generator
discussion
------------------------------
Use Case: News Feeds Aggregation Using Generators

Contents

Use Case: News Feeds Aggregation Using Generators

   - Actors
   - Goals
   - Functional Requirements
   - Constraints / Assumptions / Preconditions
   - Proposed High-Level Solution
   - Known Approaches that are Problematic
   - Benefits of the Generators Approach
   - End-to-End Flow
      - Brief Description of the Core Processes in the Pipeline
      - Notes on the Process Pipeline
   - Why This Fits the Generator Datatype Extremely Well
   - Alternative Flows
      - Alternative Flow-1: A Feed Temporarily Stops Producing New Items
      - Alternative Flow-2: Partial Consumption of the Pipeline
      - Alternative Flow-3: Editor Inserts or Reorders Items
   - Exception Flows
      - Exception Flow-1: Feed Unreachable or Network Failure
      - Exception Flow-2: Malformed Feed Data
      - Exception Flow-3: Resource Exhaustion Risk
   - Postconditions
   - References

------------------------------
The Problem

Modern RSS/JSON aggregators must process hundreds of continuously updating
feeds without excessive memory usage or latency, while supporting
filtering, merging, and prioritization in real time.
------------------------------
Actors

   - End-User
   - Editor
   - Administrator
   - System components (internal processes acting as secondary actors)
   - External services (RSS providers, APIs, social signals)

------------------------------
Goals

   - End-User
   “As a user, I want to get the latest, up-to-the-minute news from many
   important sources. I want each brief news item to be presented with a link
   to more detailed information from the original source.”
   - Editor
   “As an editor, I want to be alerted to any change in the aggregated
   news-stream, as it happens continuously, and to have powerful ways of
   inserting, reordering, appending, prepending or deleting one or more
   news-items.”
   - Administrator
   “As an administrator, I want to start, stop, or restart the system,
   manage the configured feeds, and monitor operational health and error
   conditions.”

------------------------------
Functional Requirements

   - Consume RSS / Atom / JSON-LD feeds incrementally
   - Filter items by topic or sensitivity
   - Merge multiple feeds chronologically
   - Produce continuously updated summaries

------------------------------
Constraints / Assumptions / Preconditions

Assumptions

   - Feeds may be large or unbounded
   - Items arrive over time

Constraint

   - Memory usage must remain bounded

Preconditions

   - At least one news feed is configured
   - Feeds are RSS, Atom, or JSON-LD and timestamped
   - Items within a feed are presented in reverse-chronological order
   - Each item contains a content link or, optionally, inline content
   - Items may belong to multiple categories

------------------------------
Proposed High-Level Solution

Each feed is modeled as a generator producing yield values lazily.
The ordered set of values produced by successive, demand-driven calls to
move-next() is called the yield of the generator.
A generator’s yield may be finite or infinite, and may be empty for a given
generator instance without implying exhaustion of the underlying data
source.
------------------------------
Known Approaches That Are Problematic

These approaches require full materialization in memory:

   - Eager sequences (XPath)
   - DOM-style loading
   - Materialized feeds

------------------------------
Benefits of the Generators Approach

   - Bounded memory usage
   - Low latency
   - Composability
   - Deterministic control of evaluation

------------------------------
End-to-End Flow

+-------------------------------+
| 1. Feed Fetching              |
| Input:  external providers    |
| Output: G_rawItems            |
+---------------+---------------+
                |
+---------------v---------------+
| 2. Normalization              |
| Input:  G_rawItems            |
| Output: G_normalizedItems     |
+---------------+---------------+
                |
+---------------v---------------+
| 3. Filtering                  |  <-- unwanted content removed
| Input:  G_normalizedItems     |
| Output: G_filteredItems       |
+---------------+---------------+
                |
+---------------v---------------+
| 4. Topic Classification       |
| Input:  G_filteredItems       |
| Output: G_classifiedItems     |
+---------------+---------------+
                |
+---------------v---------------+
| 5. Clustering                 |
| Input:  G_classifiedItems     |
| Output: G_clusteredItems      |
+---------------+---------------+
                |
+---------------v---------------+
| 6. Ranking                    |
| Input:  G_clusteredItems      |
| Output: G_rankedItems         |
+---------------+---------------+
                |
+---------------v---------------+
| 7. Summary Page Generation    |
| Input:  G_rankedItems         |
| Output: G_summaryPageItems,   |
|         HTML                  |
+---------------+---------------+
                |
+---------------v---------------+
| 8. Detail Page Generation     |
| Input:  G_summaryPageItems    |
| Output: HTML Detail Pages     |
+-------------------------------+

Remarks

   1. The participating generator instances are named using the convention
   G_{name}.
   2. Every stage except the final one produces a new generator.
   3. Every stage except the very first uses a generator as its input.
   4. Arrow semantics: the output generator of one stage is the input for
   the next stage.
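
The arrow semantics in the remarks above can be sketched with Python's native generators, used here only as a stand-in for the proposed generator datatype. The stage functions (fetch, normalize, keep_even) and the item layout are hypothetical; only the G_{name} naming follows the diagram.

```python
from itertools import islice

def fetch():
    """Stage 1 stand-in: hypothetical raw items, newest first."""
    for n in range(100, 0, -1):
        yield {"id": n, "title": f"  Story {n} "}

def normalize(items):
    """Stage 2 stand-in: trim whitespace in titles."""
    for item in items:
        yield {**item, "title": item["title"].strip()}

def keep_even(items):
    """Stage 3 stand-in: an arbitrary filter rule (even ids only)."""
    return (item for item in items if item["id"] % 2 == 0)

# Arrow semantics: the output generator of one stage is the input of the next.
g_raw_items = fetch()
g_normalized_items = normalize(g_raw_items)
g_filtered_items = keep_even(g_normalized_items)

# Nothing has been computed so far; work happens only on demand.
top_two = list(islice(g_filtered_items, 2))
print(top_two)
# [{'id': 100, 'title': 'Story 100'}, {'id': 98, 'title': 'Story 98'}]
```

Each stage only holds a reference to its input generator, so adding or removing a stage does not change the others.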

------------------------------
Brief Description of the Core Processes in the Pipeline

Process 1 — Feed Fetching & Acquisition

Goal:
Continuously pull RSS / Atom / JSON-LD feeds from CNN, Fox, NBC, BBC, etc.

Includes:

   - Periodic polling (e.g., every 5 minutes)
   - Detection of new items (GUID, URL hash, published timestamps)
   - N-way merging to ensure the resulting yield is sorted in
   reverse-chronological order
   - Basic sanity validation (e.g., XML schema validity)

Output:
A generator whose yield values are raw feed items (XML / JSON documents) →
input to Process 2.
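
A minimal sketch of the N-way merge and new-item detection, assuming each feed already delivers its items newest first. Python's heapq.merge is used as a stand-in for a generator-level merge; the (timestamp, guid, title) tuple layout and the integer timestamps are assumptions made for illustration.

```python
import heapq

def merge_feeds(*feeds):
    """Lazily merge feeds that are each sorted newest-first.

    Items are (timestamp, guid, title) tuples; this layout is hypothetical.
    """
    seen = set()  # GUIDs already emitted; a real system would bound this set
    # reverse=True tells heapq.merge that the inputs are in descending order.
    for item in heapq.merge(*feeds, key=lambda it: it[0], reverse=True):
        guid = item[1]
        if guid in seen:      # duplicate of an item that was already emitted
            continue
        seen.add(guid)
        yield item

feed_a = iter([(30, "a3", "A third"), (10, "a1", "A first")])
feed_b = iter([(25, "b2", "B second"), (10, "a1", "A first (syndicated)")])

merged = list(merge_feeds(feed_a, feed_b))
print([guid for _, guid, _ in merged])  # ['a3', 'b2', 'a1']
```

The merge keeps only one pending item per feed in memory, which is what keeps the stage bounded regardless of feed length.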
------------------------------
Process 2 — Parsing & Normalization

Goal:
Convert heterogeneous raw feed items into a uniform internal format.

Normalized fields include:

   - Title
   - Description / Summary
   - Full text (if available)
   - URL
   - Publication time (converted to UTC)
   - Source
   - Images, categories, tags
   - Named entities (optional NLP-based enrichment)

Output:
A generator yielding clean, normalized NewsItem documents → input to
Process 3.
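
As an illustration of the normalization step, the sketch below maps a raw item to a uniform NewsItem and converts the publication time to UTC. The NewsItem fields shown are a subset of the list above, and the raw field names (title, link, published, source) are assumptions, not taken from any particular feed format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class NewsItem:
    title: str
    url: str
    published_utc: datetime
    source: str

def normalize(raw_items):
    """Lazily convert raw dicts (hypothetical layout) into NewsItem values."""
    for raw in raw_items:
        published = datetime.fromisoformat(raw["published"])
        yield NewsItem(
            title=raw["title"].strip(),
            url=raw["link"],
            published_utc=published.astimezone(timezone.utc),
            source=raw["source"],
        )

raw = [{"title": " Flood warning ", "link": "https://example.org/1",
        "published": "2026-01-12T17:11:00+02:00", "source": "example"}]
item = next(normalize(iter(raw)))
print(item.title)                        # Flood warning
print(item.published_utc.isoformat())    # 2026-01-12T15:11:00+00:00
```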
------------------------------
Process 3 — Content Filtering & Exclusion Rules

Goal:
Remove unwanted items early using configurable rule sets.

Examples:

   - Blocked topics: politics, celebrity gossip, violence, etc.
   - Blocked entities: Donald Trump, Joe Biden, Kanye West, etc.
   - Blocked publishers (optional)
   - Expiration rules:
      - Tech news stale after 48 hours
      - Breaking news stale after 6 hours

Techniques:

   - Keyword filtering
   - Named Entity Recognition (NER)
   - Sensitive-topic classifiers (ML-based)
   - Freshness scoring

Output:
A generator yielding allowed, filtered NewsItem documents → input to
Process 4.
Rejected items are stored separately for auditing.
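
The filtering rules can be sketched as a generator that passes allowed items through and records rejected ones for auditing. The rule set, the item layout, and the use of a plain list as the audit store are assumptions; only the 48-hour and 6-hour expiration limits come from the text above.

```python
from datetime import datetime, timedelta, timezone

BLOCKED_TOPICS = {"celebrity gossip"}             # hypothetical rule set
MAX_AGE = {"tech": timedelta(hours=48),           # expiration rules from the text
           "breaking": timedelta(hours=6)}

def filter_items(items, now, rejected):
    """Yield allowed items; append (id, reason) to `rejected` for the rest."""
    for item in items:
        limit = MAX_AGE.get(item["topic"])
        if item["topic"] in BLOCKED_TOPICS:
            rejected.append((item["id"], "blocked topic"))
        elif limit is not None and now - item["published"] > limit:
            rejected.append((item["id"], "stale"))
        else:
            yield item

now = datetime(2026, 1, 13, 8, 0, tzinfo=timezone.utc)
items = [
    {"id": 1, "topic": "tech", "published": now - timedelta(hours=10)},
    {"id": 2, "topic": "breaking", "published": now - timedelta(hours=7)},
    {"id": 3, "topic": "celebrity gossip", "published": now},
]
rejected = []
allowed = [item["id"] for item in filter_items(iter(items), now, rejected)]
print(allowed)    # [1]
print(rejected)   # [(2, 'stale'), (3, 'blocked topic')]
```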
------------------------------
Process 4 — Topic Classification

Goal:
Assign each item to one or more topics.

Example topics:

   - Politics
   - World
   - Tech
   - Health
   - Sports
   - Business
   - Disasters / Urgent events
   - Crime / Safety
   - Entertainment

Approaches:

   - Fine-tuned BERT classifier (preferred)
   - TF-IDF + SVM (simpler)
   - Feed-provided category tags (fallback)

Output:
A generator yielding categorized NewsItem documents → input to Process 5.
------------------------------
Process 5 — Similarity Analysis & Clustering

Goal:
Group news items from different sources describing the same event.

Techniques:

   - Semantic vector embeddings (e.g., SBERT, Ada embeddings)
   - Cosine similarity
   - Hierarchical clustering or DBSCAN

Produces:

   - Clusters of highly similar articles
   - A primary (best) representative per cluster

Output:
A generator yielding clusters of related articles → input to Process 6.

Note:
To better match streaming behavior, clustering may operate within bounded
windows (e.g., sliding windows) while still consuming the input generator.
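
A rough sketch of windowed clustering over a generator follows. It deliberately swaps in bag-of-words cosine similarity and a greedy assignment for the SBERT embeddings and DBSCAN named above, so it shows only the bounded-window idea. The window size, threshold, and function names are hypothetical.

```python
import math
from collections import Counter, deque

def cosine(a, b):
    """Cosine similarity of two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster_windowed(titles, window=3, threshold=0.5):
    """Greedy clustering that keeps at most `window` open clusters."""
    open_clusters = deque()   # entries: (representative vector, member list)
    for title in titles:
        vec = Counter(title.lower().split())
        for rep_vec, members in open_clusters:
            if cosine(vec, rep_vec) >= threshold:
                members.append(title)
                break
        else:
            if len(open_clusters) == window:
                yield open_clusters.popleft()[1]   # evict the oldest cluster
            open_clusters.append((vec, [title]))
    while open_clusters:                           # input ended: flush the rest
        yield open_clusters.popleft()[1]

titles = ["Storm hits coast", "Storm hits the coast hard", "Markets rally today"]
clusters = list(cluster_windowed(iter(titles)))
print(clusters)
# [['Storm hits coast', 'Storm hits the coast hard'], ['Markets rally today']]
```

The first member of each cluster acts as its representative here; choosing a "best" representative would need a scoring rule that the sketch does not attempt.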
------------------------------
Process 6 — Ranking, Urgency, and Freshness Scoring

Goal:
Prioritize which news appears on the Summary Page.

Computed scores:

   - Freshness score (more recent → higher)
   - Urgency score (disasters, crises, violence)
   - Coverage score (number of sources reporting)
   - Engagement score (optional: social signals)

Weighted formula:

FinalScore = a*Urgency + b*Freshness + c*Coverage + d*EditorRules

Items with the highest scores per topic are selected.

This stage does not require a total ordering of all items; a partial
ordering (e.g., top-K per topic) is sufficient and keeps memory bounded.

Editor-driven operations (insert, remove, reorder) are modeled as generator
transformations applied downstream of ranking.

Output:
A generator yielding ranked clusters → input to Process 7.
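
The weighted formula and top-K selection per topic can be sketched with one small heap per topic, so memory stays proportional to K times the number of topics. The weights and the cluster layout are hypothetical values chosen for illustration.

```python
import heapq
from collections import defaultdict

# Hypothetical weights a, b, c, d for the formula above.
WEIGHTS = {"urgency": 0.4, "freshness": 0.3, "coverage": 0.2, "editor": 0.1}

def final_score(cluster):
    return sum(WEIGHTS[name] * cluster[name] for name in WEIGHTS)

def top_k_per_topic(clusters, k):
    """Consume the input once, keeping only the k best clusters per topic."""
    heaps = defaultdict(list)
    for n, cluster in enumerate(clusters):
        entry = (final_score(cluster), n, cluster)  # n breaks ties between scores
        heap = heaps[cluster["topic"]]
        if len(heap) < k:
            heapq.heappush(heap, entry)
        else:
            heapq.heappushpop(heap, entry)          # drop the current minimum
    return {topic: [c for _, _, c in sorted(heap, reverse=True)]
            for topic, heap in heaps.items()}

clusters = [
    {"id": "t1", "topic": "tech", "urgency": 0.1, "freshness": 0.9, "coverage": 0.2, "editor": 0.0},
    {"id": "t2", "topic": "tech", "urgency": 0.8, "freshness": 0.5, "coverage": 0.5, "editor": 0.0},
    {"id": "t3", "topic": "tech", "urgency": 0.2, "freshness": 0.2, "coverage": 0.1, "editor": 0.0},
    {"id": "w1", "topic": "world", "urgency": 0.9, "freshness": 0.9, "coverage": 0.9, "editor": 0.0},
]
top = top_k_per_topic(iter(clusters), k=2)
print({topic: [c["id"] for c in cs] for topic, cs in top.items()})
# {'tech': ['t2', 't1'], 'world': ['w1']}
```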
------------------------------
Process 7 — Summary Page Generation

This stage consumes the input generator and produces finite views intended
for presentation.

Goal:
Build a continuously updated Summary Page (“Front Page”) containing:

   - Top events per topic
   - Short summaries
   - Links to primary articles
   - “Read similar news” (cluster siblings)
   - Source icons
   - Timestamp of most recent update

The page auto-refreshes and always reflects the newest items.
------------------------------
Process 8 — Detailed Pages & Cross-Links

This stage consumes its input generator and produces finite presentation
views.

For each cluster:

   - Canonical article (primary representative)
   - Related articles across sources
   - Timeline of developments
   - Additional metadata (images, entities, tags)

Cross-links include:

   - “More like this…”
   - “Earlier developments…”
   - “Follow-up stories…”

------------------------------
Notes on the Process Pipeline

   - Feed Fetching typically wraps one or more data providers
   → produces G_rawItems lazily (RSS, JSON APIs, DB cursors, web services)
   - Every stage is expressible as:
      - for-each, filter, append, prepend, insert-at, remove-where, concat,
      or fold, etc., producing a new generator derived from the previous one
   - No stage requires full materialization unless explicitly demanded
   (e.g., to-array, bounded sort, pagination)
   - Infinite generators are valid until stage 6; stages 7–8 typically
   consume finite prefixes (take(n))

------------------------------
Why This Fits the Generator Datatype Extremely Well

   - The pipeline is a composition of generator transformers
   - Each box maps almost 1-to-1 to generator operations
   - External data providers integrate naturally at Stage 1
   - Sorting can be introduced in different ways:
      - External merge-sort over generators
      - Bounded-window ranking
      - Top-K lazy ranking – e.g. using heaps.

------------------------------
Alternative Flows

Alternative Flow 1 — Feed Temporarily Stops Producing New Items

Condition:
A feed is reachable but has no new items since the last polling cycle.

Flow:

   1. The feed generator advances (move-next()).
   2. The data provider returns no new items.
   3. The feed-generator instance yields no items during this interval.
   4. Downstream generators remain operational.
   5. If all feeds are empty, no new items are added downstream.

Result:
The pipeline continues uninterrupted; no special handling is required.
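
This flow can be sketched as one generator instance per polling cycle: an instance may yield nothing while the underlying source stays alive. The provider function and its snapshot data are hypothetical.

```python
def poll_cycle(provider, seen):
    """One generator instance per polling cycle; it may yield no items."""
    for item in provider():
        if item not in seen:
            seen.add(item)
            yield item

# Hypothetical provider: returns the feed's current content on each poll.
snapshots = iter([["a", "b"], ["a", "b"], ["a", "b", "c"]])

def provider():
    return next(snapshots)

seen = set()
cycles = [list(poll_cycle(provider, seen)) for _ in range(3)]
print(cycles)  # [['a', 'b'], [], ['c']]
```

The empty second cycle needs no special handling downstream; the third cycle simply resumes with the new item.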
------------------------------
Alternative Flow 2 — Partial Consumption of the Pipeline

Condition:
Only a finite prefix of the stream is required (e.g., top N items).

Flow:

   1. Downstream consumers apply take(N).
   2. Upstream generators are evaluated only as needed.
   3. Remaining potential yield values are never materialized.

Result:
Latency and memory usage remain bounded. The pipeline supports early
termination naturally.
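
A small sketch of partial consumption, with itertools.islice standing in for take(N). The stats counter exists only to make visible how many upstream items were actually produced; all names are hypothetical.

```python
from itertools import count, islice

stats = {"pulled": 0}   # counts items actually produced upstream

def endless_feed():
    """Hypothetical unbounded source."""
    for n in count(1):
        stats["pulled"] += 1
        yield n

def squares(items):
    """An arbitrary downstream transformation."""
    for n in items:
        yield n * n

top = list(islice(squares(endless_feed()), 5))   # the take(5) step
print(top)               # [1, 4, 9, 16, 25]
print(stats["pulled"])   # 5
```

Although the source is infinite, exactly five items are produced, one per downstream demand.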
------------------------------
Alternative Flow 3 — Editor Inserts or Reorders Items

Condition:
An editor manually modifies the aggregated stream.

Flow:

   1. Editor operations are applied as generator transformations
   (append, prepend, insert-at, remove-at, remove-where).
   2. A new generator with the modified yield is produced.
   3. Downstream stages consume it transparently.

Result:
Editorial control integrates seamlessly without breaking the pipeline.
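
The editor operations can be sketched as lazy transformations, each returning a new generator. The function names mirror the operations listed above, but their signatures, and the 0-based position used by insert_at, are assumptions made for this sketch.

```python
from itertools import chain, islice

def prepend(items, *new_items):
    """New generator: new_items first, then the original yield."""
    return chain(new_items, items)

def insert_at(items, position, new_item):
    """New generator with new_item at 0-based `position` (an assumption)."""
    it = iter(items)
    yield from islice(it, position)
    yield new_item
    yield from it

def remove_where(items, predicate):
    """New generator without the items matching predicate."""
    return (x for x in items if not predicate(x))

g_ranked = iter(["storm", "election", "gossip", "markets"])
g_edited = insert_at(
    remove_where(prepend(g_ranked, "editor pick"), lambda x: x == "gossip"),
    2, "correction")
print(list(g_edited))
# ['editor pick', 'storm', 'correction', 'election', 'markets']
```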
------------------------------
Exception Flows

Exception Flow 1 — Feed Unreachable or Network Failure

Condition:
A feed cannot be reached during polling.

Flow:

   1. The data provider reports an error or timeout.
   2. The next instance of the feed generator yields no items during this
   polling interval.
   3. The error is logged for monitoring.
   4. A retry policy (e.g., exponential backoff) is applied.

Result:
The system continues operating with remaining feeds.
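
The steps above can be sketched as a polling generator that yields nothing on failure, logs the error, and doubles the retry delay. The fetch callable, the state dictionary, and the delay values are hypothetical; the sketch computes the next delay but leaves the actual waiting to the caller.

```python
import logging

log = logging.getLogger("feeds")

def poll_once(fetch, state, base_delay=60, max_delay=3600):
    """One polling cycle: yields the fetched items, or nothing on failure."""
    try:
        items = fetch()
    except OSError as err:    # covers ConnectionError and TimeoutError
        state["failures"] += 1
        state["delay"] = min(base_delay * 2 ** (state["failures"] - 1), max_delay)
        log.warning("feed unreachable: %s; retry in %ss", err, state["delay"])
        return                # this cycle yields no items
    state["failures"] = 0
    state["delay"] = base_delay
    yield from items

# Hypothetical provider that fails twice, then succeeds.
outcomes = iter([TimeoutError("timed out"), TimeoutError("timed out"), ["item-1"]])

def flaky_fetch():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome

state = {"failures": 0, "delay": 60}
results = []
for _ in range(3):
    items = list(poll_once(flaky_fetch, state))
    results.append((items, state["delay"]))
print(results)  # [([], 60), ([], 120), (['item-1'], 60)]
```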
------------------------------
Exception Flow 2 — Malformed Feed Data

Condition:
A feed item is malformed (invalid XML/JSON or schema validation problems,
e.g. missing required fields).

Flow:

   1. The normalization stage detects the issue.
   2. The item is discarded or quarantined.
   3. Processing continues with subsequent items.

Result:
Malformed data does not propagate downstream.
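
A minimal sketch of this flow for JSON input: items that fail to parse or lack required fields are quarantined, and processing continues with the next item. The required field names and the list used as a quarantine store are assumptions.

```python
import json

REQUIRED = ("title", "link", "published")   # hypothetical required fields

def parse_items(raw_documents, quarantine):
    """Yield well-formed items; record (raw, reason) for malformed ones."""
    for raw in raw_documents:
        try:
            item = json.loads(raw)
        except json.JSONDecodeError as err:
            quarantine.append((raw, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(item, dict):
            quarantine.append((raw, "not a JSON object"))
            continue
        missing = [name for name in REQUIRED if name not in item]
        if missing:
            quarantine.append((raw, f"missing fields: {missing}"))
            continue
        yield item

docs = [
    '{"title": "Ok", "link": "https://example.org/1", "published": "2026-01-12"}',
    '{"title": "No link"}',
    '{not json',
]
quarantine = []
good = list(parse_items(iter(docs), quarantine))
print(len(good), len(quarantine))  # 1 2
```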
------------------------------
Exception Flow 3 — Resource Exhaustion Risk

Condition:
A downstream operation risks exceeding memory limits.

Flow:

   1. Bounded strategies (windowing, top-K selection) are applied.
   2. Full materialization is avoided.
   3. If needed, the operation degrades gracefully (e.g., reduced
   clustering depth).

Result:
System stability is preserved under load.
------------------------------
Postconditions

Upon successful execution:
Functional Outcomes

   - End users see an up-to-date Summary Page.
   - Each summary item links to a Detailed Page.
   - Editors can intervene using generator operations.
   - Administrators retain full system control.

Technical Guarantees

   - Memory usage remains bounded.
   - Latency is minimized through lazy evaluation.
   - Full materialization occurs only when explicitly requested.

System State

   - All generators remain composable.
   - Generator composition remains valid after alternative and exceptional
   flows.
   - An empty generator instance represents exhaustion of that instance
   only, not of the underlying data source.
   - Infinite yields are supported up to stages that require finiteness.

------------------------------
References

   1. RSS 2.0 Specification
   https://www.rssboard.org/rss-specification
   2. Atom Publishing Protocol (RFC 5023)
   https://www.rfc-editor.org/rfc/rfc5023
   3. JSON-LD Specification
   https://json-ld.org/spec/
   4. TF-IDF, “Understanding TF-IDF (Term Frequency-Inverse Document
   Frequency)”,
   https://www.geeksforgeeks.org/machine-learning/understanding-tf-idf-term-frequency-inverse-document-frequency/
   5. TF-IDF + SVM, “Strengthening Fake News Detection: Leveraging SVM and
   Sophisticated Text Vectorization Techniques. Defying BERT?”,
   https://arxiv.org/html/2411.12703v1
   6. Sentence-BERT (SBERT)
   Reimers, N. & Gurevych, I., 2019
   https://arxiv.org/abs/1908.10084
   7. Fine-tuned BERT, “Fine-tuning a BERT model”,
   https://www.tensorflow.org/tfmodels/nlp/fine_tune_bert
   8. Ada Embeddings (OpenAI)
   Radford et al., 2021
   https://arxiv.org/abs/2103.00020
   9. Cosine Similarity
   https://en.wikipedia.org/wiki/Cosine_similarity
   10. Hierarchical Clustering
   https://en.wikipedia.org/wiki/Hierarchical_clustering
   11. DBSCAN
   Ester et al., 1996
   https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf



On Mon, Jan 12, 2026 at 5:11 PM Liam R. E. Quin <liam@fromoldbooks.org>
wrote:

> On Mon, 12 Jan 2026 11:21:03 +0000
> Norm Tovey-Walsh <norm@saxonica.com> wrote:
>
> >
> > 2.1. PR #2350: 708 An alternative proposal for generators
> >
> >    See PR [37]#2350.
> >
>
> Thanks, i’ll try to be there, and i also enclose the beginnings of some
> use cases (in case anyone has time to read them, sorry for short notice
> here)
>
> For me, it’s not that there is anything that can’t be done without
> generators, since Dimitre has implemented them in qt4 for BaseX
> already. And it might be that a much smaller, core, proposal is enough.
>
> You can also implement regular expressions yourself in XPath, as an
> extreme example, using codepoint and string manipulation, but you
> seriously don’t want to.
>
> Here, generators are small, and are a common paradigm. I don’t want to
> overstate the case for them - an alternative might be to consider an
> equivalent to xsl:iterate for XPath (and hence XQuery): a function form
> that’s guaranteed to be optimized into a loop. But that’s lower level.
>
> If we had a standard paradigm for writing generators, they would be
> used for the random number generator generator. It’s exactly the same
> idea: you call random-number-generator($seed?) and you get back a
> generator. You then use that generator repeatedly to get random numbers.
>
> Generators are useful to abstract that idea; some examples:
>
> 1. Mutual Recursion
>
> Mutual recursion is common e.g. in recursive descent parsers, where
> multiple functions need to operate on the next token in the input,
> possibly hiding things like macro expansion (e.g. parameter entities)
> from the parser. So you make a get-next-token generator whose next()
> function can go off into another file when needed.
>
> 2. Avoiding computation
>
> Example: Take two documents with roughly the same content but not in
> the same order, and use text similarity to match paragraphs.
>
> Here, the similarity function is expensive to compute, but a simple
> approximation is enough to classify paragraphs into obviously
> dissimilar and possibly similar. So you want a next-most-likely
> function, but you don't want to compute similarity for the whole
> sequence in advance.
>
> 3. N-way merge
>
> Example: make a sequence of up to _k_ sections from multiple documents
> based on section title or keyword relevance; each document can contain
> multiple sections, and each section must be included only once, based
> on the topic sequence in the original.
>
> The usual way to do this in “QT” is with a recursive function or
> template that chooses a single element at each level, but having a
> generator that can return several elements when appropriate can
> considerably simplify the logic.
>
> Of course, you can write this with helper functions (and i do, today).
> None of this is about things you can’t do today, but only about things
> that are tricky to get right, and bringing in familiar ideas from other
> languages.
>
> A fold-left-while, by analogy with take-while, might be a useful
> addition for solving complex problems, but that goes back to
> xsl:iterate too.
>
> hope this helps,
>
> liam
>
> --
> Liam Quin: Delightful Computing - Training and Consultancy in
> XSLT / XML Markup / Typography / CSS / Accessibility / and more...
> Outreach for the GNU Image Manipulation Program
> Vintage art digital files - fromoldbooks.org
>
>

-- 
Cheers,
Dimitre Novatchev

Received on Tuesday, 13 January 2026 08:13:19 UTC