01 · EDITORIAL AUTOMATION · SHIPPED 2025

A newsroom that files itself

In a regulated vertical with high source density and a tiny attentive audience, a deterministic pipeline now files twenty-four articles a day without a human touching a single one.

~720 articles per month
24/day publishing cadence
99.4% pipeline uptime
0 human edits per piece

01 · CONTEXT

The vertical is regulated, technical, and unusually dense with primary sources. Hundreds of public feeds publish daily, most of them repeating each other with minor variations. The audience is small, sophisticated, and chronically behind. They want signal, not another inbox.

A newsroom staffed to cover that surface area honestly would need a dozen editors working in shifts. The economics don't support that headcount, and never will. The system was built to match the coverage a small editorial team would produce, without the team.

02 · WHY IT WAS WORTH BUILDING

Manual triage in a feed-heavy vertical is the worst kind of work. It is repetitive, time-sensitive, and mostly negative space, since the majority of incoming items are duplicates, off-topic, or low-value. The operator wanted the attentive audience served on a fixed cadence, which meant removing the human from the per-piece path.

The decision was not to build an assistant that speeds up editors. It was to build a publisher that runs without them, and to hold it to a standard the audience would accept on its own merits. That framing shaped every downstream choice.

03 · HOW IT WORKS

Source feeds are polled on a schedule and normalized into a single item shape. A taxonomy router classifies each item against the vertical's topic map and drops anything outside it. A vector deduplication stage compares incoming items against a rolling window of recent publications and collapses near-duplicates before any language model is invoked, which keeps cost and noise predictable.
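As a sketch of that dedupe stage, assuming cosine similarity over normalized embeddings; the window size, threshold, and names here are illustrative, not the production values:

from collections import deque

import numpy as np

WINDOW = deque(maxlen=2000)  # unit vectors for the rolling publication window
THRESHOLD = 0.92             # similarity above this collapses the incoming item

def is_duplicate(embedding: np.ndarray) -> bool:
    """Compare one incoming item against recent publications before any LLM call."""
    v = embedding / np.linalg.norm(embedding)
    for seen in WINDOW:
        if float(np.dot(v, seen)) >= THRESHOLD:
            return True   # near-duplicate: dropped before enrichment spends a token
    WINDOW.append(v)      # unique: joins the window that future items check against
    return False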

Surviving items are enriched in a single pass: a structured summary, metadata fields the CMS expects, and a headline shaped to the audience's conventions. Source images are fetched and rehosted to the publisher's own storage, both for legal hygiene and because upstream hosts are unreliable under repeated access. The finished record is posted to a headless CMS through its admin API. Cron orchestrates the whole chain on fixed intervals.

PUBLISHER LOG · LIVE
04:00:12 router scanned 312 items · kept 47 · dropped 265
04:00:48 dedupe 47 candidates · unique 11 · collapsed 36
04:01:39 enrich 11 items · gpt-4o ok · 0 retries
04:02:21 rehost 11 items · 34 images · 0 failures
04:02:57 publish 11 posts queued · cms ack 11/11
04:03:04 cycle done 00:02:52 elapsed · next run 05:00:00
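A minimal sketch of the final publish call, assuming a generic headless CMS admin API; the endpoint, field names, and auth scheme are placeholders rather than the real CMS contract:

import requests

CMS_URL = "https://cms.example.com/admin/api/posts"  # placeholder endpoint

def publish(record: dict, token: str) -> str:
    """Post one enriched record to the CMS and return the acknowledged id."""
    payload = {
        "title": record["headline"],
        "body": record["summary"],
        "tags": record["topics"],
        "hero_image": record["rehosted_image_url"],
    }
    resp = requests.post(
        CMS_URL,
        json=payload,
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    resp.raise_for_status()  # a missing ack shows up in the cycle log, not a retry loop
    return resp.json()["id"]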

04 · WHAT THE SYSTEM ACTUALLY IS

At rest, the publisher is a small Python service, a vector index scoped to the recent publishing window, and a set of cron entries. There is no queue broker, no orchestration framework, no dashboard for editors to babysit. Each stage writes a structured log line, and the logs are the interface.
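A sketch of that logging convention, with stage names and fields modeled loosely on the sample log above rather than taken from the real service:

import json
import sys
import time

def log_stage(stage: str, **fields) -> None:
    """One structured line per stage; grep is the whole dashboard."""
    line = {"ts": time.strftime("%H:%M:%S"), "stage": stage, **fields}
    sys.stdout.write(json.dumps(line) + "\n")

log_stage("router", scanned=312, kept=47, dropped=265)
log_stage("dedupe", candidates=47, unique=11, collapsed=36)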

The CMS is headless and treated as a dumb target. The publisher owns the taxonomy, the deduplication state, and the image store. If the CMS were replaced tomorrow, the rest of the pipeline would not notice. That separation was deliberate and it is the reason the system has survived two upstream changes without an incident.
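One way to read that seam in code, as a hypothetical interface rather than the service's actual class layout; the pipeline depends on a single method, so replacing the CMS means writing one new adapter:

from typing import Protocol

class CmsTarget(Protocol):
    """The only thing the rest of the pipeline knows about the CMS."""

    def publish(self, record: dict) -> str:
        """Push one finished record; return the CMS-assigned id."""
        ...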

Failure modes are narrow and named. A source feed going dark reduces coverage but does not stall the cycle. A model timeout drops the item and logs it, rather than retrying into a cost spiral. The rehost step is the only stage allowed to block, and it blocks only on its own budget.
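A sketch of that budget discipline; the budget value, timeouts, and the store() helper are hypothetical stand-ins for the real image store:

import time

import requests

REHOST_BUDGET_S = 60  # illustrative per-cycle budget, not the real figure

def store(blob: bytes) -> str:
    """Stand-in for the publisher's own image storage."""
    return f"https://img.example.com/{hash(blob) & 0xFFFFFFFF:08x}"

def rehost_images(urls: list[str]) -> list[str]:
    """Fetch and rehost source images, blocking only within the stage's own budget."""
    deadline = time.monotonic() + REHOST_BUDGET_S
    hosted: list[str] = []
    for url in urls:
        if time.monotonic() >= deadline:
            break  # budget spent: publish with what we have rather than stall the cycle
        try:
            blob = requests.get(url, timeout=5).content
            hosted.append(store(blob))
        except requests.RequestException:
            continue  # a flaky upstream host costs one image, not the run
    return hosted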

"A small editorial team's worth of manual triage, replaced by a deterministic pipeline that files on a schedule and does not ask for supervision." — on what the system replaced
05 · WHAT IT PRODUCES

The publisher files roughly seven hundred and twenty articles a month at a steady twenty-four-per-day cadence. Pipeline uptime sits at 99.4% measured over the last full quarter, with the remaining time accounted for by scheduled CMS maintenance rather than publisher faults. The average number of human edits per published piece is zero.

The more interesting result is what the operator stopped doing. The daily triage loop is gone. The audience receives coverage on a fixed rhythm they can plan around, and the cost of serving them does not scale with the volume of incoming noise. That is the point of the system, and it is the standard any future addition has to meet.

06 · STACK
Python publisher · Headless CMS · GPT-4o · Vector dedupe · cron

A Python publisher, a vector index, a headless CMS, and cron. Nothing else earns its keep.


If this maps to a system you need built or fixed — tell me about it.
