What kept it honest in production
Architecture is what you ship. Maturity is what survives the first six months. The system above did not run cleanly out of the box. It ran cleanly after eight specific incidents were caught, diagnosed, and codified into rules that prevented their recurrence. The rules are the actual deliverable.
The follow-up engine fired on a Sunday once because the cron-level weekday guard had been added on the sender but not the sequencer. Twenty-some bumps went out at three in the afternoon to people who had not asked to hear from anyone over a weekend. The fix was not the cron line. The fix was a code-level weekday guard at the top of every script that touches outbound, on the principle that any business rule which spans multiple entry points has to be enforced in the language the scripts are written in, not in the schedule that runs them.
The performance sync started failing every four hours after a deliberate scale-up, because it was polling the send-engine events one message at a time. At small volumes, fine. At the new volume, it ran into the engine's per-tag rate limit, sat in exponential backoff for hours, blew through a stale Airtable pagination token, and tripped an fcntl lock on the next run. The fix was structural — replace per-message polling with a single bulk events query paginated by date, group by message ID in process, and reconcile in one pass. Steady-state load dropped by about an order of magnitude. The lock collisions disappeared. The lesson was that any per-record loop over a rate-limited API is a bomb on a long fuse the moment volume changes.
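The bulk-and-group shape can be sketched as below. The `fetch_page` callable is a stand-in for whatever bulk events endpoint the send engine exposes — its signature and the event fields are assumptions for illustration — but the structure is the fix: one paginated pass over the API, all grouping done in memory:

```python
from collections import defaultdict
from typing import Callable


def sync_events(fetch_page: Callable, since: str) -> dict:
    """One bulk pass over the events feed instead of one call per message.

    fetch_page(since, cursor) -> (events, next_cursor) is a hypothetical
    wrapper around the engine's date-paginated bulk events query; each
    event is a dict carrying at least 'message_id' and 'type'.
    """
    by_message = defaultdict(list)
    cursor = None
    while True:
        events, cursor = fetch_page(since, cursor)
        for ev in events:
            # Group in process: the API is only ever asked for pages,
            # never for individual messages.
            by_message[ev["message_id"]].append(ev["type"])
        if cursor is None:
            break
    return dict(by_message)
```

The reconciliation step then walks `by_message` once against the local records, so the number of API calls scales with the date range, not with the number of messages.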
A research-API quota ran out one evening and the pipeline aborted partway through a batch. Half the leads got drafted, half did not, the sender dutifully tried to send the half that did, and the only sign that anything was wrong was the run of bounce-style empty subjects the operator found the next morning. The fix was a small set of error-message markers — `insufficient_quota`, `billing_hard_limit`, `invalid_api_key` — checked on every model call, with a hard system-exit on first match so the wrapping shell harness could surface the failure to Telegram immediately. Silent quota exhaustion is the worst kind of outage, because everything else looks fine.
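The marker check is small enough to sketch almost whole. The three marker strings are from the incident; the function name and the exact exit message are illustrative:

```python
import sys

# Substrings that mean the key or quota is dead, not that the call was
# transiently unlucky. Retrying past any of these only burns time.
FATAL_MARKERS = ("insufficient_quota", "billing_hard_limit", "invalid_api_key")


def check_fatal(error_text: str) -> None:
    """Run against the error text of every failed model call.

    Exits the process on first match so the wrapping shell harness sees
    a nonzero status and can alert immediately, instead of the batch
    limping on half-drafted.
    """
    for marker in FATAL_MARKERS:
        if marker in error_text:
            sys.exit(f"fatal API error ({marker}): aborting batch")
```

The deliberate crudeness is the feature: a substring match on the error body needs no SDK-specific exception taxonomy, and it fails loud the moment the provider's wording shows up.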
The follow-up engine was, for a time, ignoring local business hours. The cron schedule was UTC. The recipients were in twenty time zones. The fix was a small country-to-IANA-timezone resolver and a per-lead `in_business_hours()` gate inserted before the LLM draft call, so a follow-up to a Singapore recipient at three in the morning their time would skip silently and re-evaluate on the next hourly tick. Off-hours sends do not just damage open rates; they cost real model tokens that should never have been spent.
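A minimal version of that gate, assuming a country-code lookup table (the table fragment, the 9-to-5 window, and the fail-open default for unmapped countries below are all assumptions, not the author's stated policy):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Hypothetical fragment of the country-to-IANA-timezone resolver.
COUNTRY_TZ = {"SG": "Asia/Singapore", "DE": "Europe/Berlin", "US": "America/New_York"}


def in_business_hours(country: str, now_utc: datetime,
                      start: int = 9, end: int = 17) -> bool:
    """True if it is currently within business hours in the lead's country.

    Runs before the LLM draft call: a False here skips the lead silently
    and lets the next hourly tick re-evaluate, so no tokens are spent on
    a draft that would land at 3 a.m. local time.
    """
    tz = COUNTRY_TZ.get(country)
    if tz is None:
        return True  # assumed fail-open default for unmapped countries
    local = now_utc.astimezone(ZoneInfo(tz))
    return start <= local.hour < end
```

Putting the gate before the draft call rather than before the send is the token-saving part: the expensive step is never reached for a lead that is asleep.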
Most recently, a recipient replied with a flat denial — "I don't work for that company." Pointing the same investigation he had triggered back at his own record showed the reply was a false alarm: he was correctly attributed, and his note was scope-deflection rather than a data error. But pulling on it surfaced the broader problem his record represented. The pipeline had never compared the email's domain to the company URL on the record, and across the full list, almost two percent of records were attributed to companies the people in question did not work for. Fifteen dollars of language-model verification, four hundred and fifty-three confirmed-bad records pulled out of the active pipeline, and a three-tier verification gate — deterministic domain comparison, a corpus-frequency allowlist of corporate alt-domains, an LLM fallback for the residual — that now runs on every new lead before any draft is generated. The cost of verifying the input was small. The cost of sending into a wrong-attributed list was the part the operator did not want to keep paying.
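The three tiers compose into one short gate. In the sketch below the allowlist entry, the function names, and the `llm_check` callable are illustrative stand-ins — the allowlist in the real system was built from corpus frequency, and the fallback is a model call — but the ordering is the design: deterministic first, cheap lookup second, the model only for the residual:

```python
from urllib.parse import urlparse

# Hypothetical allowlist of (company_domain, email_domain) pairs that the
# corpus-frequency pass confirmed as corporate alt-domains.
ALT_DOMAIN_ALLOWLIST = {("acme.com", "acmecorp.io")}


def verify_attribution(email: str, company_url: str, llm_check) -> str:
    """Three-tier attribution gate, run on every new lead before drafting.

    Returns 'ok' or 'suspect'. llm_check(email, company_url) -> bool is a
    stand-in for the language-model verification tier.
    """
    email_domain = email.rsplit("@", 1)[-1].lower()
    company_domain = urlparse(company_url).netloc.lower().removeprefix("www.")
    # Tier 1: deterministic — exact or subdomain match needs no model.
    if email_domain == company_domain or email_domain.endswith("." + company_domain):
        return "ok"
    # Tier 2: known corporate alt-domains, a set lookup.
    if (company_domain, email_domain) in ALT_DOMAIN_ALLOWLIST:
        return "ok"
    # Tier 3: only the residual ever costs a model call.
    return "ok" if llm_check(email, company_url) else "suspect"
```

The tier ordering is what made the fifteen-dollar bill possible: the deterministic and allowlist tiers absorb the bulk of the list for free, so the model is only paid for the genuinely ambiguous remainder.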