01 Three surfaces, one product
HMM Trade has three deployment surfaces working together. Two we operate; the third runs on the user's machine and is mainly for the free tier. Each surface has a clear boundary so a failure in one doesn't cascade into the others.
- Cloud control plane: Next.js app + Supabase Postgres + Stripe. Handles signup, billing, broker connections, bot lifecycle endpoints.
- Per-bot Fly Machines: a lightweight VM per paid bot, running the same Python trader code as the local agent.
- Local agent: Streamlit dashboard + multi-bot supervisor that runs on the user's laptop. Free tier's primary surface; paid users can run it side-by-side for richer visualisation.
02 Cloud control plane (this app)
03 Per-bot Fly Machines (paid tiers)
Every hosted bot runs as its own lightweight VM on Fly Machines. The image is registry.fly.io/regime-trader-bot:latest (Python 3.12, ~250MB). Each Machine gets 1GB RAM and, for stock-only bots, scales to zero between ticks.
The container does:
- Boots and reads its BOT_ID + machine JWT.
- Calls GET /api/v1/internal/bots/<id> to fetch profile_json + connection metadata.
- Calls POST /api/v1/internal/broker_token which decrypts the broker creds via KMS and returns a short-lived access token.
- Materializes config/instances/<id>.yaml + .env on disk so the existing main.py paper startup path reads them like a local install.
- Spawns the trader subprocess. Output is teed to Fly logs + the cloud audit sink so /bots/<id> mirrors what fly logs shows.
- On SIGTERM (Fly maintenance, Stripe cancel, manual stop): forward to the trader subprocess, flush audit, exit.
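The "materialize" step in the list above could look roughly like the sketch below. It is illustrative only: the function name, the location of the .env file (next to the config directory), and the environment variable names are assumptions, and the profile is written as JSON, which YAML 1.2 parsers accept, so that the example needs no third-party YAML library.

```python
import json
import os
from pathlib import Path


def materialize_instance(root: Path, bot_id: str, profile: dict, env: dict) -> tuple:
    """Write config/instances/<bot_id>.yaml and .env under root (hypothetical helper)."""
    cfg_dir = root / "config" / "instances"
    cfg_dir.mkdir(parents=True, exist_ok=True)

    cfg_path = cfg_dir / f"{bot_id}.yaml"
    # JSON output is also a valid YAML 1.2 document, so no YAML dependency is needed here.
    cfg_path.write_text(json.dumps(profile, indent=2, sort_keys=True) + "\n")

    # Assumption: .env sits at the install root, one KEY=value per line.
    env_path = root / ".env"
    lines = [f"{key}={value}" for key, value in sorted(env.items())]
    env_path.write_text("\n".join(lines) + "\n")
    # The file holds a broker token, so restrict it to the owning user.
    os.chmod(env_path, 0o600)
    return cfg_path, env_path
```

Writing the files in the same layout as a local install is what lets the hosted container reuse the unmodified startup path.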
04 Free tier (hosted, throttled, time-limited)
Free is a 30-day trial of the same hosted-compute path that paid tiers run on, just throttled. Same Fly Machine image, same broker integration, same regime model. The differences are knobs we stamp on the bot row at create time:
- tick_interval_seconds = 3600: the trader subprocess sleeps 1 hour between cycles instead of 1 minute. 60× cost reduction vs paid; “60× faster signals” is the upgrade hook.
- last_active_at + idle_suspend_after_days = 7: a daily Vercel cron stops the Fly Machine for free bots whose owner hasn't visited the dashboard in a week. State is preserved; the user clicks Resume to bring it back.
- trial_ends_at = created_at + 30 days: same cron deletes the bot row + Fly Machine after the trial window unless the user has upgraded. Three warning emails fire at 7 / 3 / 1 days remaining.
When a free user upgrades to Pro / Live, the Stripe webhook and the reconcile path both call syncTrialEndsAtForUser() to clear trial_ends_at on the user's bots so the cron stops considering them. On downgrade back to free we re-stamp a fresh 30-day grace window, so a paying-then-canceled customer's bot is never silently deleted the moment the subscription lapses.
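As a rough illustration of the daily cron's decision for one free bot, the sketch below combines the idle-suspend and trial-expiry rules. The FreeBot type, the action strings, and the rounding of "days remaining" (rounded up to whole days) are assumptions made for the example, not the production cron.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

IDLE_SUSPEND_AFTER_DAYS = 7
WARNING_DAYS = (7, 3, 1)


@dataclass
class FreeBot:
    last_active_at: datetime
    trial_ends_at: Optional[datetime]  # None once cleared by an upgrade


def cron_actions(bot: FreeBot, now: datetime) -> list:
    """Return the actions the daily cron would take for one bot (hypothetical names)."""
    if bot.trial_ends_at is None:
        return []  # upgraded: the cron no longer considers this bot
    if now >= bot.trial_ends_at:
        return ["delete"]

    actions = []
    # Ceiling of the remaining time in whole days (assumed rounding).
    days_left = -((now - bot.trial_ends_at) // timedelta(days=1))
    if days_left in WARNING_DAYS:
        actions.append(f"warn:{days_left}")
    if now - bot.last_active_at >= timedelta(days=IDLE_SUSPEND_AFTER_DAYS):
        actions.append("suspend")
    return actions
```

Treating a cleared trial_ends_at as "skip" mirrors the upgrade path described above: once the field is null, neither the delete nor the suspend branch can fire.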
05 KMS envelope encryption
Broker credentials are stored in Postgres but encrypted with a per-payload AES-256 data key. The data key itself is encrypted under a Cloud KMS key — neither the database nor our application code can decrypt without a live KMS round-trip. Compromising the DB alone doesn't leak tokens.
Encrypt path (broker connect):
- Generate a random AES-256 data key in process memory.
- Call KMS.Encrypt on the data key → ciphertext data key.
- AES-GCM encrypt the credential JSON with the plaintext data key → ciphertext + IV + auth tag.
- Persist (ciphertext, ciphertext data key, IV, auth tag, KMS key id) on broker_connections. Zero the plaintext data key.
Decrypt path (bot launch):
- Read the ciphertext, ciphertext data key, IV, and auth tag from broker_connections.
- Call KMS.Decrypt on the ciphertext data key → plaintext data key.
- AES-GCM decrypt the credential JSON. Pass to the bot over TLS. Plaintext lives in API process memory only for the duration of one HTTPS request.
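A minimal sketch of both paths follows, assuming the third-party cryptography package is installed. LocalKms is a stand-in for Cloud KMS that wraps data keys under a key held only inside the object; the dictionary field names and the key id are illustrative and not the real broker_connections schema.

```python
import json
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM


class LocalKms:
    """Stand-in for Cloud KMS (assumption): wraps data keys under a key it never exposes."""

    def __init__(self) -> None:
        self._master = AESGCM(AESGCM.generate_key(bit_length=256))

    def encrypt(self, plaintext: bytes) -> bytes:
        nonce = os.urandom(12)
        return nonce + self._master.encrypt(nonce, plaintext, None)

    def decrypt(self, blob: bytes) -> bytes:
        return self._master.decrypt(blob[:12], blob[12:], None)


def encrypt_credentials(kms: LocalKms, kms_key_id: str, creds: dict) -> dict:
    data_key = AESGCM.generate_key(bit_length=256)  # per-payload data key
    iv = os.urandom(12)
    # AESGCM.encrypt returns the ciphertext with the 16-byte tag appended.
    sealed = AESGCM(data_key).encrypt(iv, json.dumps(creds).encode(), None)
    # Python bytes are immutable, so the key cannot be zeroed in place here;
    # it simply goes out of scope when the function returns.
    return {
        "ciphertext": sealed[:-16],
        "auth_tag": sealed[-16:],
        "iv": iv,
        "encrypted_data_key": kms.encrypt(data_key),
        "kms_key_id": kms_key_id,
    }


def decrypt_credentials(kms: LocalKms, row: dict) -> dict:
    data_key = kms.decrypt(row["encrypted_data_key"])  # the KMS round-trip
    plaintext = AESGCM(data_key).decrypt(
        row["iv"], row["ciphertext"] + row["auth_tag"], None
    )
    return json.loads(plaintext)
```

Because only the wrapped data key is stored, a copy of the row by itself is not enough to recover the credentials; decryption always goes through the KMS call first.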
06 Failure modes by tier
- Free + cloud down: agent keeps trading on cached JWT for 24h, then asks user to reconnect.
- Free + user laptop off: bot stops. Free tier doesn't promise 24/7.
- Paid + control plane down: hosted bots keep ticking (Fly Machines don't depend on our control plane to tick). User can't change settings or see the dashboard.
- Paid + Fly outage: our incident. Status page, refund per ToS.
- Paid + KMS down: bots already running keep ticking (token decrypted in memory). New bot launches fail until KMS recovers. Bounded blast radius.
07 Model catalog + admin training pipeline
Models are admin-trained, versioned, and stored in Supabase Storage with metadata in model_catalog. Each row is a (family, version) pair: family is the asset_class + timeframe slot (e.g. hmm-stocks-top100-daily); version is a UTC ISO timestamp from the publish run. Bots subscribe per-asset-class via bots.model_prefs — a JSONB map of {stock: {family, version}, crypto: {...}}. The bot's model_puller picks up new versions on its hourly poll, hot-swaps the in-memory HMM, no restart needed.
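One way the hourly poll might decide what to hot-swap is sketched below. It assumes the puller follows the newest version within each subscribed family and that versions are fixed-width UTC ISO timestamps, which compare correctly as plain strings; the function names and the crypto family name in the usage note are hypothetical.

```python
from typing import Optional


def latest_version(catalog: list, family: str) -> Optional[str]:
    """Newest version string for a family, or None if the family has no rows."""
    versions = [row["version"] for row in catalog if row["family"] == family]
    # Fixed-width UTC ISO timestamps sort chronologically as strings.
    return max(versions) if versions else None


def plan_swaps(model_prefs: dict, catalog: list) -> dict:
    """Map asset class -> version to load, for classes with a newer catalog row."""
    swaps = {}
    for asset_class, pref in model_prefs.items():
        newest = latest_version(catalog, pref["family"])
        if newest is not None and newest > pref["version"]:
            swaps[asset_class] = newest
    return swaps
```

For example, with a stock preference pinned to 2025-01-05T00:00:00Z and a catalog row for the same family at 2025-02-01T00:00:00Z, plan_swaps returns a single entry for stock and leaves other asset classes untouched.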
Training happens in a one-off Fly Machine in the dedicated rt-trainer app, separate from per-bot Machines so trainers can't starve live-trading containers. The cloud's POST /api/v1/admin/models/train route spawns the machine with python -m admin_trainer publish --family X --data-source yfinance (or alpaca) and auto_destroy: true. The trainer fetches bars, fits HMMs across n_states ∈ [3..7] with 4 random inits each, picks the lowest-BIC fit, runs validation gates, uploads the .pkl to Storage, INSERTs the catalog row, and exits. A run takes ~3-15 min depending on universe size.
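The lowest-BIC selection can be sketched in plain Python once each fit's log-likelihood is known. The parameter count below assumes a Gaussian HMM with diagonal covariances, which the text does not specify; the Fit type and function names are illustrative.

```python
import math
from typing import NamedTuple


class Fit(NamedTuple):
    n_states: int
    seed: int
    log_likelihood: float


def n_free_params(n_states: int, n_features: int) -> int:
    """Free parameters of a Gaussian HMM with diagonal covariances (assumed form)."""
    start = n_states - 1                 # initial state distribution
    trans = n_states * (n_states - 1)    # each transition row sums to 1
    means = n_states * n_features
    variances = n_states * n_features
    return start + trans + means + variances


def bic(fit: Fit, n_samples: int, n_features: int) -> float:
    """BIC = -2 * logL + k * ln(n); lower is better."""
    k = n_free_params(fit.n_states, n_features)
    return -2.0 * fit.log_likelihood + k * math.log(n_samples)


def pick_lowest_bic(fits: list, n_samples: int, n_features: int) -> Fit:
    return min(fits, key=lambda f: bic(f, n_samples, n_features))
```

The k * ln(n) term is what stops the 7-state candidates from winning automatically: extra states raise the likelihood but also add transition and emission parameters that the penalty charges for.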
The live training log streams via public.training_events, a Supabase table the trainer POSTs structured events to via PostgREST as it works. Stages: starting → fetching_bars → fitting → validating → publishing → done (or failed). The cloud's training UI polls GET /api/v1/admin/models/train/events every 3s for live progress. RLS denies all reads except service-role; access happens through the cloud's admin/trainer-gated endpoint.
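The stage sequence lends itself to a small ordering check, sketched here under two assumptions: a stage may emit several events in a row (for example repeated fitting progress), and failed can follow any non-terminal stage. The helper names are hypothetical.

```python
from typing import Optional

STAGES = ("starting", "fetching_bars", "fitting", "validating", "publishing", "done")
TERMINAL = frozenset({"done", "failed"})


def is_valid_transition(prev: Optional[str], nxt: str) -> bool:
    """True if an event in stage nxt may follow an event in stage prev."""
    if prev in TERMINAL:
        return False  # nothing follows done or failed
    if nxt == "failed":
        return True   # a run can fail at any point
    if nxt not in STAGES:
        return False
    if prev is None:
        return nxt == "starting"
    # Stages only move forward; staying in the same stage is allowed.
    return STAGES.index(nxt) >= STAGES.index(prev)


def should_keep_polling(stages_seen: list) -> bool:
    """The UI can stop polling once the last event is terminal."""
    return not stages_seen or stages_seen[-1] not in TERMINAL
```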
Roles. Privilege ladder is user < tester < trainer < admin. tester can run backtests on /admin/backtest but not train. trainer adds train + publish on /admin/models. admin adds user management on /admin/users. Each role inherits everything below it via the require* helpers in lib/auth.ts (requireTester, requireTrainer, requireAdmin). Granted via the four-segment pill on /admin/users.
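The inheritance rule reduces to an ordered comparison. The sketch below models the ladder in Python for illustration only; the actual helpers live in lib/auth.ts, and the Role enum, Forbidden exception, and require_role function here are assumed names, not that file's API.

```python
from enum import IntEnum


class Role(IntEnum):
    USER = 0
    TESTER = 1
    TRAINER = 2
    ADMIN = 3


class Forbidden(Exception):
    """Raised when the caller's role is below the required minimum."""


def require_role(actual: Role, minimum: Role) -> None:
    # Each role inherits everything below it, so a single >= check suffices.
    if actual < minimum:
        raise Forbidden(f"{actual.name.lower()} cannot access {minimum.name.lower()} routes")
```

With this shape, a trainer passes a tester-level check on /admin/backtest, while a tester is rejected by a trainer-level check on /admin/models.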
07b Backtest console
/admin/backtest lets tester+ users run walk-forward backtests against any published HMM family. The form accepts symbols + date range + family + target allocation; submission inserts a row into backtest_runs with status=queued. The Python admin_backtester worker (mirroring the trainer's shape — Fly Machine with auto-destroy) picks up queued rows, calls into backtest/backtester.py, and streams progress to backtest_events.
Same event-stream pattern as training: starting → fetching_bars → fitting → simulating → computing_stats → done (or failed). The run-detail page (/admin/backtest/[id]) polls events every 2s while live and renders results (equity curve, Sharpe, max-drawdown, regime timeline, per-trade log) when status flips to done. Results are persisted as a JSONB blob on the backtest_runs row so old runs stay inspectable indefinitely.
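Two of the reported statistics can be derived directly from the equity curve. The sketch below is a generic formulation, not the code in backtest/backtester.py; the 252-period annualization and the zero default risk-free rate are assumptions.

```python
import math
from statistics import mean, stdev


def max_drawdown(equity: list) -> float:
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def sharpe_ratio(equity: list, periods_per_year: int = 252, risk_free: float = 0.0) -> float:
    """Annualized Sharpe ratio from per-period simple returns."""
    returns = [b / a - 1.0 for a, b in zip(equity, equity[1:])]
    if len(returns) < 2:
        return 0.0  # not enough data for a sample standard deviation
    excess = [r - risk_free / periods_per_year for r in returns]
    spread = stdev(excess)
    if spread == 0:
        return 0.0  # avoid dividing by zero on a perfectly smooth curve
    return mean(excess) / spread * math.sqrt(periods_per_year)
```

For the curve [100, 120, 90, 110], the peak of 120 followed by the trough of 90 gives a maximum drawdown of 0.25.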
Why a separate role from trainer? Backtests don't mutate the model catalog — they're read-only with respect to published artifacts. Splitting tester from trainer lets a researcher iterate on backtest configurations without inheriting the right to publish new model versions to production.
08 Transactional email (Resend)
Welcome and tier-upgrade emails ship via Resend with React Email templates. Domain-verified (DKIM/SPF/DMARC) on hmmtrade.com. Triggers:
- Welcome: auth callback (/auth/callback) on first sign-in. occasion key = once.
- Pro / Live / Enterprise upgrade: Stripe webhook (/api/v1/stripe/webhook) on checkout.session.completed + customer.subscription.updated when status is active or trialing. occasion key = stripe_subscription_id so multi-period renewals don't re-fire; cancel + re-subscribe does.
Idempotency is two-layered: public.email_log (unique on (user_id, template, occasion_key)) for our DB-side record, and a deterministic Idempotency-Key passed to Resend so racing webhook events dedupe at the provider layer too. Stripe fires three events per upgrade and Vercel can run handlers concurrently — without provider-layer dedup, that race produced 2-3 emails per upgrade in practice.
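Both layers can be illustrated with the standard library. In the sketch below, SQLite stands in for Postgres (where the equivalent would be INSERT ... ON CONFLICT DO NOTHING), and the SHA-256 key derivation is one plausible way to make the Idempotency-Key deterministic; the text does not say how the real key is built.

```python
import hashlib
import sqlite3


def open_email_log() -> sqlite3.Connection:
    """In-memory stand-in for public.email_log with the same unique constraint."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE email_log ("
        " user_id TEXT, template TEXT, occasion_key TEXT,"
        " UNIQUE (user_id, template, occasion_key))"
    )
    return db


def idempotency_key(user_id: str, template: str, occasion_key: str) -> str:
    """Same inputs always give the same key, so racing handlers send identical keys."""
    raw = "\x1f".join((user_id, template, occasion_key))
    return hashlib.sha256(raw.encode()).hexdigest()


def claim_send(db: sqlite3.Connection, user_id: str, template: str, occasion_key: str) -> bool:
    """Return True only for the first caller to record this (user, template, occasion)."""
    cur = db.execute(
        "INSERT OR IGNORE INTO email_log (user_id, template, occasion_key) VALUES (?, ?, ?)",
        (user_id, template, occasion_key),
    )
    db.commit()
    return cur.rowcount == 1
```

A handler would call claim_send first and skip the send when it returns False. The derived key covers the window where two handlers race before either insert is visible to the other.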
09 Theme system (light + dark)
Tokens in app/globals.css are RGB triplets (e.g. --color-bg: 8 9 15) so Tailwind's <alpha-value> substitution still works (bg-panel/40 → rgb(...)/0.4). Two themes: default (no class) is dark; html.light is the inverted light palette. ThemeProvider in components/theme/ manages the {system | dark | light} setting, persists to localStorage, and tracks prefers-color-scheme live for system mode.
An inline boot script in <head> sets the class on <html> before the body paints, eliminating FOUC. Toggle is the three-segment pill in the top nav. Email templates ship dark-themed regardless of recipient's OS preference; consistency of the brand identity wins over context-matching there.
Brand mark lives in public/logo/ as SVG. The regime-stack composition (three offset bars in the brand palette on a dark panel) directly mirrors the dashboard's regime ribbon. Next.js auto-detects app/icon.svg as the favicon and app/apple-icon.svg as the iOS home-screen icon. The mark also surfaces in the email header (rendered as nested colored divs since most email clients strip inline SVG).