Deep dive · hmm · bic · model selection

Why 7 regimes? How BIC picks the model so you don’t overfit

More states explain training data better. They also overfit. The Bayesian Information Criterion is a classical compromise — here’s what it does and why we trust its answer.

Alok Desai · 9 min read

When we ship an HMM in HMM Trade's catalog, the regime page shows things like “7 states · BIC 204,407.” New users ask reasonable questions: why seven? Why not five? Why not fifteen? The answer is that we don't pick; an objective scoring function does. This post explains what that function is, why it matters more than it sounds, and why the same approach has been used in statistics since 1978.

The overfitting problem in two lines

Suppose you fit an HMM with 3 states to a year of bar data. It explains the data with some likelihood; call it L₃. You add a 4th state and refit. Likelihood goes up: L₄ > L₃. Inevitably. More parameters means more flexibility means a tighter fit.

Push this further. By the time you've got 15 states the model can wiggle to match almost every cluster of bars in training, including the noise. The likelihood is gorgeous. The model is also useless on next month's data because it memorized the past instead of learning the structure.

BIC: penalizing complexity

The Bayesian Information Criterion, proposed by Gideon Schwarz in 1978, is a famously simple compromise:

BIC = k · log(n) − 2 · log(L)

  k = number of free parameters in the model
  n = number of training observations
  L = the maximum-likelihood value

Lower is better. The first term punishes you for parameter count, scaled by log(n). The second rewards you for fit. Adding a parameter always improves L, but it lowers BIC only if the fit improvement outweighs the complexity cost: 2 · log(L) has to rise by more than log(n). At some K the trade flips and BIC starts going up again. That minimum is the K you want.
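As a minimal sketch, the formula translates almost directly into code. The function names here are illustrative, not taken from our trainer, and the function takes the log-likelihood rather than L itself, since that is what HMM libraries typically report.

```python
import math


def bic(n_params: int, n_obs: int, log_likelihood: float) -> float:
    """BIC = k * log(n) - 2 * log(L). Lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def extra_param_worth_it(n_obs: int, loglik_gain: float) -> bool:
    """One more free parameter lowers BIC only if 2 * gain > log(n)."""
    return 2.0 * loglik_gain > math.log(n_obs)


# With 8500 observations, log(n) is about 9.05, so each extra parameter
# has to buy more than about 4.5 units of log-likelihood to pay for itself.
print(extra_param_worth_it(8500, loglik_gain=5.0))
print(extra_param_worth_it(8500, loglik_gain=4.0))
```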

For our crypto-1h universe, with 8500 training observations and ~125 parameters per state, the curve looked like this on the publish run we shipped:

n_states    BIC
3           224,098
4           214,708
5           210,571
6           206,713
7           204,407   ✓ winner
8           (degraded)

Each step from 3 → 7 the BIC dropped: the marginal fit gain was worth the marginal complexity cost. At 8, the fit gain was no longer enough — BIC started going up. Seven is the sweet spot for this data.
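The selection step itself is just an argmin over the published scores. A small sketch using the numbers from the run above (the 8-state score is left out because only "degraded" was reported for it; the variable names are illustrative):

```python
# BIC per candidate, copied from the publish run above.
bic_by_n_states = {3: 224_098, 4: 214_708, 5: 210_571, 6: 206_713, 7: 204_407}

# Lowest BIC wins.
best_n_states = min(bic_by_n_states, key=bic_by_n_states.get)

# How much each extra state lowered BIC compared with the previous candidate.
drops = {
    k: bic_by_n_states[k - 1] - bic_by_n_states[k]
    for k in sorted(bic_by_n_states)
    if k - 1 in bic_by_n_states
}

print(best_n_states)  # 7
print(drops)          # {4: 9390, 5: 4137, 6: 3858, 7: 2306}
```

The shrinking drops are the diminishing returns you would expect as the curve approaches its minimum.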

Why log(n) and not just n?

BIC's complexity penalty grows as log(n), not linearly. The intuition: when you have very little data, you should be conservative about parameters. When you have a lot, the data can “support” more parameters reliably. The fit gain from genuine structure grows roughly in proportion to n, while the per-parameter penalty grows only as log(n): going from 1,000 to 10,000 observations raises it from about 6.9 to about 9.2. So with more data, real structure clears the bar more easily, which matches statistical intuition about how fast you can identify ever-finer structure.

There's a related criterion, AIC (Akaike), that uses 2·k instead of k·log(n). AIC is less aggressive at penalizing complexity and tends to pick bigger models. For finance with finite data and out-of-sample concerns, we lean BIC. Both are defensible; pick one and be consistent.
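To put numbers on the difference, here is a small sketch comparing the two penalty terms. The function names are ours for illustration; only the formulas come from the definitions above.

```python
import math


def aic_penalty(n_params: int) -> float:
    """AIC charges a flat 2 per free parameter."""
    return 2.0 * n_params


def bic_penalty(n_params: int, n_obs: int) -> float:
    """BIC charges log(n) per free parameter."""
    return n_params * math.log(n_obs)


# Per-parameter cost at our training size: log(8500) is about 9.05 for BIC
# versus a flat 2 for AIC, so BIC is roughly 4.5x stricter here.
ratio = bic_penalty(1, 8500) / aic_penalty(1)
print(round(ratio, 2))

# BIC is the stricter criterion whenever log(n) > 2, i.e. n >= 8.
print(bic_penalty(1, 7) < aic_penalty(1), bic_penalty(1, 8) > aic_penalty(1))
```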

What “parameter count” means for an HMM

For a Gaussian HMM with K states and F features:

  • K mean vectors (one per state): K × F parameters.
  • K covariance matrices (full-cov): K × F × (F+1)/2 parameters.
  • K-1 startprob entries (sums to 1, so K-1 free).
  • K × (K-1) transition entries (each row sums to 1).

For our 7-state model with ~15 features, that's approximately 7×15 + 7×120 + 6 + 42 ≈ 990 free parameters once you count the covariance matrix entries. Sounds like a lot — but with 8500 observations, BIC says it's fine.

If you reduced features (say, just returns + 5d realized vol — F=2), the parameter count drops dramatically and the BIC curve might select a different K. The optimal K is a function of the feature space, not a constant of physics.
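The count above is easy to reproduce. A minimal sketch for a full-covariance Gaussian HMM, where the function name is hypothetical and the feature counts (15 and 2) are the approximate figures quoted in this post:

```python
def gaussian_hmm_param_count(n_states: int, n_features: int) -> int:
    """Free parameters of a Gaussian HMM with full covariance matrices."""
    means = n_states * n_features
    # A symmetric F x F covariance matrix has F * (F + 1) / 2 free entries.
    covariances = n_states * n_features * (n_features + 1) // 2
    startprob = n_states - 1                   # entries sum to 1
    transitions = n_states * (n_states - 1)    # each row sums to 1
    return means + covariances + startprob + transitions


print(gaussian_hmm_param_count(7, 15))  # 105 + 840 + 6 + 42 = 993
print(gaussian_hmm_param_count(7, 2))   # 14 + 21 + 6 + 42 = 83
```

Dropping from 15 features to 2 cuts the count by more than a factor of ten, which is why the BIC minimum can move when the feature set changes.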

The cross-validation alternative we don't use

For most ML problems we'd say “just do k-fold cross-validation.” For HMMs on sequential financial data, that's problematic:

  • Naive CV breaks time order — you'd train on future bars and test on past, which is leakage.
  • Walk-forward CV (the time-respecting version) is expensive: each fold needs a full HMM refit, which is minutes of compute.
  • And empirically: BIC and walk-forward CV usually agree on the same K for these models, so the cheaper proxy wins.

We do use walk-forward backtesting downstream (to evaluate strategies, not the HMM itself). But for picking K, BIC is the right tool.
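For reference, this is roughly what "time-respecting" means in index terms. It is a sketch of a generic expanding-window split, not our backtester; the function name, fold count, and test size are illustrative assumptions.

```python
from typing import Iterator, Tuple


def walk_forward_splits(
    n_obs: int, n_folds: int, test_size: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges where train always precedes test."""
    first_cut = n_obs - n_folds * test_size
    if first_cut <= 0:
        raise ValueError("not enough observations for the requested folds")
    for fold in range(n_folds):
        cut = first_cut + fold * test_size
        # Train on everything before the cut, test on the next block of bars.
        yield range(0, cut), range(cut, cut + test_size)


for train, test in walk_forward_splits(n_obs=8500, n_folds=4, test_size=500):
    print(f"train [0, {train.stop})  test [{test.start}, {test.stop})")
```

Each fold would need its own HMM refit on the training range, which is where the compute cost comes from.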

Things BIC doesn't tell you

BIC is a model-selection criterion. It picks the best K among the candidates you tried. It does NOT tell you:

  • Whether the model is good in absolute terms. A BIC of 204,000 is meaningless without a comparison. It's only useful relative to other models on the same data.
  • Whether the regimes are interpretable. BIC just measures fit + complexity. The fit might be high because the model picked up on a single anomalous period and assigned it a private state. Visual checks (plot the regime ribbon, check the transition matrix) catch this.
  • Whether your features are right. BIC can only pick the best K for the features you gave it. If your features can't distinguish between regimes that matter, no K will fix that.

Our practical recipe

When the admin trainer publishes a new model:

  1. Fit candidates n_states ∈ [3, 4, 5, 6, 7], each from 4 random initializations to reduce the risk of landing in a bad local optimum.
  2. Pick the lowest-BIC fit across all candidates.
  3. Run validation gates — degenerate states, finite log-likelihood, label coverage.
  4. If gates pass, publish. If not, the trainer refuses with a clear error.

These validation gates are the safety net BIC doesn't provide. BIC says “7 is the best K”; the gates ask “but is the 7-state fit actually any good?” Both checks have to pass before the model lands in the catalog and the fleet's bots pull it.
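A minimal sketch of how steps 1 to 4 could fit together. This is not the trainer's actual code: `Fit`, `passes_gates`, `select_and_validate`, the `fit_fn` callback, and the 1% occupancy threshold are all hypothetical names and values chosen for illustration. The gate checks are one plausible reading of "degenerate states" and "finite log-likelihood"; label coverage is left out because this post doesn't define it.

```python
import math
from typing import Callable, NamedTuple, Sequence


class Fit(NamedTuple):
    n_states: int
    seed: int
    log_likelihood: float
    bic: float
    state_labels: Sequence[int]  # decoded state index per training bar


def passes_gates(fit: Fit, min_share: float = 0.01) -> bool:
    """Assumed gates: finite log-likelihood and no (near-)empty state."""
    if not math.isfinite(fit.log_likelihood):
        return False
    n = len(fit.state_labels)
    if n == 0:
        return False
    counts = [0] * fit.n_states
    for state in fit.state_labels:
        counts[state] += 1
    return all(count / n >= min_share for count in counts)


def select_and_validate(
    fit_fn: Callable[[int, int], Fit],
    candidates: Sequence[int] = (3, 4, 5, 6, 7),
    n_inits: int = 4,
) -> Fit:
    """Fit every (n_states, seed) pair, keep the lowest BIC, then gate it."""
    fits = [fit_fn(k, seed) for k in candidates for seed in range(n_inits)]
    best = min(fits, key=lambda f: f.bic)
    if not passes_gates(best):
        raise ValueError(
            f"n_states={best.n_states} (seed {best.seed}) failed validation gates"
        )
    return best
```

Here `fit_fn(n_states, seed)` stands in for whatever actually trains the HMM and computes its BIC. Keeping it as a callback means the selection and gating logic can be exercised without a real model.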

A philosophical postscript

Information criteria like BIC are an underrated bit of statistical engineering. They give you a principled way to say “more parameters help, but only this much” — and they do it without needing held-out data, without iterating, and with a closed-form expression. For a problem where each model fit is expensive and data is precious, that's a great deal.

We don't pick K because we like the number 7. We pick K because the data, scored by a criterion that's been battle-tested across forty-seven years of statistical practice, says K=7. The next time we retrain, K might change. That's fine. The principle stays.
