Why Backtests Mislead: CAGR, Expectancy, Omega Ratio, and SQN

Win rate misleads. Sharpe rewards bad patterns. Arithmetic return overstates real gains. The four metrics production algo trading systems use instead — with live calculators.

Building the algorithm is the fun part. Getting the evaluation right is where things get genuinely hard — and where most capital gets quietly lost. This is exactly where production trading systems architecture diverges from backtest theater.

Not because the math is wrong. Because the metrics that feel right — the ones built into every backtesting platform, cited in every report, quoted in every “strategy review” — are built on assumptions that don’t hold in practice. They look credible, they produce impressive-sounding numbers, and they lead smart people to confidently deploy systems that slowly destroy capital.

This isn’t a beginner’s guide. It’s a focused breakdown of where standard metrics go wrong and what production quant systems use instead. Every claim has a live calculator attached — test the logic yourself.


1. CAGR vs Arithmetic Return: The Compounding Mistake That Costs Real Money

Ask most developers how their strategy performed: up 50% in year one, down 50% in year two. The obvious answer: “Average return is zero, I broke even.”

Wrong. They lost 25%.

Start with $100. Up 50%: $150. Down 50% from $150: $75. The arithmetic average is 0%, but you’re holding $75. The bigger the swings, the wider the gap gets — and it always goes the same direction: arithmetic overstates reality.

This is volatility drag. The correct measure is CAGR — the single constant rate that produces the same total compounded result:

\text{CAGR} = \left(\frac{V_{\text{end}}}{V_{\text{start}}}\right)^{1/N} - 1

  • V_{\text{end}} — portfolio value at the end of the measurement period
  • V_{\text{start}} — portfolio value at the start
  • N — number of periods (years, months — must match the return frequency)

Any backtest reporting “average monthly return” without compounding is flattering itself. The real number is lower, sometimes dramatically so.

Interactive calculator — capital evolution starting from $100:
  • Arithmetic mean: 0.00% — what your intuition says
  • CAGR (geometric): −13.40% — what actually happened

Try +100% / −50%. Arithmetic says you averaged +25%. CAGR says 0%. Your money agrees with CAGR.
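The gap is trivial to verify. A minimal Python sketch, no libraries, using the illustrative numbers from the text:

```python
def arithmetic_mean(returns):
    """Naive per-period average -- what intuition reports."""
    return sum(returns) / len(returns)

def cagr(returns):
    """Geometric (compounded) per-period rate -- what capital experiences."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth ** (1.0 / len(returns)) - 1.0

returns = [0.50, -0.50]          # +50% year one, -50% year two
print(f"arithmetic: {arithmetic_mean(returns):+.2%}")   # +0.00%
print(f"CAGR:       {cagr(returns):+.2%}")              # -13.40%
```

Swap in `[1.00, -0.50]` to reproduce the +100% / −50% case: arithmetic +25%, CAGR 0%.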


2. Win Rate vs Expectancy: Why 90% Win Rate Can Bankrupt You

High win rate feels good. Loss aversion is real — Kahneman and Tversky measured it; traders live it. The result is a whole class of overfit systems built around one goal: maximise the number of winning trades. Psychologically satisfying. Financially suicidal.

The math is brutal. A system winning 90% of trades at $10 per win and losing the other 10% at $200 per loss:

E = (0.90 \times \$10) - (0.10 \times \$200) = \$9 - \$20 = -\$11 \text{ per trade}

Mathematical Expectancy is the only metric that tells the truth:

E = (W \cdot \bar{w}) - (L \cdot \bar{l})

  • W — win rate (fraction of trades that close in profit)
  • \bar{w} — average profit on a winning trade
  • L = 1 - W — loss rate
  • \bar{l} — average loss on a losing trade (positive number)

Trend-following systems — some of the most robust in existence — typically win 35–40% of trades with reward-to-risk ratios of 3:1 or 4:1. The math works because rare big wins dwarf frequent small losses. A 30% win rate can easily outperform a 90% win rate strategy if the R:R is right.

Interactive calculator — projected over 100 trades:
  • Win rate: 90%
  • Average win: $10
  • Average loss: $100
  • Expectancy per trade: −$1.00 — the math is working against you

Drag win rate to 90%, avg loss to $200, avg win to $10. Then watch. Then flip it: win rate 35%, avg win $300. Different strategy, much better business.
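Expectancy takes one line to compute. A minimal sketch using the figures from the text:

```python
def expectancy(win_rate, avg_win, avg_loss):
    """Expected P&L per trade: E = W*w - (1-W)*l (avg_loss as a positive number)."""
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss

# The 90%-win-rate trap from the text
print(expectancy(0.90, 10, 200))    # -11.0 per trade
# A 35%-win-rate trend follower with 3:1 reward-to-risk
print(expectancy(0.35, 300, 100))   # 40.0 per trade
```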


3. The Sharpe Ratio Problem: It Penalises Your Best Trades

Sharpe is the industry standard. Every fund uses it. Most reporting mandates it. It divides excess return by standard deviation of returns — “how much return per unit of volatility?” Above 1.0 is good, above 2.0 is excellent.

The problem is fundamental: standard deviation is symmetric. It penalises deviation above the mean the same as deviation below it.

If your strategy returned +4% when you expected +1%, Sharpe treats that as harm — exactly as it treats a −4% month. The formula cannot distinguish “my strategy had an exceptional month” from “my strategy blew up.” It punishes outliers regardless of sign.

More practically: Sharpe says nothing about how long you stay in a drawdown. A strategy can score Sharpe 1.5 while sitting 25% underwater for 14 months. The standard deviation calculation won’t show that. You will.

Chart: Strategy A — +15%, max DD −5%. Strategy B — +15%, max DD −38% (Sharpe comparable).
Same destination. Strategy B’s Sharpe can look fine if its upside volatility inflates the denominator alongside the downside. Meanwhile, nobody who lived through −38% stayed calm.

The better alternatives:

  • Sortino Ratio — divides excess return by downside deviation only, so upside surprises stop counting as risk
  • Calmar Ratio — return relative to maximum drawdown, capturing the pain Sharpe hides
  • Ulcer Index — weights drawdowns by both depth and duration

We still report Sharpe. We make decisions on Calmar and Sortino.
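The asymmetry is easy to demonstrate. A sketch comparing Sharpe and Sortino on a fat-right-tail return series (illustrative numbers; risk-free rate and MAR assumed zero):

```python
import statistics

def sharpe(returns, rf=0.0):
    """Mean excess return over the standard deviation of ALL returns (symmetric)."""
    excess = [r - rf for r in returns]
    return statistics.mean(excess) / statistics.pstdev(excess)

def sortino(returns, mar=0.0):
    """Mean excess return over downside deviation only (shortfalls below the MAR)."""
    excess = [r - mar for r in returns]
    downside_var = sum(min(e, 0.0) ** 2 for e in excess) / len(excess)
    return statistics.mean(excess) / downside_var ** 0.5

# One big winning month, several tiny losing ones: a fat right tail
rets = [0.10, -0.01, -0.01, 0.00, -0.01, 0.02]
print(f"Sharpe:  {sharpe(rets):.2f}")    # the +10% outlier inflates the denominator
print(f"Sortino: {sortino(rets):.2f}")   # only the small losses count as risk
```

The same series scores poorly on Sharpe and well on Sortino, because the "volatility" is almost entirely upside.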


4. Omega Ratio Explained: The Metric That Knows Your Threshold

Omega Ratio is what Sharpe should have been. Instead of treating all volatility as equally bad, it asks one precise question: given the return you need, does this strategy produce more gain-mass than loss-mass?

You set a threshold τ\tau called the MAR (Minimum Acceptable Return) — your cost of capital, your benchmark, whatever you actually need to beat. Omega is defined as:

\Omega(\tau) = \frac{\int_{\tau}^{+\infty}[1 - F(r)]\,dr}{\int_{-\infty}^{\tau} F(r)\,dr}

  • \tau (tau) — the MAR threshold: your minimum required return
  • F(r) — cumulative distribution function of trade returns
  • Numerator — probability mass of returns above \tau (the “good” area)
  • Denominator — probability mass of returns below \tau (the “bad” area)
  • \Omega > 1 means more mass lives above your threshold than below

The critical difference from Sharpe: move the MAR and Omega changes. Sharpe is blind to your threshold entirely.

Interactive chart — return distribution; above the MAR counts as gain, below counts as loss:
  • MAR threshold: 0% (drag to set your minimum acceptable return)
  • Sharpe Ratio: 1.24 — penalises upside equally to downside
  • Omega > 1 means the strategy produces more gain-mass than loss-mass above your threshold; Sharpe stays fixed regardless of how you move the MAR

Drag the MAR slider right — toward a higher required return. Omega drops. Sharpe stays frozen. This tells you something real: the strategy might look fine at \tau = 0\% but barely clear your actual capital cost at \tau = 6\%.

Omega is especially valuable for asymmetric return profiles — trend-followers with fat right tails, options strategies, anything where a handful of big wins carry the whole year. Sharpe is blind to that structure. Omega sees it clearly.
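For a finite sample of returns, the integrals reduce to sums over the empirical distribution. A minimal discrete estimator (the return series is illustrative):

```python
def omega(returns, mar=0.0):
    """Empirical Omega: total gain above the MAR / total shortfall below it."""
    gains = sum(max(r - mar, 0.0) for r in returns)
    losses = sum(max(mar - r, 0.0) for r in returns)
    return gains / losses if losses else float("inf")

rets = [0.10, -0.01, -0.01, 0.00, -0.01, 0.02]
print(f"Omega(0%):   {omega(rets, 0.0):.2f}")    # 4.00
print(f"Omega(0.5%): {omega(rets, 0.005):.2f}")  # drops as the threshold rises
```

Raising the MAR shrinks the numerator and grows the denominator simultaneously — exactly the threshold sensitivity Sharpe lacks.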


5. Van Tharp’s System Quality Number: Statistical Confidence for Your Edge

Expectancy answers “does this edge exist?” SQN answers the harder question: how much should I actually trust this result?

$20 Expectancy on 12 trades is noise. $20 Expectancy on 1,200 trades is a business. Van Tharp built SQN to capture that difference:

\text{SQN} = \frac{E}{\sigma} \cdot \sqrt{N}

  • E — mean expectancy per trade (average P&L per closed trade)
  • \sigma — standard deviation of individual trade P&L (not of periodic returns — of each trade’s dollar result)
  • N — total number of trades in the sample
  • \sqrt{N} — the confidence multiplier: grows only as confirmed evidence accumulates

The \sqrt{N} term is the whole point. It’s a statistical scaling factor that rewards large confirmed samples and penalises small ones. You cannot manufacture a high SQN by finding lucky parameters on 20 trades. The formula won’t let you.
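A sketch showing the √N effect directly (the trade results are illustrative):

```python
import statistics

def sqn(trade_pnls):
    """Van Tharp's SQN: (expectancy / std dev of trade P&L) * sqrt(N)."""
    e = statistics.mean(trade_pnls)
    s = statistics.stdev(trade_pnls)
    return (e / s) * len(trade_pnls) ** 0.5

# The same per-trade mix, observed at different sample sizes
sample = [120, -80, 40, -60, 200, -90, 30, -50, 110, -70]
print(round(sqn(sample), 2))        # 10 trades: low confidence
print(round(sqn(sample * 10), 2))   # 100 trades of the same mix: higher SQN
print(round(sqn(sample * 100), 2))  # 1,000 trades: the sqrt(N) term does its work
```

The edge per trade never changes; only the evidence for it does — and the score tracks the evidence.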

Interactive calculator: with N = 100 trades and a $20 expectancy per trade, the sample system scores SQN 2.0 — a “good” system on Van Tharp’s scale.

More trades = higher confidence in the expectancy estimate. A $20 expected return on 10 trades is luck. On 1,000 trades, it's a machine.

Drag N up with the same expectancy. The score climbs — not because the edge got better, but because you have more evidence that it’s real. That’s the correct behaviour.

Van Tharp’s thresholds:

  • below 1.6 — statistically insufficient to trade
  • 1.6–1.9 — questionable
  • 2.0–2.4 — good
  • 2.5–2.9 — excellent
  • 3.0+ — “holy grail” territory: verify data quality and methodology before believing it

One more thing: each additional free parameter you optimise is a tax on your SQN. A 6-parameter system needs substantially more out-of-sample confirmation than a 2-parameter one. The formula doesn’t account for degrees of freedom — you have to.


6. The Fitness Function: What to Actually Optimise

When you set an optimiser loose on a strategy, what objective do you give it?

Not net profit. Never net profit. An optimiser handed net profit will find parameters that latch onto two or three lucky trades in the historical data, produce a spectacular-looking backtest, and promptly fail in live markets when those specific conditions don’t repeat. Every time.

The right answer is a composite fitness function that rewards multiple dimensions simultaneously:

\text{Fitness} = \frac{\text{PF} \times \text{SQN}}{\text{MaxDD\%}}

  • PF — Profit Factor: sum of all winning trades ÷ sum of all losing trades
  • SQN — System Quality Number (as above)
  • MaxDD% — Maximum Drawdown as a percentage of equity peak
Each component closes a different loophole: PF resists lucky outliers, SQN requires statistical confirmation, MaxDD% penalises catastrophic paths.

Profit Factor (Σ gross wins ÷ Σ gross losses) is resistant to outliers in a way that net profit isn’t. One massive lucky trade barely moves Profit Factor. It inflates net profit enormously. Above 1.5 consistently means the system earns more than it loses in aggregate. Below 1.0, the system is structurally losing.

Recovery Factor (total net profit ÷ max drawdown) tells you how many times the system has “worked off” its own worst loss. A value above 3.0 on a meaningful sample means real resilience — the system didn’t just recover once, it did it repeatedly.

Max Drawdown % in the denominator forces the optimiser to prefer a slightly less profitable path with shallower losses. Without it, optimisers will happily trade any amount of drawdown for marginal profit gains.
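The composite fits in a few helper functions. A minimal sketch — the trade list and starting equity are illustrative assumptions:

```python
import statistics

def profit_factor(pnls):
    """Sum of winning trades divided by the absolute sum of losing trades."""
    wins = sum(p for p in pnls if p > 0)
    losses = -sum(p for p in pnls if p < 0)
    return wins / losses if losses else float("inf")

def sqn(pnls):
    """(expectancy / std dev of trade P&L) * sqrt(N)."""
    return statistics.mean(pnls) / statistics.stdev(pnls) * len(pnls) ** 0.5

def max_drawdown_pct(curve):
    """Deepest peak-to-trough drop as a percentage of the running equity peak."""
    peak, worst = curve[0], 0.0
    for v in curve:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst * 100

def fitness(pnls, start_equity=10_000.0):
    """Composite objective: PF * SQN / MaxDD%."""
    curve, equity = [], start_equity
    for p in pnls:
        equity += p
        curve.append(equity)
    dd = max_drawdown_pct(curve)
    return profit_factor(pnls) * sqn(pnls) / dd if dd else float("inf")

pnls = [120, -80, 40, -60, 200, -90, 30, -50, 110, -70]
print(round(fitness(pnls), 3))
```

An optimiser maximising this score has to improve profitability, consistency, and path quality together — it cannot trade one for the others.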

At this point, these principles stop being “advice” and become operating policy. Here is the same logic as an executable playbook:

Practical Playbook

5 rules that turn metric theory into robust strategy selection

Each rule maps to a common optimiser failure mode and shows what to do instead.

Rule 1

Optimise a composite objective, never raw net profit

Profit-only optimisation overfits to a few historical outliers. Composite scoring keeps path quality, consistency, and downside risk in the loop.

Rule 2

Run Monte Carlo before forward-testing

A single backtest is one path. Monte Carlo permutations expose the drawdown envelope you should expect in live conditions.
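One common form of this permutation test resamples trade order with replacement and records the worst drawdown per path. A sketch — the trade list, path count, and 95th-percentile choice are illustrative assumptions, not a prescribed methodology:

```python
import random

def monte_carlo_drawdown(trade_pnls, n_paths=1000, start_equity=10_000.0, seed=42):
    """Resample trade order with replacement; return the 95th-percentile max drawdown."""
    rng = random.Random(seed)
    worst_per_path = []
    for _ in range(n_paths):
        equity = peak = start_equity
        worst = 0.0
        for _ in range(len(trade_pnls)):
            equity += rng.choice(trade_pnls)       # one resampled trade
            peak = max(peak, equity)
            worst = max(worst, (peak - equity) / peak)
        worst_per_path.append(worst)
    worst_per_path.sort()
    return worst_per_path[int(0.95 * n_paths)]

pnls = [120, -80, 40, -60, 200, -90, 30, -50, 110, -70]
print(f"95th-percentile max drawdown: {monte_carlo_drawdown(pnls):.2%}")
```

The single backtested drawdown is one draw from this distribution — usually a flattering one.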

Rule 3

Hard gate at SQN 2.0 before forward allocation

Small samples are noisy. SQN grows with evidence; confidence bands shrink as trade count rises. Gate capital by statistical confidence, not excitement.

Rule 4

Report Sharpe, decide by Sortino and Omega

Two strategies can have similar Sharpe and radically different downside behaviour. Sortino and Omega keep focus on the losses that actually hurt you.

Example: Strategy A — Sharpe 1.42 · Sortino 1.08 · Max DD −28%. Strategy B — Sharpe 1.44 · Sortino 2.31 · Max DD −11%.
Rule 5

Count free parameters and charge an overfitting tax

In-sample scores usually rise with complexity. Out-of-sample quality peaks early and then degrades. More parameters demand materially stronger confirmation.

When those five rules are enforced together, the optimiser shifts from “best historical score” to “most likely to survive contact with reality.” That shift is what turns a pretty backtest into a strategy worth allocating real capital to.

10-minute strategy review checklist
  • Minute 1-2: Recompute CAGR vs arithmetic return. If the gap is wide, your volatility drag is already warning you.
  • Minute 3: Check Expectancy, not win rate. Any negative expectancy is an instant reject.
  • Minute 4-5: Replace Sharpe-only judgment with Sortino/Omega + max drawdown depth and duration.
  • Minute 6-7: Enforce SQN gate. Below 2.0: no forward capital, no exceptions.
  • Minute 8-10: Count free parameters and run Monte Carlo to estimate worst realistic drawdown before live deployment.
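The checklist can be encoded as a screening gate. This is a hypothetical sketch — the function name and the numeric thresholds are illustrative, not taken from any production system:

```python
def review_gate(cagr_val, arith_mean, expectancy_val, sqn_val, max_dd_pct):
    """Hypothetical strategy-review gate; thresholds are illustrative."""
    reasons = []
    if arith_mean - cagr_val > 0.05:        # minutes 1-2: volatility drag
        reasons.append("wide arithmetic-vs-CAGR gap: volatility drag warning")
    if expectancy_val <= 0:                 # minute 3: expectancy check
        reasons.append("non-positive expectancy: instant reject")
    if sqn_val < 2.0:                       # minutes 6-7: SQN gate
        reasons.append("SQN below 2.0: no forward capital")
    if max_dd_pct > 30:                     # minutes 8-10: drawdown depth
        reasons.append("max drawdown too deep for live deployment")
    return len(reasons) == 0, reasons

passed, reasons = review_gate(cagr_val=0.12, arith_mean=0.14,
                              expectancy_val=25.0, sqn_val=2.3, max_dd_pct=18)
print(passed)   # True
```

A gate like this turns the review from a judgment call into a reproducible pass/fail — the same shift the playbook argues for.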

7. How This Works in Production Algo Trading Systems

These metrics aren’t just theoretical. In the trading systems we build — MTRobot and Steve Trading Bot — they’re embedded in how strategy selection, position sizing, and live monitoring actually work.

In Steve Trading Bot, every strategy passes through a Profit Factor and Expectancy gate before live allocation. A strategy that clears the Sharpe threshold but fails the Expectancy check doesn’t get deployed. The check is automatic, not discretionary.

In MTRobot, users run multiple strategies across multiple broker accounts simultaneously. Position sizing per strategy is calibrated partly on its SQN score — higher statistical confidence, larger allocation. A strategy with SQN below 1.6 gets flagged regardless of headline return figures, because the signal-to-noise ratio isn’t there yet.

Live monitoring goes beyond alerts. When a strategy’s Calmar Ratio degrades past a threshold — returns stagnate while drawdowns grow — the system halts new position opens automatically. Not an alert that gets ignored; an actual halt. This is the operational form of the metrics above.

For more on the engineering behind production trading infrastructure, see Trading Systems & Platforms.


FAQ

What is CAGR and why does it matter for backtesting? CAGR (Compound Annual Growth Rate) is the actual compounded annual return of a strategy. Unlike arithmetic average return, it accounts for the compounding effect of sequential gains and losses. Arithmetic averaging systematically overstates performance whenever returns vary across periods — which they always do. Use CAGR for any multi-period performance comparison.

What is a good Mathematical Expectancy for a trading strategy? Any positive Expectancy is a necessary (but not sufficient) condition for viability. What matters more is Expectancy confirmed by sufficient trade volume — measured by SQN. A $5 Expectancy on 10 trades per day beats a $50 Expectancy on 1 trade per week at the same account size. Normalise by frequency and capital deployed, and confirm with SQN.

Why is win rate a misleading metric in algo trading? Win rate measures frequency of wins, not magnitude. A 95% win rate with average loss 20× the average win will reliably lose money. The complete metric is Expectancy: (W \cdot \bar{w}) - (L \cdot \bar{l}). Win rate is only meaningful alongside R:R ratio. In isolation, it tells you almost nothing about long-term profitability.

What are the main problems with the Sharpe Ratio? Three main issues: (1) It treats upside and downside volatility symmetrically — exceptional profits hurt your Sharpe as much as exceptional losses. (2) It says nothing about drawdown duration — you can have an acceptable Sharpe while sitting 30% underwater for 18 months. (3) It’s benchmark-agnostic — it can’t tell you whether the strategy clears your actual required return. Sortino fixes issue one; Calmar and Ulcer Index fix issue two; Omega Ratio addresses all three.

What is a good SQN score for a trading system? Van Tharp’s scale: below 1.6 is statistically insufficient to trade; 2.0–2.4 is good; 2.5–2.9 is excellent; 3.0+ is extremely rare in live markets. Treat anything above 3.0 with curiosity — verify your data quality and methodology (look-ahead bias, survivorship bias) before concluding you have exceptional edge.

What is the Omega Ratio and how does it improve on Sharpe? Omega measures the ratio of probability mass above your minimum acceptable return (\tau) to mass below it. Unlike Sharpe, it (1) treats upside and downside asymmetrically, (2) responds to your actual return threshold, and (3) captures the full shape of the return distribution rather than just its standard deviation. Particularly valuable for asymmetric strategies — trend-followers, options, or any system where a few large wins dominate.

