Production ML Failure Modes
Here’s an uncomfortable truth: most ML failures in production are not model problems. They are system problems. The model that hit 95% accuracy in your notebook can silently degrade to 60% in production, and you might not notice for months.
Understanding these failure modes is essential for anyone building production AI systems.
The five failure modes that kill ML projects
1. Data quality degradation
Your model was trained on clean, curated data. Production data is messy, incomplete, and constantly changing.
Common symptoms:
- Missing fields that were always present in training
- New categories appearing in categorical features
- Numeric distributions shifting outside training ranges
- Upstream system changes breaking data contracts
Real example: a CV pipeline collapsed in the field after camera firmware updates changed the image preprocessing; see AgrigateVision for how we addressed this.
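As a concrete illustration, here is a minimal validation sketch for catching these symptoms at the door, before records reach the model. The field names, ranges, and categories are hypothetical stand-ins for whatever your training data actually contained.

```python
# Minimal input-validation sketch (hypothetical schema and thresholds).
# Checks arriving records against expectations captured from training data.

EXPECTED_SCHEMA = {
    "acreage": (0.1, 10_000.0),             # numeric field: (min, max) seen in training
    "crop_type": {"corn", "soy", "wheat"},  # categorical field: known categories
}

def validate_record(record: dict) -> list[str]:
    """Return a list of data-quality issues for one production record."""
    issues = []
    for field, expectation in EXPECTED_SCHEMA.items():
        if field not in record or record[field] is None:
            issues.append(f"missing field: {field}")
            continue
        value = record[field]
        if isinstance(expectation, tuple):   # numeric range check
            low, high = expectation
            if not (low <= value <= high):
                issues.append(f"{field}={value} outside training range [{low}, {high}]")
        elif value not in expectation:       # categorical membership check
            issues.append(f"unseen category for {field}: {value!r}")
    return issues

# Example: a record with an unseen crop type and a missing numeric field
print(validate_record({"crop_type": "barley"}))
```

Running a check like this on every batch (or a sample of live requests) turns silent data decay into an explicit, alertable signal.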
2. Pipeline drift
The silent killer. Your data pipelines change, your feature engineering evolves, but your model expects the old format.
Warning signs:
- Feature values fall outside expected ranges
- Null rates increase unexpectedly
- Processing latency spikes without model changes
- A/B tests show inconsistent results
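A cheap way to catch this before it shows up in business metrics is a population stability index (PSI) check comparing live feature values against a reference sample from training. A minimal sketch follows; the 0.2 alert threshold is a common rule of thumb, not a universal constant.

```python
import numpy as np

def psi(reference: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between training reference and live feature values."""
    # Bin edges come from the reference (training) distribution.
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    live_frac = np.histogram(live, bins=edges)[0] / len(live)
    # Small floor avoids division by zero and log(0) on empty bins.
    ref_frac = np.clip(ref_frac, 1e-6, None)
    live_frac = np.clip(live_frac, 1e-6, None)
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))

reference = np.random.normal(0, 1, 10_000)   # stand-in for a training feature
live = np.random.normal(0.5, 1, 10_000)      # shifted production feature
if psi(reference, live) > 0.2:               # rule-of-thumb alert threshold
    print("feature drift alert")
```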
3. Monitoring gaps
You can’t fix what you can’t see. Most teams monitor infrastructure (CPU, memory, latency) but not model health.
What to monitor:
- Input distribution: Are production inputs similar to training?
- Prediction distribution: Is the model outputting expected patterns?
- Performance metrics: Accuracy, precision, recall on live data
- Business metrics: The actual outcomes you care about
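For the model-health layer in particular, here is a sketch of a prediction-distribution check, assuming a classifier whose predictions and confidence scores you log; the shift and confidence thresholds are illustrative.

```python
from collections import Counter

def prediction_health(baseline_preds: list[str], live_preds: list[str],
                      live_confidences: list[float],
                      max_shift: float = 0.15, min_mean_conf: float = 0.6) -> list[str]:
    """Compare live prediction behaviour against a baseline window of predictions."""
    alerts = []
    base, live = Counter(baseline_preds), Counter(live_preds)
    for label in set(base) | set(live):
        base_rate = base.get(label, 0) / max(len(baseline_preds), 1)
        live_rate = live.get(label, 0) / max(len(live_preds), 1)
        if abs(live_rate - base_rate) > max_shift:
            alerts.append(f"class '{label}' rate moved {base_rate:.2f} -> {live_rate:.2f}")
    mean_conf = sum(live_confidences) / max(len(live_confidences), 1)
    if mean_conf < min_mean_conf:
        alerts.append(f"mean confidence dropped to {mean_conf:.2f}")
    return alerts
```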
4. Ownership ambiguity
When the model breaks at 2 AM, who gets paged? ML systems span data engineering, ML engineering, and platform teams. Without clear ownership, issues fall through cracks.
Questions to answer:
- Who owns the training pipeline?
- Who owns the serving infrastructure?
- Who owns data quality upstream?
- Who is accountable for model performance?
5. Training-serving skew
The model sees different data in production than it did during training. This is surprisingly common and notoriously hard to debug.
Sources of skew:
- Different preprocessing code paths
- Stale features from slow-updating stores
- Time-based features with timezone bugs
- Different library versions in training vs. serving
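The most reliable defence is a single preprocessing implementation imported by both the training job and the serving service, backed by a parity check that recomputes features from logged raw inputs. A sketch with hypothetical feature logic:

```python
# features.py: the single preprocessing implementation, imported by BOTH the
# training pipeline and the serving service (feature logic is hypothetical).
import math
from datetime import timezone

def preprocess(raw: dict) -> dict:
    """Turn one raw record into model features; shared by training and serving."""
    ts = raw["event_ts"].astimezone(timezone.utc)   # normalise timezones in exactly one place
    return {
        "log_amount": math.log1p(raw["amount"]),
        "hour_utc": ts.hour,
    }

def count_skew(logged_serving_features: list[dict], raw_rows: list[dict]) -> int:
    """Recompute features offline and count mismatches against what serving actually used."""
    return sum(
        preprocess(raw) != served
        for served, raw in zip(logged_serving_features, raw_rows)
    )
```

A scheduled parity job that reports a nonzero `count_skew` is often the first concrete evidence that the two code paths have diverged.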
Domain-specific failure patterns
Different domains have unique failure modes:
Computer Vision
CV pipelines fail when input conditions change:
- Lighting variations not seen in training
- Camera angle or lens changes
- Object occlusion patterns
- Seasonal appearance changes (agriculture, outdoor)
See Computer Vision in Applied AI for our approach to building robust CV systems.
Trading Systems
Trading systems fail spectacularly under market stress:
- Models trained on calm markets break during volatility
- Latency spikes cause stale signals
- Execution slippage exceeds backtest assumptions
- Risk controls triggered too late
See Steve — Trading Bot and Trading Systems & Platforms for production trading patterns.
LLM/RAG Systems
LLM systems fail in novel ways:
- Retrieval quality collapse under new query patterns
- Context window misuse leading to “lost in the middle” failures
- Prompt injection and adversarial inputs
- Cost spiraling from inefficient token usage
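For the last two points, one simple mitigation is a hard token budget enforced before every model call. A sketch, with `count_tokens` standing in for whichever tokenizer your model provider supplies:

```python
def build_context(chunks: list[str], question: str, count_tokens,
                  budget: int = 6_000) -> str:
    """Pack the highest-ranked retrieved chunks into a fixed token budget.

    `chunks` are assumed to arrive sorted by retrieval score; `count_tokens`
    is a hypothetical callable returning the token count of a string.
    """
    used = count_tokens(question)
    kept = []
    for chunk in chunks:          # best chunks first, so truncation drops the weakest evidence
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return "\n\n".join(kept)
```

Because the chunks are ranked before packing, hitting the budget drops the weakest evidence first instead of truncating mid-document, and the budget itself caps per-request cost.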
Building resilient systems
The antidote to these failures is systematic engineering aligned with Applied AI delivery:
1. Establish baselines before launch
Define and measure:
- Expected input distributions
- Baseline model performance
- Latency percentiles (P50, P95, P99)
- Business metric targets
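Here is a sketch of capturing that baseline as a versioned artifact at launch time, so later monitoring has something concrete to compare against; the statistics chosen and the file format are illustrative.

```python
import json
import numpy as np

def capture_baseline(features: dict[str, np.ndarray], latencies_ms: np.ndarray,
                     eval_accuracy: float, path: str = "baseline.json") -> dict:
    """Snapshot pre-launch reference statistics for later comparison in monitoring."""
    baseline = {
        "features": {
            name: {"mean": float(values.mean()),
                   "p01": float(np.percentile(values, 1)),
                   "p99": float(np.percentile(values, 99))}
            for name, values in features.items()
        },
        "latency_ms": {f"p{p}": float(np.percentile(latencies_ms, p)) for p in (50, 95, 99)},
        "eval_accuracy": eval_accuracy,
    }
    with open(path, "w") as f:
        json.dump(baseline, f, indent=2)
    return baseline
```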
2. Implement comprehensive monitoring
Layer your monitoring:
- Infrastructure: CPU, memory, latency
- Pipeline health: Data freshness, feature completeness
- Model health: Prediction distributions, confidence scores
- Business outcomes: The metrics that matter to stakeholders
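One way to keep the layers explicit is a small table of alert rules per layer, so an infrastructure page and a model-health page route to the right people. A sketch with illustrative metric names and thresholds:

```python
# Layered alert rules (illustrative names and thresholds). Each tuple is
# (metric name, direction, threshold): "above" alerts when the metric exceeds
# the threshold, "below" when it falls under it.
ALERT_RULES = {
    "infrastructure": [("p99_latency_ms", "above", 500), ("error_rate", "above", 0.01)],
    "pipeline":       [("feature_null_rate", "above", 0.05), ("data_age_minutes", "above", 60)],
    "model":          [("input_psi", "above", 0.2), ("mean_confidence", "below", 0.6)],
    "business":       [("conversion_rate", "below", 0.02)],
}

def violations(layer: str, metrics: dict[str, float]) -> list[str]:
    """Return human-readable alerts for one monitoring layer."""
    alerts = []
    for name, direction, threshold in ALERT_RULES[layer]:
        value = metrics.get(name)
        if value is None:
            continue
        breached = value > threshold if direction == "above" else value < threshold
        if breached:
            alerts.append(f"[{layer}] {name}={value} breached ({direction} {threshold})")
    return alerts
```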
3. Define clear ownership
Create explicit contracts:
- SLAs for data quality from upstream teams
- On-call rotations for model serving
- Escalation paths for production incidents
- Regular review cadence for model performance
4. Build graceful degradation
When things break (and they will):
- Fall back to simpler models or rules
- Return explicit uncertainty signals
- Fail open or closed based on business requirements
- Alert humans when confidence is low
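A minimal sketch of the fallback pattern, assuming a primary model with a hypothetical `predict(features) -> (label, confidence)` interface and a rule-based baseline; the confidence threshold and the choice between failing open or closed are business decisions, not constants.

```python
import logging

logger = logging.getLogger("serving")

def predict_with_fallback(primary_model, rule_baseline, features: dict,
                          min_confidence: float = 0.7):
    """Serve the primary model; fall back to rules on errors or low confidence."""
    try:
        label, confidence = primary_model.predict(features)   # hypothetical interface
    except Exception:
        logger.exception("primary model failed; falling back to rules")
        return rule_baseline(features), "fallback:error"
    if confidence < min_confidence:
        logger.warning("low confidence %.2f; falling back to rules", confidence)
        return rule_baseline(features), "fallback:low_confidence"
    return label, "primary"
```

Returning the serving path alongside the prediction makes it trivial to monitor how often the fallback fires, which is itself an early warning signal.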
Key takeaways
- Most failures are system failures: Focus on pipelines, not just models
- Monitor everything: Input distributions, predictions, and outcomes
- Own the system end-to-end: Clear accountability prevents gaps
- Plan for degradation: Build fallbacks before you need them
- Learn from incidents: Every failure is a chance to improve