The most common mistake in AI projects? Treating AI like a web service — a stateless endpoint you call and forget. This mental model works for CRUD APIs. It fails catastrophically for AI systems.
Applied AI requires a fundamentally different approach: treating AI as a living system with its own lifecycle, dependencies, and failure modes. Teams that understand this ship systems that work; teams that don't end up shipping demos that degrade. See concrete examples in MTRobot and Steve Trading Bot.
If you want the broad operational answer to why machine learning models degrade in production, start there. This article explains why the web-service mental model causes those failures in the first place.
The web service mental model
Traditional web services are relatively simple from an operational standpoint:
- Stateless: Each request is independent
- Deterministic: Same input produces same output
- Stable: Behavior doesn’t change without deploys
- Well-understood: Debugging follows familiar patterns
You build it, deploy it, monitor response codes and response time, and move on. SLA is uptime and latency. Done.
Why AI breaks every assumption of the web service model
AI systems violate all four assumptions above — and the violations are not edge cases. They are fundamental properties of how machine learning works.
1. State is everywhere
AI systems depend on:
- Training data (historical state)
- Feature stores (current state at inference time)
- Model weights (learned state)
- Context windows (session state for LLMs)
A “stateless” inference endpoint actually depends on gigabytes of hidden state. When any of that state changes — upstream data source, feature store freshness, model version — the output changes.
Real example: At MTRobot, the trading platform maintains persistent MT5 terminal connections, per-user execution contexts, and real-time position state. None of this is stateless — and treating it as such during design leads to race conditions and stale execution.
2. Non-determinism is the norm
Even with identical input:
- LLMs produce different outputs based on temperature sampling
- Models trained on different random seeds behave differently on edge cases
- Feature freshness affects predictions — a feature value from 5 minutes ago vs. 5 seconds ago may produce different scores
- A/B test routing creates divergent behavior paths
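The first point is easy to demonstrate. A minimal sketch of temperature sampling, using illustrative logits (this is a toy stand-in for a real LLM decoder, not any particular library's API): identical input, yet repeated calls disagree, while near-zero temperature collapses to the argmax.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Softmax-sample a token index from raw logits at a given temperature."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=weights)[0]

logits = [2.0, 1.0, 0.5]  # the same "input" on every call
rng = random.Random(42)

# Temperature 1.0: repeated calls on identical input disagree with each other.
warm = [sample_with_temperature(logits, 1.0, rng) for _ in range(50)]
# Temperature near zero: every call collapses to the argmax (index 0).
cold = [sample_with_temperature(logits, 0.01, rng) for _ in range(50)]
```

Even this toy shows why "same input, same output" is not a safe assumption for any component that samples.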
3. Silent degradation — the dangerous one
Web services fail loudly: 500 errors, timeouts, stack traces. AI systems fail quietly:
- Accuracy drops 10% over 3 months — no error thrown
- Edge cases get worse while aggregate metrics look stable
- Confidence scores drift upward while outcomes worsen
Real example: At Steve Trading Bot, market regime shifts can invalidate a model’s signals without producing any system error. The model continues generating predictions with the same confidence — but the underlying market structure has changed. Without explicit regime detection and monitoring, you don’t know until you’ve taken losses.
This is why model skew detection is a core MLOps concern, not an optional monitoring add-on. It is one concrete slice of the broader problem of why models fail after deployment.
4. Novel failure modes with no web service equivalent
AI introduces failure categories that simply don’t exist for traditional services:
- Data drift: Input distribution shifts away from training distribution
- Concept drift: The relationship between inputs and outputs changes
- Training-serving skew: Model sees different features in production than in training
- Adversarial inputs: Crafted inputs that exploit model blind spots
- Hallucination: LLMs confidently generating factually wrong content
See why ML models degrade in production for a comprehensive breakdown of each.
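Data drift, the first failure mode above, is detectable with a plain two-sample test. A minimal sketch using a hand-rolled Kolmogorov-Smirnov statistic — the Gaussian data and the alert threshold are illustrative, not calibrated for a real system:

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

rng = random.Random(0)
training = [rng.gauss(0.0, 1.0) for _ in range(500)]    # training-time feature
live_ok = [rng.gauss(0.0, 1.0) for _ in range(500)]     # production, no drift
live_drift = [rng.gauss(2.0, 1.0) for _ in range(500)]  # production, mean shifted

# Illustrative fixed threshold; a real system would use a calibrated
# critical value or a rolling baseline.
DRIFT_THRESHOLD = 0.25
```

In practice you would run this per feature on a schedule and alert when the statistic crosses the threshold — the point is that nothing in the serving path ever throws an error when drift happens; you have to go looking for it.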
The hidden costs of treating AI as a web service
When teams apply the web service mental model to AI, they pay these costs downstream:
Technical debt compounds faster. Hardcoded preprocessing in serving code diverges from training code. Schema changes in upstream data sources aren't communicated as breaking changes. Every "quick fix" in the serving layer creates another potential source of training-serving skew.
Retraining becomes expensive and risky. Without a clear data versioning strategy, retraining requires reconstructing the exact training dataset — which may no longer be possible if data sources have changed. Teams that don’t version training data eventually can’t reproduce past model behavior.
Monitoring overhead grows without structure. Adding monitoring reactively — after each incident — produces a pile of ad-hoc alerts with no coherent model health picture. Proactive monitoring architecture costs 1x to build. Reactive incident-driven monitoring costs 5-10x over time.
Rollback is harder than a web service. Rolling back a web service means deploying the previous Docker image. Rolling back an ML model means reverting model weights AND ensuring the data pipeline produces the same feature values the old model expects. Without versioned feature stores, this is painful.
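One way to make rollback tractable is to pin every model release to the feature schema it was trained against, and refuse a rollback unless the pipeline still serves matching features. A minimal sketch — the manifest fields, version strings, and paths are all illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelRelease:
    model_version: str
    weights_uri: str             # e.g. an object-store path to the weights
    feature_schema_version: str  # schema the model was trained against

def can_roll_back(target: ModelRelease, serving_schema_version: str) -> bool:
    """Reverting weights is only safe if the feature pipeline still
    produces the schema the older model expects."""
    return target.feature_schema_version == serving_schema_version

# Hypothetical releases: v2 was trained after a feature schema migration.
v1 = ModelRelease("v1", "s3://models/v1.bin", "features-2024-01")
v2 = ModelRelease("v2", "s3://models/v2.bin", "features-2024-06")
```

With a versioned feature store, the check above is one lookup; without it, "which features did v1 expect?" becomes an archaeology project.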
The system mindset shift
To build AI that works in production, shift from “endpoint” to “system” thinking.
Treat data as infrastructure
Data is not input — it’s infrastructure:
- Version your datasets like code (DVC, Delta Lake, or even S3 + manifest files)
- Monitor data quality with automated schema and distribution tests at ingestion
- Track lineage from source to prediction
- Build explicit contracts with upstream data providers — they are dependencies
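The "automated schema tests at ingestion" point can be as simple as a gate that rejects a batch violating the declared contract. A minimal sketch — the contract shape (column name mapped to type and optional bounds) and the field names are illustrative:

```python
def validate_batch(rows, schema):
    """Reject an ingestion batch that violates the declared data contract.

    `schema` maps column name -> (expected type, (min, max) bounds or None).
    Returns a list of human-readable violations; an empty list means pass.
    """
    errors = []
    for i, row in enumerate(rows):
        for col, (col_type, bounds) in schema.items():
            if col not in row:
                errors.append(f"row {i}: missing column '{col}'")
                continue
            value = row[col]
            if not isinstance(value, col_type):
                errors.append(f"row {i}: '{col}' has type {type(value).__name__}")
            elif bounds is not None and not (bounds[0] <= value <= bounds[1]):
                errors.append(f"row {i}: '{col}'={value} outside {bounds}")
    return errors

# Hypothetical contract for a price feed.
SCHEMA = {"price": (float, (0.0, 1e6)), "symbol": (str, None)}
good = [{"price": 101.5, "symbol": "EURUSD"}]
bad = [{"price": -3.0, "symbol": "EURUSD"}, {"symbol": "GBPUSD"}]
```

The same gate is where distribution tests (like the KS check) belong: ingestion is the last point where you can refuse bad data instead of training or serving on it.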
Practical example from AgrigateVision: camera hardware is a data dependency. When the vendor pushed a firmware update that changed image preprocessing, the model’s input distribution shifted without any code change on our end. A contract with the hardware vendor and automated input monitoring would have caught it before it caused a pipeline failure.
Design for observability from day one
You need visibility into four layers:
- Input distributions: Are production inputs similar to training?
- Prediction distributions: Is the model behaving normally?
- Outcome tracking: Are predictions actually correct? (Requires ground truth labels)
- Pipeline health: Is data flowing with expected freshness and completeness?
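The four layers above can be condensed into a single periodic health snapshot. A minimal sketch, assuming illustrative thresholds and report fields (outcome labels often arrive late, so the accuracy slot may be empty):

```python
import statistics
import time

def health_snapshot(inputs, train_mean, train_stdev,
                    predictions, outcomes, last_event_ts,
                    now=None, max_lag_s=300):
    """One report covering all four monitoring layers.

    `outcomes` is a list of (prediction, ground_truth) pairs, possibly empty
    because labels lag. Thresholds and field names are illustrative.
    """
    now = time.time() if now is None else now
    input_z = abs(statistics.mean(inputs) - train_mean) / train_stdev
    report = {
        "input_drift_z": input_z,                        # layer 1: inputs
        "prediction_mean": statistics.mean(predictions), # layer 2: predictions
        "accuracy": None,                                # layer 3: outcomes
        "fresh": (now - last_event_ts) <= max_lag_s,     # layer 4: pipeline
    }
    if outcomes:
        correct = sum(1 for p, y in outcomes if p == y)
        report["accuracy"] = correct / len(outcomes)
    return report
```

Infrastructure dashboards answer "is the service up?"; a snapshot like this answers "is the model still right?" — and only the second question catches silent degradation.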
Plan for the full lifecycle
An AI system is never “done”:
| Phase | Activities |
|---|---|
| Development | Training, evaluation, iteration |
| Deployment | Serving, scaling, integration |
| Monitoring | Drift detection, alerting, segment analysis |
| Maintenance | Retraining, updating, deprecating |
The maintenance phase has no defined end. This is the most important difference from a web service — budgeting, staffing, and architecture must account for it.
Build feedback loops
Production data is your best training signal:
- Log model inputs and outputs
- Collect outcome labels when possible
- Build annotation pipelines for edge cases
- Use production data to drive retraining cadence
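The first and third bullets compose naturally: log every inference event, then mine the log for low-confidence cases worth annotating. A minimal sketch using JSON lines against an in-memory stream — the event schema and the confidence cutoff are illustrative:

```python
import io
import json
import time

def log_prediction(stream, features, prediction, confidence, ts=None):
    """Append one inference event as a JSON line (schema is illustrative)."""
    stream.write(json.dumps({
        "ts": time.time() if ts is None else ts,
        "features": features,
        "prediction": prediction,
        "confidence": confidence,
    }) + "\n")

def edge_cases(stream, max_confidence=0.6):
    """Pull low-confidence events back out as annotation candidates."""
    stream.seek(0)
    events = [json.loads(line) for line in stream]
    return [e for e in events if e["confidence"] < max_confidence]

log = io.StringIO()  # stands in for a real log sink (file, queue, table)
log_prediction(log, {"x": 1.0}, "buy", 0.92, ts=1.0)
log_prediction(log, {"x": 7.5}, "sell", 0.41, ts=2.0)
```

The same log doubles as the raw material for retraining: once labels arrive, the logged features and outcomes are exactly the rows your next training set needs.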
Domain-specific implications
The “AI is not a web service” principle manifests differently by domain:
Computer Vision
CV systems need:
- Input health monitoring (image quality metrics, exposure, noise)
- Drift detection for visual changes (seasonal, environmental, hardware)
- Edge device considerations — response time and connectivity constraints change the architecture
Our approach is detailed in Computer Vision in Applied AI.
Trading Systems
Trading bots need:
- Real-time risk controls that operate independently of ML predictions
- Reproducible backtests for auditability
- Market regime detection — the market changes and the model must know when it’s operating outside training conditions
- Execution quality monitoring (slippage, fill rates, latency)
See Trading Systems & Platforms for our approach.
LLM Applications
LLM systems need:
- Retrieval quality monitoring (RAG performance)
- Cost tracking per request (token usage)
- Safety guardrails (content filtering, prompt injection protection)
- User feedback loops
What changes in practice
Instead of: “We’ll build an API endpoint that returns predictions”
Think: “We’ll build a system that ingests data, trains models, serves predictions, monitors outcomes, and continuously improves”
Instead of: “The model is deployed, we’re done”
Think: “The model is deployed. Now we need to monitor, maintain, and iterate — indefinitely”
Instead of: “Our SLA is 99.9% uptime and sub-200ms response time”
Think: “Our SLA includes prediction accuracy, data freshness, segment-level performance, and business outcome targets”
- AI is a system, not an endpoint. State, non-determinism, and silent degradation are fundamental properties.
- Silent failures are worse than loud failures. Monitor predictions and outcomes, not just infrastructure.
- Data is infrastructure. Version it, contract it, monitor it — with the same rigor as code.
- Deployment is the beginning, not the end. Budget for monitoring, retraining, and maintenance — they never stop.
Frequently asked questions
Why do AI projects fail after deployment? The most common reasons: no monitoring for data drift or model degradation (so failures go undetected for weeks), training-serving skew (model sees different features in production than in training), no ownership clarity (nobody knows who to page when accuracy drops), and treating retraining as a one-time event. The underlying cause is applying web service assumptions to a system that violates all of them.
What is the difference between an AI system and a web service? A web service is stateless, deterministic, and stable — it only changes when you deploy new code. An AI system is stateful (depends on training data and feature stores), non-deterministic (same input can produce different outputs), and degrades silently over time as input distributions shift. These differences require a fundamentally different operational approach.
What is production ML monitoring? Production ML monitoring is the practice of tracking model health across four layers: input distributions (are inputs similar to training?), prediction distributions (is the model behaving normally?), outcome tracking (are predictions correct?), and business outcomes (are the metrics that matter improving?). Infrastructure monitoring (CPU, latency) is necessary but not sufficient.
What is training-serving skew in machine learning? Training-serving skew is when the model sees different data at inference time than it did during training — due to different preprocessing code, different feature versions, or different null handling. It’s one of the most common causes of production ML failures and is often invisible until you compare training feature distributions to live feature distributions.
How is AI delivery different from traditional software delivery? Traditional software delivery has a defined end state — you ship a version, it either works or it doesn’t. AI delivery is continuous: models degrade, data drifts, and retraining is an ongoing operational requirement. Success criteria include not just initial accuracy but sustained accuracy over time. Teams need data engineering, ML engineering, and platform capabilities working in coordination.