Case Notes: MobilEA
Hybrid AI systems — those combining computer vision, optimization algorithms, and real-time decision-making — represent some of the most complex production deployments. They also fail in the most interesting ways.
MobilEA integrated multiple AI components into a unified mobility workflow. This post captures the lessons learned. For the full context, see the MobilEA case study.
The integration challenge
A mobility system is only as reliable as its weakest component. When you combine:
- Computer vision for vehicle/asset detection
- Optimization algorithms for route planning
- Real-time orchestration for live coordination
- User interfaces for operator interaction
…you create a system where failures cascade in unexpected ways.
What we learned
Lesson 1: Interface contracts are non-negotiable
The seams between components are where hybrid systems fail. A CV model that emits bounding boxes implicitly assumes every downstream consumer interprets those boxes the same way. When assumptions differ:
- CV outputs pixel coordinates; optimization expects GPS
- Detection confidence scales differ between models
- Timestamp formats vary between systems
- Missing data handling is inconsistent
We learned to define explicit interface contracts:
- Schemas for every data exchange
- Validation at every boundary
- Clear error handling for malformed data
- Version management for evolving interfaces
This is a recurring Applied AI pattern — systems fail at boundaries.
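A minimal sketch of a boundary contract, assuming a hypothetical `Detection` schema (the field names and ranges are illustrative, not MobilEA's actual schema). The point is that malformed data is rejected at the seam instead of propagating downstream:

```python
from dataclasses import dataclass

# Hypothetical contract for one detection crossing the CV -> optimizer boundary.
@dataclass(frozen=True)
class Detection:
    lat: float          # GPS latitude in degrees (not pixel coordinates)
    lon: float          # GPS longitude in degrees
    confidence: float   # normalized to [0, 1] regardless of source model
    ts_ms: int          # Unix epoch milliseconds, the one agreed timestamp format

def validate_detection(raw: dict) -> Detection:
    """Validate at the boundary; reject malformed data instead of passing it on."""
    required = {"lat", "lon", "confidence", "ts_ms"}
    missing = required - raw.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not 0.0 <= raw["confidence"] <= 1.0:
        raise ValueError(f"confidence out of range: {raw['confidence']}")
    if not (-90.0 <= raw["lat"] <= 90.0 and -180.0 <= raw["lon"] <= 180.0):
        raise ValueError("coordinates look like pixels, not GPS")
    return Detection(raw["lat"], raw["lon"], raw["confidence"], raw["ts_ms"])
```

Versioning the schema itself (e.g. a `schema_version` field) is the natural next step once interfaces start evolving.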
Lesson 2: Orchestration complexity explodes
With multiple AI components, orchestration becomes its own problem:
- Dependency management: Which components need results from others?
- Timeout handling: What happens when one component is slow?
- Partial failures: How do you proceed when only some components succeed?
- State management: Where is the source of truth?
We implemented:
- Explicit dependency graphs between components
- Timeout budgets per component with fallback behaviors
- Partial result handling with degraded but functional outputs
- Event sourcing for complete state reconstruction
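The timeout-budget-with-fallback idea above can be sketched in a few lines of `asyncio`; the component names and budgets here are illustrative, not the production values:

```python
import asyncio

async def call_with_budget(coro, budget_s, fallback):
    """Run one component under its timeout budget; any error or timeout
    degrades to the fallback instead of blocking the whole pipeline."""
    try:
        return await asyncio.wait_for(coro, timeout=budget_s)
    except Exception:
        return fallback

async def slow_cv():
    await asyncio.sleep(1.0)            # simulates a CV call blowing its budget
    return {"detections": ["car"]}

async def orchestrate():
    cv = await call_with_budget(slow_cv(), budget_s=0.2,
                                fallback={"detections": []})
    # Proceed with a degraded-but-functional result rather than failing the request.
    return cv

result = asyncio.run(orchestrate())
```

In a real orchestrator each edge of the dependency graph would carry its own budget and fallback, and every decision would be emitted as an event for state reconstruction.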
Lesson 3: End-to-end latency is the constraint
Individual component latency looked good:
- CV inference: 150ms
- Optimization solver: 200ms
- API calls: 50ms each
But end-to-end paths compound:
- Sequential processing: 150 + 200 + 50 + 50 + 50 = 500ms
- Add network variability: P95 jumped to 1.2s
- Under load: P99 exceeded 3s
Sub-second UX required:
- Parallelizing independent operations
- Aggressive caching of intermediate results
- Speculative computation for likely scenarios
- Streaming partial results to UI
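The parallelization step is the cheapest win: independent stages run concurrently, so the critical path shrinks from the sum of latencies to the longest chain. A sketch using the simulated latencies from above (the stage names are illustrative):

```python
import asyncio

async def cv_inference():
    await asyncio.sleep(0.15)       # simulated 150ms CV stage
    return "boxes"

async def fetch_traffic():
    await asyncio.sleep(0.05)       # simulated 50ms API call
    return "traffic"

async def optimize_route():
    await asyncio.sleep(0.20)       # simulated 200ms solver stage
    return "route"

async def pipeline():
    # CV and the traffic lookup are independent -> run them concurrently;
    # only the optimizer truly depends on both results.
    boxes, traffic = await asyncio.gather(cv_inference(), fetch_traffic())
    route = await optimize_route()
    return boxes, traffic, route

out = asyncio.run(pipeline())
```

Here the first two stages cost max(150, 50) = 150ms instead of 200ms; the same reasoning, applied across every independent pair, is what gets a 500ms sequential path under the sub-second target.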
Lesson 4: Monitoring needs to be holistic
Component-level monitoring wasn’t enough:
- Each component showed green
- End-to-end user experience was poor
- Root cause wasn’t in any single component
We added:
- End-to-end transaction tracing (distributed tracing)
- Cross-component correlation IDs
- Business outcome monitoring (successful trips, not just API calls)
- SLOs defined at user journey level, not component level
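A toy sketch of the correlation-ID idea, assuming a dict-based event log (production systems would use a tracing library such as OpenTelemetry; the field names here are illustrative):

```python
import uuid

def new_trace():
    # One correlation ID per user journey, shared by every component it touches.
    return {"correlation_id": str(uuid.uuid4()), "events": []}

def record(trace, component, status):
    trace["events"].append({
        "correlation_id": trace["correlation_id"],
        "component": component,
        "status": status,
    })

trace = new_trace()
record(trace, "cv", "ok")
record(trace, "optimizer", "ok")
record(trace, "dispatch", "failed")

# A journey-level SLO looks at the whole trace, not any single green component.
trip_succeeded = all(e["status"] == "ok" for e in trace["events"])
```

This is exactly the "each component showed green" trap: two of three events are `ok`, yet the journey failed, and only the stitched trace shows it.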
Metrics snapshot
Typical performance ranges for production mobility orchestration:
| Metric | Range |
|---|---|
| Decision latency (critical paths) | Under 1 second |
| Orchestration service availability | 99.5–99.9% |
| End-to-end success rate | >95% of initiated operations |
| Fallback activation rate | Under 5% of requests |
Technical architecture patterns
Pattern 1: Circuit breakers everywhere
When a downstream component fails:
- Don’t retry forever (cascade failures)
- Open the circuit (fail fast)
- Provide degraded alternatives
- Close gradually as health returns
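A minimal circuit-breaker sketch, with illustrative thresholds; real implementations add a proper half-open state that admits probes gradually rather than all at once:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback()           # open: fail fast, skip the downstream call
            self.opened_at = None           # budget elapsed: close and try again
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # too many failures: open
                self.failures = 0
            return fallback()
        self.failures = 0
        return result
```

While the circuit is open, the fallback is returned without touching the failing component at all, which is what stops retries from cascading.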
Pattern 2: Bulkhead isolation
Separate components into isolated pools:
- Slow CV processing doesn’t block optimization
- Failed optimization doesn’t prevent basic routing
- User operations have reserved capacity
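One way to sketch bulkheads is a bounded pool per component class, so exhausting one pool sheds load without touching the others (pool names and sizes are illustrative):

```python
import threading

# Each component class gets its own bounded pool; a slow CV stage cannot
# exhaust the capacity reserved for optimization or user operations.
POOLS = {
    "cv": threading.BoundedSemaphore(4),
    "optimization": threading.BoundedSemaphore(4),
    "user_ops": threading.BoundedSemaphore(2),   # reserved capacity
}

def run_in_pool(pool_name, fn):
    sem = POOLS[pool_name]
    if not sem.acquire(blocking=False):
        # Shed load immediately rather than queueing behind a slow pool.
        raise RuntimeError(f"{pool_name} pool exhausted")
    try:
        return fn()
    finally:
        sem.release()
```

The key property: a `RuntimeError` from one exhausted pool is a local event; calls routed to every other pool proceed untouched.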
Pattern 3: Eventual consistency with optimistic UI
For better UX:
- Show optimistic results immediately
- Validate and correct in background
- Handle conflicts gracefully
- Communicate corrections clearly to users
Pattern 4: Shadow mode for new components
Before production rollout:
- Run new components in shadow mode
- Compare outputs to production system
- Measure accuracy and latency in real conditions
- Gradually shift traffic as confidence grows
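The shadow-mode loop can be sketched as a request handler that runs both models on live inputs but only ever serves the production result; the model callables and log format here are illustrative:

```python
def handle_request(inputs, prod_model, shadow_model, log):
    prod_out = prod_model(inputs)
    try:
        shadow_out = shadow_model(inputs)
        log.append({"match": shadow_out == prod_out,
                    "prod": prod_out, "shadow": shadow_out})
    except Exception as e:
        # A crashing shadow component is logged, never surfaced to users.
        log.append({"match": False, "error": repr(e)})
    return prod_out          # users always get the production result

log = []
out = handle_request({"frame": 1},
                     prod_model=lambda x: "route_A",
                     shadow_model=lambda x: "route_A",
                     log=log)

# Gate the traffic shift on the observed agreement rate across real requests.
agreement = sum(e["match"] for e in log) / len(log)
```

Latency of the shadow path gets measured the same way, so both accuracy and speed are known under real conditions before any traffic shifts.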
Key takeaways
- Interface contracts prevent integration failures: Define, validate, version
- Orchestration is a first-class concern: Not just “glue code”
- Optimize for end-to-end latency: Component latency is misleading
- Monitor user journeys, not just components: The user doesn’t care which component failed
- Design for graceful degradation: Partial function beats total failure