Case Notes: Viroom Interior Fitting Room
Hybrid AI experiences — combining computer vision, language models, and augmented reality — promise magical user interactions. Delivering that magic reliably in production is where the engineering challenge lies.
Viroom’s interior fitting room lets users visualize furniture in their space before purchasing. This post shares lessons from building a system where multiple AI components must work together seamlessly. For the full case study, see Viroom Interior Fitting Room.
The experience challenge
Users expect instant, accurate results. They don’t care that:
- Room detection uses depth estimation and plane detection
- Furniture placement requires physics-aware positioning
- Style matching involves visual similarity models
- Natural language queries need intent classification
They just want to see how that sofa looks in their living room. Now.
What we learned
Lesson 1: Perception quality drives everything
The entire experience depends on accurate room understanding:
- Plane detection: Identifying floors, walls, and surfaces
- Dimension estimation: Understanding real-world scale
- Lighting analysis: Matching furniture rendering to ambient light
- Occlusion handling: Knowing what’s in front of what
When perception fails, everything downstream fails. We invested heavily in:
- Multi-frame consensus for stable plane detection (sketched after this list)
- User-guided calibration for ambiguous spaces
- Confidence scoring with fallback to manual placement
- Extensive testing across room types and lighting conditions
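To make the first two items concrete, here is a minimal sketch of multi-frame consensus with a confidence-scored fallback. The `PlaneEstimate` and `PlaneConsensus` names, the 2 cm agreement window, and the vote thresholds are illustrative assumptions, not our production values:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class PlaneEstimate:
    normal: tuple[float, float, float]  # unit plane normal
    height: float                       # offset along the normal, in meters
    confidence: float                   # per-frame detector confidence in [0, 1]

class PlaneConsensus:
    """Accept a plane only after it is stable across recent frames."""

    def __init__(self, window: int = 10, min_votes: int = 7,
                 min_confidence: float = 0.6):
        self.history: deque = deque(maxlen=window)
        self.min_votes = min_votes
        self.min_confidence = min_confidence

    def update(self, estimate: PlaneEstimate) -> PlaneEstimate | None:
        """Return a stable plane, or None to trigger the manual fallback."""
        self.history.append(estimate)
        # Frames whose plane height agrees with the newest estimate within 2 cm.
        votes = [e for e in self.history
                 if abs(e.height - estimate.height) < 0.02
                 and e.confidence >= self.min_confidence]
        if len(votes) < self.min_votes:
            return None  # not stable yet: keep the last good plane or go manual
        # Average the agreeing frames: steadier than any single detection.
        n = len(votes)
        return PlaneEstimate(
            normal=estimate.normal,
            height=sum(e.height for e in votes) / n,
            confidence=sum(e.confidence for e in votes) / n,
        )
```

Returning `None` rather than a shaky estimate is what lets the UI route to manual placement instead of jittering furniture in front of the user.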
Lesson 2: Interactive latency is brutally strict
Users manipulating furniture in AR expect sub-100ms response times. This constrained our architecture:
| Operation | Target Latency | Approach |
|---|---|---|
| Furniture drag | Under 50ms | Local computation only |
| Physics settling | Under 100ms | Pre-computed constraints |
| Style search | Under 500ms | Pre-indexed embeddings |
| LLM suggestions | Under 2s | Streaming with placeholder UI |
We separated interactive (local) from generative (cloud) operations. Interactive operations never wait on the network.
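A sketch of how that split can look at the dispatch layer; the operation names, the `asyncio`-based cloud call, and the 2s budget are illustrative assumptions:

```python
import asyncio

# Interactive operations resolve on-device; generative ones go to the cloud
# behind a placeholder UI with a hard timeout.
LOCAL_OPS = {"furniture_drag", "physics_settle"}

def run_on_device(op: str, payload: dict) -> dict:
    # Stand-in for local computation (pose math, pre-computed constraints).
    return {"op": op, "source": "device", **payload}

async def call_cloud(op: str, payload: dict) -> dict:
    await asyncio.sleep(0.3)  # stand-in for a network round-trip
    return {"op": op, "source": "cloud", **payload}

async def handle(op: str, payload: dict) -> dict:
    if op in LOCAL_OPS:
        return run_on_device(op, payload)  # interactive: never waits on network
    try:
        # Generative: bounded wait; the placeholder UI covers the gap.
        return await asyncio.wait_for(call_cloud(op, payload), timeout=2.0)
    except asyncio.TimeoutError:
        return {"op": op, "source": "fallback"}  # degrade, stay responsive

print(asyncio.run(handle("furniture_drag", {"x": 1.0, "y": 2.0})))
```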
Lesson 3: LLM integration requires guardrails
Adding natural language interaction (“show me something more modern”) introduced new failure modes:
- Hallucinated products: The LLM suggesting items that don’t exist in the catalog
- Inconsistent style interpretation: “Modern” means different things to different users
- Conversation drift: Users wandering off-topic mid-session
- Adversarial prompts: Attempts to manipulate system behavior
We implemented:
- Constrained generation: LLM outputs must map to real catalog items (see the sketch after this list)
- Style taxonomy: Defined vocabulary with visual examples
- Conversation reset: Clear boundaries on interaction scope
- Input filtering: Blocking problematic prompt patterns
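As a sketch of the first two guardrails, with a toy catalog and taxonomy standing in for the real ones, the validation step might look like this:

```python
# Toy stand-ins for the product catalog and the closed style vocabulary.
CATALOG = {"sku-1021": "3-seat sofa", "sku-2210": "armchair"}
STYLE_TAXONOMY = {"modern", "traditional", "minimalist"}

def validate_llm_response(suggested_skus: list[str],
                          style_tags: list[str]) -> tuple[list[str], list[str]]:
    """Drop hallucinated SKUs and style terms outside the taxonomy."""
    real_skus = [sku for sku in suggested_skus if sku in CATALOG]
    known_styles = [s for s in style_tags if s.lower() in STYLE_TAXONOMY]
    return real_skus, known_styles

skus, styles = validate_llm_response(["sku-1021", "sku-9999"], ["Modern", "cozy"])
assert skus == ["sku-1021"] and styles == ["Modern"]
```

The key property is that nothing generated reaches the UI unless it resolves to a real catalog entry.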
Lesson 4: Graceful degradation keeps users engaged
When AI components fail:
- CV failure: Fall back to manual furniture placement
- Recommendation failure: Show popular items in category
- LLM failure: Offer structured search interface
- Network failure: Local-first with sync when available
Users rarely noticed degraded modes because the alternatives were well-designed; the underlying pattern is sketched below.
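The pattern behind all four is an ordered fallback chain that swallows component failures and always returns something usable; a minimal sketch, with hypothetical strategies:

```python
from typing import Callable, Optional

def with_fallbacks(*strategies: Callable[[], Optional[dict]]) -> dict:
    """Try each strategy in order; the first usable result wins."""
    for strategy in strategies:
        try:
            result = strategy()
            if result is not None:
                return result
        except Exception:
            continue  # a failing component never surfaces as an error screen
    return {"mode": "manual_placement"}  # last resort, always available

result = with_fallbacks(
    lambda: None,                             # stand-in: CV placement failed
    lambda: {"mode": "popular_in_category"},  # recommendation fallback
)
assert result["mode"] == "popular_in_category"
```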
Technical architecture
Real-time CV pipeline
For interactive AR:
- On-device depth estimation (not cloud)
- Frame-by-frame plane tracking with Kalman filtering (sketched after this list)
- Lazy re-computation only when camera moves significantly
- Level-of-detail rendering based on device capability
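For the Kalman filtering step, even a scalar filter per tracked quantity (here, a plane’s height) smooths frame-to-frame jitter; this is a simplified sketch assuming a static-state model, not the full multi-state tracker:

```python
class ScalarKalman:
    """1-D Kalman filter, e.g. for smoothing a tracked plane's height."""

    def __init__(self, process_var: float = 1e-4, measurement_var: float = 1e-2):
        self.x = 0.0   # state estimate (meters)
        self.p = 1.0   # estimate variance
        self.q = process_var
        self.r = measurement_var
        self.initialized = False

    def update(self, z: float) -> float:
        if not self.initialized:
            self.x, self.initialized = z, True
            return self.x
        # Predict: the plane is assumed static, so only variance grows.
        self.p += self.q
        # Update: blend the prediction with the new measurement.
        k = self.p / (self.p + self.r)  # Kalman gain
        self.x += k * (z - self.x)
        self.p *= 1.0 - k
        return self.x

f = ScalarKalman()
for z in (0.021, 0.018, 0.024, 0.019):  # noisy floor-height readings
    smoothed = f.update(z)
```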
Hybrid recommendation engine
Combining multiple signals:
- Visual similarity: Embedding-based nearest neighbor search
- Style matching: Categorical filters (modern, traditional, minimalist)
- Context awareness: Room type, existing detected furniture
- Collaborative filtering: What similar users chose
Results are blended with learned weights, updated weekly.
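In sketch form, with hypothetical signal names and toy weights, the blend is a weighted sum over per-item scores from each signal:

```python
import numpy as np

# Weights learned offline and refreshed weekly; values here are toy examples.
WEIGHTS = {"visual": 0.4, "style": 0.2, "context": 0.2, "collab": 0.2}

def blend(scores: dict[str, np.ndarray]) -> np.ndarray:
    """scores maps signal name -> per-item scores; returns indices, best first."""
    total = sum(WEIGHTS[name] * s for name, s in scores.items())
    return np.argsort(total)[::-1]

ranking = blend({
    "visual":  np.array([0.9, 0.2, 0.5]),
    "style":   np.array([0.8, 0.9, 0.1]),
    "context": np.array([0.5, 0.5, 0.5]),
    "collab":  np.array([0.1, 0.8, 0.9]),
})
print(ranking)  # [0 1 2] for this toy input
```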
Conversational layer
LLM integration architecture:
- Intent classification (search, refine, compare, question)
- Slot filling for structured queries (style, color, size, price)
- Retrieval-augmented generation for product knowledge
- Response validation against the catalog before display (see the sketch below)
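A sketch of the slot-filling and validation steps, built around a hypothetical `FurnitureQuery` schema; anything the LLM produces that does not map into this structure is rejected rather than displayed:

```python
from dataclasses import dataclass

VALID_INTENTS = {"search", "refine", "compare", "question"}

@dataclass
class FurnitureQuery:
    intent: str                  # one of VALID_INTENTS
    style: str | None = None     # must come from the style taxonomy
    color: str | None = None
    max_price: float | None = None

def parse_llm_slots(raw: dict) -> FurnitureQuery | None:
    """Validate LLM slot-filling output before it drives a catalog query."""
    if raw.get("intent") not in VALID_INTENTS:
        return None  # fall back to the structured search interface
    price = raw.get("max_price")
    return FurnitureQuery(
        intent=raw["intent"],
        style=raw.get("style"),
        color=raw.get("color"),
        max_price=float(price) if price is not None else None,
    )

query = parse_llm_slots({"intent": "refine", "style": "modern", "max_price": "800"})
assert query is not None and query.max_price == 800.0
```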
Metrics snapshot
Typical performance for production interior AR:
| Metric | Range |
|---|---|
| CV inference latency | 100–300ms |
| Interactive response time | Under 100ms |
| Hybrid pipeline uptime | 95–99% |
| User satisfaction (post-session) | 4.2–4.5 / 5 |
| Conversion lift vs. static images | 15–25% |
Key takeaways
- Production quality comes from system design: Individual model accuracy is necessary but not sufficient
- Interactive latency shapes architecture: You can’t optimize your way out of network round-trips
- LLMs need strong guardrails: Hallucinations erode trust quickly
- Graceful degradation is a feature: Design alternatives, not just error states
- User perception of quality > technical accuracy: 80% correct instantly beats 95% correct with 2s delay