Viroom Interior Fitting Room
At a glance
- Industry: Retail / Interior Design
- Focus: Computer vision, LLM integration, real-time AR experience
- Goal: Reliable hybrid AI experience for virtual interior fitting
- Duration: 6 months from concept to production
Context
Viroom was designed to let users visualize furniture and decor in their own spaces using their smartphone camera. The system needed to combine real-time computer vision for room understanding with product catalog matching and a responsive user interface.
The challenge was not just technical accuracy; it was creating an experience that felt instant and trustworthy. Users expect AR to “just work.” Any lag, jitter, or obvious error breaks the illusion and erodes user trust.
Users don’t care about your inference latency. They care that the chair looks right in their living room.
Challenge
Primary objective: Deliver a reliable hybrid AI experience that feels instant while handling complex CV and catalog matching behind the scenes.
Key constraints:
- Inference latency under 300 ms for interactive responsiveness
- 95%+ uptime for the hybrid pipeline
- Graceful handling of edge cases (poor lighting, unusual rooms, occluded objects)
- Smooth user experience across device capabilities
Technical Approach
Room Understanding Pipeline
The computer vision pipeline processed camera frames in real time:
- Plane detection: Floor, walls, ceiling surfaces
- Depth estimation: Relative distances for proper object scaling
- Lighting analysis: Ambient light direction for realistic shadows
- Occlusion handling: Understanding what’s in front of what
We used a multi-stage approach with early exits for simple scenes. Not every frame needs full processing — when the camera is stable, we can reuse previous results.
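As a minimal sketch of that early-exit idea (the names `EarlyExitPipeline`, `FrameResult`, and the 0.02 motion threshold are illustrative assumptions, not the production API): a cheap motion estimate runs on every frame, and the expensive multi-stage CV runs only when the scene actually changes.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class FrameResult:
    planes: list        # detected floor/wall/ceiling surfaces
    depth: object       # relative depth estimate
    lighting: object    # ambient light estimate

MOTION_THRESHOLD = 0.02  # assumed tuning value: normalized inter-frame motion

class EarlyExitPipeline:
    def __init__(self,
                 full_pipeline: Callable[[object], FrameResult],
                 motion_estimator: Callable[[object], float]):
        self.full_pipeline = full_pipeline        # expensive multi-stage CV
        self.motion_estimator = motion_estimator  # cheap check, e.g. sparse flow
        self.cached: Optional[FrameResult] = None

    def process(self, frame) -> FrameResult:
        # Run the cheap motion check on every frame.
        if self.cached is not None and self.motion_estimator(frame) < MOTION_THRESHOLD:
            return self.cached  # stable scene: reuse previous room understanding
        # Scene changed (or first frame): pay for the full pipeline, refresh cache.
        self.cached = self.full_pipeline(frame)
        return self.cached
```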
Object Placement Engine
Once room geometry was understood, the placement engine handled:
- Anchor points: Stable positions for virtual objects
- Scale matching: Products sized correctly for the room
- Collision avoidance: Objects don’t clip through walls or furniture
- Shadow rendering: Soft shadows that match ambient lighting
The placement engine was designed for stability. Small camera movements shouldn’t cause objects to jump or jitter.
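One common way to achieve that stability is a dead zone plus exponential smoothing on anchor positions. The sketch below is an illustration of that idea, not the engine's actual code; `StableAnchor`, `Vec3`, and every threshold value are assumptions.

```python
import math
from dataclasses import dataclass

@dataclass
class Vec3:
    x: float
    y: float
    z: float

    def dist(self, o: "Vec3") -> float:
        return math.sqrt((self.x - o.x)**2 + (self.y - o.y)**2 + (self.z - o.z)**2)

    def lerp(self, o: "Vec3", t: float) -> "Vec3":
        return Vec3(self.x + (o.x - self.x) * t,
                    self.y + (o.y - self.y) * t,
                    self.z + (o.z - self.z) * t)

DEAD_ZONE_M = 0.01  # ignore corrections under 1 cm: jitter is worse than imprecision
SNAP_M = 0.25       # above 25 cm, trust the new estimate outright (tracking reset)
ALPHA = 0.15        # smoothing factor for corrections in between

class StableAnchor:
    def __init__(self, position: Vec3):
        self.position = position

    def update(self, measured: Vec3) -> Vec3:
        error = self.position.dist(measured)
        if error < DEAD_ZONE_M:
            return self.position              # hold still: stability over precision
        if error > SNAP_M:
            self.position = measured          # large error: snap, don't drift slowly
        else:
            self.position = self.position.lerp(measured, ALPHA)
        return self.position
```

The dead zone encodes the stability-over-precision trade-off directly: corrections smaller than the threshold are ignored, because holding still reads as more trustworthy than chasing every measurement.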
Catalog Integration
Product matching connected the CV output to the product catalog:
- Style matching: Suggest products that fit the room aesthetic
- Size filtering: Only show products that physically fit
- Availability: Real-time inventory status
- Personalization: User preference learning over time
We used a lightweight embedding model for style matching, optimized for inference speed rather than maximum accuracy.
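The sketch below shows one plausible wiring for that step, assuming precomputed, L2-normalized product embeddings so that cosine similarity reduces to a dot product; `Product` and its fields are hypothetical, and numpy stands in for the lightweight model's output.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Product:
    name: str
    size: tuple[float, float, float]  # width, depth, height in meters
    style_vec: np.ndarray             # precomputed, L2-normalized style embedding

def match_products(room_style: np.ndarray,
                   free_space: tuple[float, float, float],
                   catalog: list[Product],
                   top_k: int = 5) -> list[Product]:
    # Hard filter first: never suggest something that can't physically fit.
    fits = [p for p in catalog if all(s <= f for s, f in zip(p.size, free_space))]
    if not fits:
        return []
    # On normalized vectors, cosine similarity is just a dot product, which
    # keeps ranking cheap enough for interactive latency budgets.
    query = room_style / np.linalg.norm(room_style)
    scores = np.array([float(query @ p.style_vec) for p in fits])
    order = np.argsort(-scores)[:top_k]
    return [fits[i] for i in order]
```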
Orchestration Layer
The orchestration layer coordinated all components:
- Frame scheduling: Prioritize processing for visible areas (see the sketch after this list)
- Resource management: Balance CPU/GPU across components
- Fallback handling: Graceful degradation under device limitations
- State management: Consistent experience across app lifecycle
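As a hedged illustration of the frame-scheduling piece, the sketch below assumes a priority queue keyed on visibility and a fixed per-frame time budget; `Task`, `FrameScheduler`, the 16 ms default, and the usage callables are all assumptions, not the production orchestrator.

```python
import heapq
import time
from dataclasses import dataclass, field
from typing import Callable

@dataclass(order=True)
class Task:
    priority: int                                  # lower value = more urgent
    run: Callable[[], None] = field(compare=False)

class FrameScheduler:
    def __init__(self, budget_ms: float = 16.0):   # roughly one frame at 60 fps
        self.budget_ms = budget_ms
        self.queue: list[Task] = []

    def submit(self, task: Task) -> None:
        heapq.heappush(self.queue, task)

    def run_frame(self) -> None:
        start = time.perf_counter()
        while self.queue:
            if (time.perf_counter() - start) * 1000.0 >= self.budget_ms:
                break  # budget spent: stop processing this frame
            heapq.heappop(self.queue).run()
        # Leftover work is stale by the next frame, so it is dropped; this is
        # the "occasional missed updates" cost noted in the trade-offs below.
        self.queue.clear()

# Usage sketch (hypothetical callables): visible-area work outranks background
# refreshes, so under pressure the visible scene stays responsive.
# scheduler = FrameScheduler()
# scheduler.submit(Task(0, run=update_visible_planes))
# scheduler.submit(Task(1, run=refresh_offscreen_depth))
# scheduler.run_frame()
```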
Trade-offs
| Decision | Trade-off |
|---|---|
| Stability over precision | Objects stay put even if placement isn’t perfect |
| Early exits | Faster response at cost of occasional missed updates |
| Lightweight models | Lower accuracy for faster inference |
| Conservative occlusion | Some visible clipping to avoid false occlusions |
- User experience over perfect accuracy. A stable, slightly imperfect placement is better than a jittery, technically correct one. Users need to trust what they see.
- Operational stability over complexity. Fewer moving parts means fewer failure modes. We resisted adding features that would complicate the critical path.
- Device constraints as first-class requirements. The system had to work on mid-range phones, not just flagship devices. This drove architecture decisions from day one.
Results
| Metric | Outcome |
|---|---|
| Inference latency | 100–300 ms for interactive CV |
| Pipeline uptime | 95–99% stability |
| User experience | Smooth, stable object placement |
| Device coverage | 85% of target device range |
| Session completion | 40% increase in completed fitting sessions |
Stack
- CV Pipeline: Plane detection, depth estimation, lighting analysis
- Placement Engine: Anchor management, collision detection, shadow rendering
- Orchestration: Frame scheduling, resource management, state handling
- Monitoring: Latency tracking, error rates, device performance profiles
Key Learnings
- Production quality comes from system design, not just model selection. The best model in the world won’t save a poorly orchestrated system.
- Hybrid AI needs orchestration and observability to stay stable. When multiple models interact, you need to see what’s happening at every stage.
- User perception matters more than technical metrics. A 200 ms latency that feels smooth beats a 150 ms latency that feels jerky.
- Design for the median device, not the best device. Most users don’t have flagship phones. The system needs to work for everyone.
Architecture Highlights
The system was designed around three principles:
1. Temporal stability
- Results should be consistent across frames
- Small input changes shouldn’t cause large output changes
- Jitter is worse than imprecision
2. Graceful degradation
- If one component fails, others should continue
- Lower-quality fallbacks are better than no results (see the sketch below)
- The user should never see a blank screen
3. Observable behavior
- Every decision should be traceable
- Performance metrics available in real time
- Anomalies detected and alerted automatically
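As a closing illustration of the graceful-degradation principle, here is a minimal fallback-chain sketch. `with_fallbacks` and the tier names in the usage comment are hypothetical, but the shape matches the behavior described above: try the best component, fall through to cheaper ones, and always return something renderable.

```python
from typing import Callable, Optional

def with_fallbacks(tiers: list[Callable[[], Optional[object]]],
                   last_resort: object) -> object:
    """Run quality tiers in order; the first non-None result wins."""
    for tier in tiers:
        try:
            result = tier()
            if result is not None:
                return result
        except Exception:
            continue        # a failing component must not take down the frame
    return last_resort      # e.g. last known-good placement: never a blank screen

# Usage sketch (hypothetical tiers):
# placement = with_fallbacks(
#     [run_full_depth_placement, run_plane_only_placement],
#     last_resort=cached_placement,
# )
```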