Context
When descriptions run out
Computer Vision is not needed where an object can be found by fixed coordinates or a template. It appears where visual reality is too diverse to describe with rules.
Barcode reading
Position is predictable
Contrast is always high
→ Structure is standard. CV is overkill here.
Defects on a production line
IF scratch > 2mm THEN defect?
...but conveyor shadow looks similar
...but some textures are acceptable
...thousands of variants, each new one
→ Rules won't cover the diversity of real defects.
People counting in a flow
IF vertical silhouette THEN person?
...but groups merge together
...what about a stroller or umbrella?
...occlusions, shadows, angles
→ You cannot describe the visual chaos of the real world with rules.
01 / Foundation
Two approaches to image analysis
Classical image processing works with pixels directly. Computer Vision learns to "see" from examples — and finds on its own what to pay attention to.
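To make the contrast concrete, here is a minimal sketch of the classical side: a hand-written pixel rule applied to a toy grayscale patch. The numbers and thresholds are invented for illustration; a learned model would replace this rule with features it discovers from labeled examples.

```python
# Classical approach: a fixed, hand-written rule for "is there a dark spot?"
# Hypothetical 5x5 grayscale patch (0 = black, 255 = white).
patch = [
    [250, 248, 251, 249, 250],
    [247,  60,  55, 250, 248],
    [249,  58,  52, 251, 250],
    [250, 249, 248, 250, 249],
    [251, 250, 249, 248, 250],
]

def has_dark_spot(img, threshold=100, min_pixels=3):
    """Flag the patch if enough pixels fall below a brightness threshold."""
    dark = sum(1 for row in img for px in row if px < threshold)
    return dark >= min_pixels

print(has_dark_spot(patch))  # True
```

The rule works until a conveyor shadow also dips below the threshold; that fragility is exactly where the learned approach takes over.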
02 / Mechanics
What the neural network sees
A model doesn't "recognize" an object as a whole. It breaks down the image layer by layer — from simple edges to complex shapes — extracting increasingly abstract features at each level.
Layers of a convolutional neural network (CNN)
Each layer "sees" the image at a different level of abstraction. Early layers catch basic patterns; deep layers — entire objects.
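As an illustration of what an early layer computes, here is a single convolution with a Sobel-style vertical-edge kernel in plain Python. The 5×5 image is a toy example; this is a sketch of one filter response, not a CNN implementation.

```python
image = [  # dark left half, bright right half: one vertical edge
    [0, 0, 255, 255, 255],
    [0, 0, 255, 255, 255],
    [0, 0, 255, 255, 255],
    [0, 0, 255, 255, 255],
    [0, 0, 255, 255, 255],
]
kernel = [  # Sobel-like vertical edge detector
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

def convolve(img, k):
    """Valid (no-padding) 3x3 convolution over a 2D list."""
    n, m = len(img), len(img[0])
    out = []
    for i in range(n - 2):
        row = []
        for j in range(m - 2):
            s = sum(img[i + di][j + dj] * k[di][dj]
                    for di in range(3) for dj in range(3))
            row.append(s)
        out.append(row)
    return out

response = convolve(image, kernel)
# The filter fires strongly where brightness changes and stays at zero
# over flat regions: response[0] == [1020, 1020, 0]
```

In a real CNN the kernel weights are not hand-picked like this; they are learned, and deeper layers combine such edge responses into textures, parts, and whole objects.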
Six core CV tasks
CV is not one technology — it is a family of tasks. Each answers a different question: "what?", "where?", "what shape?", "how is it positioned?", "what does it say?", "where is it moving?"
Classification
What is in the image? Answer — one label: "cat", "defect", "document type A".
Detection
Where are the objects? Draws a bounding box around each found object + class + confidence score.
Segmentation
What is the exact shape? Each pixel gets its own class — object contours are determined to the pixel.
Pose estimation
How is the body positioned? Skeleton key points — for motion analysis, physiotherapy, sports analytics.
OCR / text reading
What does it say? Text extraction from photos of documents, signs, handwriting — even angled and distorted.
Tracking
Where is the object moving? Following one or multiple objects across video frames over time.
Detection output is always probabilistic
Like any ML model, a CV model doesn't say "that's definitely a person." It outputs a confidence score for each found object. The cutoff threshold determines what to show and what to treat as noise.
Below — a simulator: adjust the confidence threshold to see how detections appear and disappear. Low threshold → more objects, but also false positives. High threshold → only confident detections, but misses grow.
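The simulator's logic fits in a few lines. The detections and scores below are invented for illustration:

```python
# Hypothetical raw detections: (label, confidence score).
# The threshold decides what the system reports and what it discards as noise.
detections = [
    ("person", 0.92),
    ("person", 0.61),
    ("dog",    0.34),   # likely a false positive
    ("person", 0.18),   # likely noise
]

def filter_by_confidence(dets, threshold):
    """Keep only detections at or above the confidence threshold."""
    return [d for d in dets if d[1] >= threshold]

print(len(filter_by_confidence(detections, 0.3)))  # low threshold: 3 kept
print(len(filter_by_confidence(detections, 0.7)))  # high threshold: 1 kept
```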
How to measure accuracy? IoU — Intersection over Union
For detection it's not enough to know that an object was "found." You need to understand how precisely the model localized it. For this, IoU is used — a measure of overlap between the predicted and ground truth bounding boxes.
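For axis-aligned boxes in (x1, y1, x2, y2) form, IoU is simple to compute. A minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Half-overlapping 10x10 boxes: intersection 25, union 175
print(round(iou((0, 0, 10, 10), (5, 5, 15, 15)), 3))  # 0.143
```

A common convention (e.g. in the COCO benchmark) is to count a detection as correct only when its IoU with the ground truth exceeds a threshold such as 0.5.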
03 / Critical
Where CV "sees" the wrong thing
A neural network trains on data. If the data is biased, noisy, or unrepresentative of reality, the model will confidently make wrong decisions. This is directly connected to production ML failure modes we document in our engineering blog.
Classic example: tank or landscape?
An early experiment: a network was trained to classify photos with tanks vs without. Accuracy was high. But the model had learned something entirely different.
Expected: the network would learn the shapes and textures of military hardware.
Reality: tank photos were taken on overcast days, landscape photos on sunny days. The model learned the weather, not the object.
Typical pitfalls in CV projects
Lighting change
Model trained on daytime photos — breaks at night or under artificial light. Lighting changes everything: colors, shadows, contrast.
New camera angle
Camera was frontal during training — but in deployment it's overhead. Models don't automatically transfer knowledge to new angles.
Domain gap
Trained on city streets — deployed in an industrial yard. Different textures, scales, backgrounds. Accuracy can drop 30–40%. See the AgrigateVision case study for how we handled this.
Annotation errors
Annotators labeled objects imprecisely — the model trains on noise. Garbage in, garbage out — in CV this is especially critical.
The cost of different errors varies by business
A CV model, like any ML model, makes two types of errors. And you cannot minimize both simultaneously. The choice belongs to the business.
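A toy calculation makes the tradeoff visible. The counts below are invented validation figures, not real project numbers:

```python
# Two thresholds on the same hypothetical defect detector.
# Lowering the threshold trades missed defects (FN) for false alarms (FP).
def precision_recall(tp, fp, fn):
    """Precision: how many alarms are real. Recall: how many defects are caught."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Strict threshold: few false alarms, many misses
p1, r1 = precision_recall(tp=80, fp=5, fn=20)
# Loose threshold: catches almost everything, more false alarms
p2, r2 = precision_recall(tp=98, fp=40, fn=2)
print(f"strict: precision={p1:.2f} recall={r1:.2f}")  # 0.94 / 0.80
print(f"loose:  precision={p2:.2f} recall={r2:.2f}")  # 0.71 / 0.98
```

Which point on this curve is right depends on what an error costs: a missed defect that ships to a customer, or an operator chasing a false alarm.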
04 / In practice
How a CV project is structured
Training a model is 10% of the work. The other 90% is data collection, annotation, camera integration and monitoring in real conditions. We've applied this end-to-end in the AgrigateVision and RoomIQ projects.
Task framing
What exactly needs to be "seen"? Which object classes, under what conditions, at what speed, with what acceptable error rate.
⚠ 60% of failures start here
Data collection
Photos and video from real operating conditions. Different lighting, angles, scales, seasons. More diversity = more robust model.
💡 More critical than model architecture
Annotation (labeling)
Every object in every image is outlined with a box or mask. Manual work — expensive, slow, and error-prone.
⚠ Most expensive stage — up to 80% of data budget
Model training
Architecture selection (YOLO, EfficientDet, Mask R-CNN), augmentation, hyperparameters. Iterative experiments.
Validation and testing
mAP, IoU, precision/recall metrics on a held-out set. Edge case testing — rare but critical situations.
Optimization and deployment
Quantization, pruning, conversion to ONNX/TensorRT. Integration with cameras, edge devices, cloud. Latency, throughput.
⚠ Fast in the lab ≠ fast in production
Monitoring and fine-tuning
Conditions change: new products, seasonality, camera wear. Models degrade. Continuous quality control and periodic retraining are required. See our guide on data drift in production.
Where CV projects most often break
05 / For decision-makers
What every stakeholder needs to understand
Eight things that determine the difference between a working CV system and an expensive experiment. These apply equally to Applied AI and ML for business more broadly.
Data matters more than the model
The best architecture won't save you if data is scarce, homogeneous or poorly labeled. Invest in data first.
Annotation is expensive
One frame with segmentation = 5–30 minutes of manual work. For 10,000 frames — months of effort. Plan the annotation budget upfront.
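The arithmetic behind that estimate, using the figures above. The 160 working hours per annotator-month is our assumption for illustration:

```python
# Back-of-envelope annotation budget for segmentation labeling.
frames = 10_000
minutes_low, minutes_high = 5, 30  # per frame, from the range above

hours_low = frames * minutes_low / 60
hours_high = frames * minutes_high / 60

# Assumed: ~160 working hours per month per annotator.
months_low = hours_low / 160
months_high = hours_high / 160
print(f"{hours_low:.0f}-{hours_high:.0f} hours, "
      f"roughly {months_low:.0f}-{months_high:.0f} person-months")
```

Even at the optimistic end this is several person-months of work for a single annotator, before any quality-control passes.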
Lab ≠ production
mAP 95% on the test set can turn into 60% on-site. Different cameras, lighting, dirt on the lens, non-standard angles.
Edge cases decide everything
99% of frames the model handles fine. But the business breaks on the remaining 1% — the rare, atypical, critical situations.
The camera is part of the model
Resolution, focal length, fps, mounting angle — all affect accuracy. The camera is selected for the task, not the other way around.
Speed vs accuracy tradeoff
A model can be accurate but slow (Mask R-CNN on a server). Or fast but less accurate (YOLOv8 on edge). The tradeoff is unavoidable.
Models degrade over time
Seasons, equipment wear, new products, facility renovations — any environment change reduces accuracy. Monitoring is mandatory.
Signs of a successful project
Clear task definition · diverse data · quality annotation · pilot in real conditions · post-launch monitoring
06 / Diagnosis
Is CV right for your use case?
Before talking about neural networks and cameras — answer four questions. If even one answer is "no," a CV project is likely premature. This mirrors the broader ML readiness framework we apply across projects.
Ready to evaluate your CV opportunity?
We run a discovery phase to determine whether your use case has the data, conditions and business case for a successful CV project — and what the realistic timeline looks like.
Let's talk →