// Practical guide

Computer Vision —
it's not magic,
it's engineering

An honest explanation of how machine vision actually works — from pixels to business decisions. What a product owner needs to know to keep a CV project from becoming an expensive experiment with no outcome.

80%
of the world's data is visual
Sees ≠ understands
Accuracy isn't the metric — business outcome is

When descriptions run out

Computer Vision is not needed where an object can be found by fixed coordinates or a template. It appears where visual reality is too diverse to describe with rules.

Rules work

Barcode reading

Format is strictly defined
Position is predictable
Contrast is always high

→ Structure is standard. CV is overkill here.

Rules break

Defects on a production line

IF dark spot
THEN defect?
...but conveyor shadow looks similar
IF scratch > 2mm
THEN defect?
...but some textures are acceptable
...thousands of variants, each new one

→ Rules won't cover the diversity of real defects.

Hidden patterns

People counting in a flow

IF motion in zone
THEN +1 person?
...but groups merge together
IF vertical silhouette
THEN person?
...what about a stroller or umbrella?
...occlusions, shadows, angles

→ You cannot describe the visual chaos of the real world with rules.

CV is needed not because a task is "visual," but because visual data fundamentally cannot be described with IF-ELSE chains — objects change shape, lighting and angle, and occlude each other.

Two approaches to image analysis

Classical image processing works with pixels directly. Computer Vision learns to "see" from examples — and finds on its own what to pay attention to.

Classical pipeline: INPUT (image) → FILTERS (manual rules) → OUTPUT (result). A human manually sets the filters, thresholds and contours.
Works when conditions are stable: fixed lighting, uniform background, standard objects. Barcode reading, QR codes, OCR of printed text.
Learned pipeline: THOUSANDS of images → LABELS (annotations) → TRAINING (neural net) → LEARNED detection. The network finds on its own which features matter for recognition.
Essential with variable conditions: changing lighting, angles, backgrounds. Object detection, face recognition, defect detection, segmentation, pose estimation.
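To make the contrast concrete, here is a minimal sketch of a "classical" rule from the defect example above — a hand-picked brightness threshold. The function name, threshold value and synthetic frames are all hypothetical, for illustration only:

```python
import numpy as np

def classical_dark_spot_detector(image: np.ndarray, threshold: int = 60) -> bool:
    """Classical pipeline: a single human-chosen, fixed rule.

    Flags the frame as 'defect' if any pixel is darker than `threshold`.
    Works only while lighting and background stay stable.
    """
    return bool((image < threshold).any())

# A bright, clean frame: no pixel below the threshold -> no defect
clean = np.full((8, 8), 200, dtype=np.uint8)

# The same frame with one genuine dark spot
spotted = clean.copy()
spotted[3, 4] = 30

print(classical_dark_spot_detector(clean))     # False
print(classical_dark_spot_detector(spotted))   # True
# ...but a conveyor shadow that darkens the whole frame trips the same rule:
shadowed = (clean * 0.25).astype(np.uint8)     # every pixel ~50
print(classical_dark_spot_detector(shadowed))  # True (false positive)
```

This is exactly the "IF dark spot THEN defect?" failure from the earlier section: the rule cannot tell a defect from a shadow, which is where the learned pipeline takes over.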

What the neural network sees

A model doesn't "recognize" an object as a whole. It breaks down the image layer by layer — from simple edges to complex shapes — extracting increasingly abstract features at each level.

Layers of a convolutional neural network (CNN)

Each layer "sees" the image at a different level of abstraction. Early layers catch basic patterns; deep layers — entire objects.

🖼️ INPUT — pixels, 224×224×3
LAYER 1–2 — edges: lines, gradients, contours
LAYER 3–5 — textures: patterns, repetitions
LAYER 6–10 — parts: wheels, eyes, windows, paws
DEEP — objects: cars, people, animals
OUTPUT — decision: class + bbox + confidence
From pixels → to abstraction → to decision.
The key difference from classical processing: humans don't specify which features to look at. The neural network chooses on its own — from edges to semantic parts. That's why CNNs work where manual filters fail.
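What an early-layer filter actually computes can be shown in a few lines. This is an illustrative sketch, not a real trained network: a naive 2-D convolution with a hand-written vertical-edge kernel of the kind layers 1–2 typically learn on their own:

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive 2-D convolution (valid padding) — what one CNN filter computes."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

# A vertical-edge kernel, similar to filters early CNN layers learn
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]])

# Synthetic image: dark left half, bright right half -> one vertical edge
img = np.hstack([np.zeros((5, 3)), np.ones((5, 3))])
response = conv2d(img, vertical_edge)
print(response)  # nonzero only at the edge columns, zero on flat regions
```

A trained CNN stacks hundreds of such filters, and crucially learns the kernel values itself instead of having a human write them.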

Four core CV tasks

CV is not one technology — it is a family of tasks. Each answers a different question: "what?", "where?", "how many?", "what shape?"

🏷️

Classification

What is in the image? Answer — one label: "cat", "defect", "document type A".

📦

Detection

Where are the objects? Draws a bounding box around each found object + class + confidence score.

✂️

Segmentation

What is the exact shape? Each pixel gets its own class — object contours are determined to the pixel.

🦴

Pose estimation

How is the body positioned? Skeleton key points — for motion analysis, physiotherapy, sports analytics.

🔤

OCR / text reading

What does it say? Text extraction from photos of documents, signs, handwriting — even angled and distorted.

🎯

Tracking

Where is the object moving? Following one or multiple objects across video frames over time.

Detection output is always a probability

Like any ML model, a CV model doesn't say "that's definitely a person." It outputs a confidence score for each found object. The cutoff threshold determines what to show and what to treat as noise.

Below — a simulator: adjust the confidence threshold to see how detections appear and disappear. Low threshold → more objects, but also false positives. High threshold → only confident detections, but misses grow.

// simulator: confidence threshold
Example detections in one frame, with confidence scores:
🚶 PERSON 94% · 🚗 CAR 89% · 🎒 BACKPACK 72% · 🛑 SIGN 61% · 🚶‍♀️ PERSON 45% · 🐕 DOG? 28%
At a 50% threshold, 4 objects are detected. Lower the threshold and possible false positives appear; raise it and possible misses grow.
Low threshold — more objects found, but noise among them. High threshold — only confident results, but some real objects are missed. There is no perfect threshold — the business sets it, not the model.
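The thresholding logic itself is trivial — the hard part is choosing the number. A sketch using the hypothetical detections from the simulator above:

```python
# Hypothetical detections from one frame: (label, confidence)
detections = [("person", 0.94), ("car", 0.89), ("backpack", 0.72),
              ("sign", 0.61), ("person", 0.45), ("dog", 0.28)]

def filter_by_confidence(dets, threshold):
    """Keep only detections at or above the confidence threshold."""
    return [(label, conf) for label, conf in dets if conf >= threshold]

print(len(filter_by_confidence(detections, 0.50)))  # 4 - the simulator's default
print(len(filter_by_confidence(detections, 0.25)))  # 6 - everything, noise included
print(len(filter_by_confidence(detections, 0.90)))  # 1 - only the surest detection
```

The model outputs the same six scores regardless; only the business-chosen cutoff decides what counts as a detection.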

How to measure accuracy? IoU — Intersection over Union

For detection it's not enough to know that an object was "found." You need to understand how precisely the model localized it. For this, IoU is used — a measure of overlap between the predicted and ground truth bounding boxes.

// IoU = intersection area / union area
▬ ground truth vs ▬ model prediction: at 20% overlap the boxes barely agree; at 65% the localization is good. At IoU ≥ 50% a detection is typically counted as "correct" — the closer to 100%, the more precise the localization.
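The formula above translates directly into code. A self-contained sketch with boxes given as (x1, y1, x2, y2) corner coordinates — the example boxes are made up:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # 0 if boxes don't overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

ground_truth = (0, 0, 10, 10)
print(iou(ground_truth, (0, 0, 10, 10)))    # 1.0 - perfect localization
print(iou(ground_truth, (5, 0, 15, 10)))    # ~0.33 - below the 50% bar
print(iou(ground_truth, (20, 20, 30, 30)))  # 0.0 - complete miss
```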

Where CV "sees" the wrong thing

A neural network trains on data. If data is biased, noisy or doesn't represent reality — the model will confidently make wrong decisions. This is directly connected to production ML failure modes we document in our engineering blog.

The model doesn't "understand" the meaning of an image. It finds statistical patterns — and follows them, even when they are spurious.

Classic example: tank or landscape?

An early experiment: a network was trained to classify photos with tanks vs without. Accuracy was high. But the model had learned something entirely different.

🪖
tank
vs
🏞️
landscape
Model learns to distinguish:
vehicle contours camouflage tracks

We assumed the network would learn shapes and textures of military hardware.

☁️
overcast
=
🪖
tank?
Model learned:
overcast sky → tank

Tank photos were taken on overcast days, landscape photos on sunny days. The model learned the weather, not the object.

This is not an algorithm bug. The model found the shortest path to correct answers on the training set. The problem is in the data, not the architecture.

Typical pitfalls in CV projects

💡

Lighting change

Model trained on daytime photos — breaks at night or under artificial light. Lighting changes everything: colors, shadows, contrast.

📐

New camera angle

Camera was frontal during training — but in deployment it's overhead. Models don't automatically transfer knowledge to new angles.

🌧️

Domain gap

Trained on city streets — deployed in an industrial yard. Different textures, scales, backgrounds. Accuracy can drop 30–40%. See the AgrigateVision case study for how we handled this.

🏷️

Annotation errors

Annotators labeled objects imprecisely — the model trains on noise. Garbage in, garbage out — in CV this is especially critical.

The cost of different errors varies by business

A CV model, like any ML model, makes two types of errors. And you cannot minimize both simultaneously. The choice belongs to the business.

False Positive — false detection
Model "sees" an object that isn't there
Manufacturing / quality control
Normal part rejected
Discarded good product. Direct losses at scale.
expensive at volume
Video surveillance
False alarm on shadow or bird
Operator fatigue. They stop responding to real alerts.
tolerable individually, dangerous systemically
False Negative — missed object
Model fails to detect a real object
Manufacturing / quality control
Defective part passes inspection
Warranty claims, batch recalls, reputational damage.
critical — root cause of failures
Autonomous driving
Pedestrian not detected
A safety-of-life issue. Acceptable rate ≈ 0.
unacceptable
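The FP/FN tradeoff is what precision and recall measure: precision asks "how many detections were real?", recall asks "how many real objects were found?". A sketch with invented numbers for a quality-control line, showing how moving the threshold trades one error type for the other:

```python
def precision_recall(tp, fp, fn):
    """Precision: share of detections that were real.
    Recall: share of real objects that were found."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical QC line with 50 real defects in a batch.
# Low threshold: catches almost every defect, but rejects good parts too.
low = precision_recall(tp=48, fp=40, fn=2)
# High threshold: almost no good parts rejected, but defects slip through.
high = precision_recall(tp=30, fp=2, fn=20)
print(low)   # (~0.55, 0.96) - high recall, expensive false rejects
print(high)  # (~0.94, 0.60) - high precision, defects escape to customers
```

Neither setting is "more accurate" — they distribute the same model's errors differently, and only the cost structure of the business can rank them.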

How a CV project is structured

Training a model is 10% of the work. The other 90% is data collection, annotation, camera integration and monitoring in real conditions. We've applied this end-to-end in the AgrigateVision and RoomIQ projects.

01

Task framing

What exactly needs to be "seen"? Which object classes, under what conditions, at what speed, with what acceptable error rate.

⚠ 60% of failures start here
02

Data collection

Photos and video from real operating conditions. Different lighting, angles, scales, seasons. More diversity = more robust model.

💡 More critical than model architecture
03

Annotation (labeling)

Every object in every image is outlined with a box or mask. Manual work — expensive, slow, and error-prone.

⚠ Most expensive stage — up to 80% of data budget
04

Model training

Architecture selection (YOLO, EfficientDet, Mask R-CNN), augmentation, hyperparameters. Iterative experiments.

05

Validation and testing

mAP, IoU, precision/recall metrics on a held-out set. Edge case testing — rare but critical situations.

06

Optimization and deployment

Quantization, pruning, conversion to ONNX/TensorRT. Integration with cameras, edge devices, cloud. Latency, throughput.

⚠ Fast in the lab ≠ fast in production
07

Monitoring and fine-tuning

Conditions change: new products, seasonality, camera wear. Models degrade. Continuous quality control and periodic retraining are required. See our guide on data drift in production.
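One cheap monitoring signal is the model's own confidence over time: when the environment drifts away from the training data, average scores tend to sag before accuracy is measured. A minimal sketch — the function, threshold and score samples are all hypothetical:

```python
import statistics

def drift_alert(baseline_scores, recent_scores, max_drop=0.10):
    """Flag drift if mean confidence falls more than `max_drop` below baseline."""
    drop = statistics.mean(baseline_scores) - statistics.mean(recent_scores)
    return drop > max_drop

baseline = [0.91, 0.88, 0.93, 0.90]   # scores sampled at deployment
winter = [0.74, 0.70, 0.78, 0.72]     # same cameras, new season
print(drift_alert(baseline, winter))  # True - time to collect data and retrain
```

In production this would run over rolling windows and alongside spot-checked ground truth, but even this crude signal catches the "seasons, wear, new products" degradation described above.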

Where CV projects most often break

// % of failed projects where this factor was present
Insufficient or non-diverse training data 72%
Poor annotation quality (labeling errors) 63%
Lab-to-production domain gap 55%
Wrong task framing or metric choice 44%
Problems are cumulative — most failed projects contain several factors simultaneously. CV is particularly sensitive to the gap between training conditions and deployment conditions (domain gap).

What every stakeholder needs to understand

Seven things that determine the difference between a working CV system and an expensive experiment. These apply equally to Applied AI and ML for business more broadly.

01

Data matters more than the model

The best architecture won't save you if data is scarce, homogenous or poorly labeled. Invest in data first.

02

Annotation is expensive

One frame with segmentation = 5–30 minutes of manual work. For 10,000 frames — months of effort. Plan the annotation budget upfront.

03

Lab ≠ production

mAP 95% on the test set can turn into 60% on-site. Different cameras, lighting, dirt on the lens, non-standard angles.

04

Edge cases decide everything

99% of frames the model handles fine. But the business breaks on the remaining 1% — the rare, atypical, critical situations.

05

The camera is part of the model

Resolution, focal length, fps, mounting angle — all affect accuracy. The camera is selected for the task, not the other way around.

06

Speed vs accuracy tradeoff

A model can be accurate but slow (Mask R-CNN on a server). Or fast but less accurate (YOLOv8 on edge). The tradeoff is unavoidable.

07

Models degrade over time

Seasons, equipment wear, new products, facility renovations — any environment change reduces accuracy. Monitoring is mandatory.

Signs of a successful project

Clear task definition · diverse data · quality annotation · pilot in real conditions · post-launch monitoring

Is CV right for your use case?

Before talking about neural networks and cameras — answer four questions. If even one answer is "no," a CV project is likely premature. This mirrors the broader ML readiness framework we apply across projects.

Question 01
Is the task visual by nature?
Does the solution actually depend on what is "visible" in the image? Or can the problem be solved with structured data (tables, logs, sensors) without a camera?
Question 02
Do you have labeled data or the ability to obtain it?
Not just "cameras are recording." Specifically: are there thousands of frames with annotated objects (boxes, masks, classes)? Or a budget for annotation?
Question 03
Are operating conditions predictable?
Is the camera in a fixed location? Is lighting controlled? Or do conditions vary significantly — outdoors, weather, moving objects, night?
Question 04
Are model errors acceptable?
CV is a probabilistic system. It will sometimes be wrong. Does the process have a safety net — a human validator, double-check, or fallback?

Ready to evaluate your CV opportunity?

We run a discovery phase to determine whether your use case has the data, conditions and business case for a successful CV project — and what the realistic timeline looks like.

Let's talk →