Object Detection with YOLO

Introduction

Image classification answers a simple question: what is in this image?
Object detection answers a harder one: what objects are in this image, and where are they?

That means object detection must do two jobs at once:

Classify each object, such as person, car, dog, or bicycle
Localize each object by drawing a bounding box around it

This makes object detection much more challenging than image classification.
Instead of giving one label for a full image, the model must find multiple objects, separate them from the background, and place boxes in the correct locations.


From Classification to Detection

Before studying YOLO, it helps to separate three related tasks.

Image Classification

In image classification, the model looks at the whole image and predicts one main label.

Example:

A photo of a dog → dog
A photo of a car → car

This is the simplest task because the model does not need to say where the object is.

Classification

Localization

Localization adds one more job.
The model must predict:

What the object is
Where it is

So instead of only saying “dog,” it also draws a bounding box around the dog.

Localization

Object Detection

Object detection goes one step further.
Now the image may contain many objects, and the model must detect all of them.

Example:

One image may contain a dog, a bicycle, and two people
The model must output four bounding boxes
Each box must have a class label and a confidence score

Object Detection

This is why object detection is harder:

There may be multiple objects
Objects can overlap
Objects can be small or large
Some objects may appear only partly in the image
The background may be cluttered

A good mental model is this:

Classification = What is here?
Localization = What is here, and where?
Detection = What objects are here, where are they, and how many are there?

What is Object Detection?

Object detection is the task of finding objects in an image and predicting a bounding box and class label for each one.

A typical object detector outputs something like this:

Box 1: person, 0.96 confidence
Box 2: dog, 0.91 confidence
Box 3: bicycle, 0.88 confidence

Each prediction usually contains:

A bounding box
A class label
A confidence score

What is a Bounding Box?

A bounding box is a rectangle that surrounds an object.

It describes where the object is located in the image.
Boxes are commonly represented in one of two ways:

Corner format: \((x_1, y_1, x_2, y_2)\)
Center format: \((x, y, w, h)\)

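Where:

\((x_1, y_1)\) = one corner, typically the top-left
\((x_2, y_2)\) = the opposite corner, typically the bottom-right
\(x, y\) = box center
\(w\) = width
\(h\) = height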

The goal is not just to classify correctly, but to place the box as accurately as possible.
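The two formats carry the same information, so converting between them is simple arithmetic. The following is a minimal Python sketch; the function names `corners_to_center` and `center_to_corners` are illustrative, and it assumes \((x_1, y_1)\) is the top-left corner and \((x_2, y_2)\) the bottom-right corner in pixel coordinates.

```python
def corners_to_center(x1, y1, x2, y2):
    """Convert (x1, y1, x2, y2) corner format to (x, y, w, h) center format."""
    w = x2 - x1
    h = y2 - y1
    return (x1 + w / 2, y1 + h / 2, w, h)


def center_to_corners(x, y, w, h):
    """Convert (x, y, w, h) center format back to (x1, y1, x2, y2)."""
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


box = (10, 20, 50, 100)
center = corners_to_center(*box)
print(center)                      # (30.0, 60.0, 40, 80)
print(center_to_corners(*center))  # (10.0, 20.0, 50.0, 100.0)
```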

Bounding Box

The Old Approach: Sliding Windows

Before modern detectors like YOLO became popular, one common approach was the sliding windows algorithm.

What is the Sliding Windows Algorithm?

The idea is simple:

Take a small window
Place it over one part of the image
Run a classifier on that window
Move the window slightly
Repeat across the whole image

The classifier asks at each step:

“Does this small region contain an object?”
“If yes, what object is it?”

This process is repeated many times over many parts of the image.
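The loop described above can be sketched in a few lines of Python. This is only an illustration: `sliding_windows` and `classify_patch` are hypothetical names, the classifier is a stub that always answers "background", and the image size, window size, and stride are arbitrary example values.

```python
def sliding_windows(image_w, image_h, win_w, win_h, stride):
    """Yield (x1, y1, x2, y2) for every window position that fits in the image."""
    for y in range(0, image_h - win_h + 1, stride):
        for x in range(0, image_w - win_w + 1, stride):
            yield (x, y, x + win_w, y + win_h)


def classify_patch(window):
    """Stand-in for a real classifier; it always reports background here."""
    return "background"


windows = list(sliding_windows(640, 480, win_w=64, win_h=64, stride=16))
labels = [classify_patch(w) for w in windows]
print(len(windows))  # 999 classifier calls for a single window size
```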

Why Sliding Windows Makes Sense

This approach is intuitive because it turns detection into many small classification problems.

Instead of asking the model to understand the whole image at once, we ask:

Does this patch contain a dog?
Does this patch contain a car?
Does this patch contain a person?

That idea is conceptually easy to understand, which is why it is a good starting point for learning.

The Main Problem

Sliding windows is very slow.

Why?

Because the model must check:

Many positions
Many window sizes
Many aspect ratios

For one image, that can mean thousands of separate classifications.

Why Multiple Sizes Are Needed

Objects do not all appear at the same scale.

A nearby car may be large
A far-away car may be tiny

So one fixed-size window is not enough.
You need many windows of different sizes.

That increases the computation even more.
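To get a feel for how quickly the cost grows, the sketch below counts window positions for several sizes and aspect ratios. The helper `count_windows`, the three scales, the three aspect ratios, and the stride are all assumptions chosen for illustration, not values from any particular detector.

```python
def count_windows(image_w, image_h, win_w, win_h, stride):
    """Number of window positions of one size that fit in the image."""
    if win_w > image_w or win_h > image_h:
        return 0
    cols = (image_w - win_w) // stride + 1
    rows = (image_h - win_h) // stride + 1
    return cols * rows


# Hypothetical choices: three scales and three aspect ratios (width / height).
scales = [64, 128, 256]
aspect_ratios = [0.5, 1.0, 2.0]

total = 0
for scale in scales:
    for ratio in aspect_ratios:
        win_w = int(scale * ratio ** 0.5)
        win_h = int(scale / ratio ** 0.5)
        total += count_windows(640, 480, win_w, win_h, stride=16)

print(total)  # total classifier calls for one 640x480 image
```

Every extra scale or aspect ratio adds another full sweep over the image, and a smaller stride multiplies the count again.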

Main Weaknesses of Sliding Windows

Too slow for real-time use
Repeats similar computations again and again
Struggles to handle different object sizes efficiently
Produces many overlapping candidate boxes
Becomes expensive as images get larger

Why It Still Matters

You should learn sliding windows because it explains the motivation behind modern object detection.

It helps answer an important question:

Why did researchers want a new approach like YOLO?

Because they wanted a detector that could look at the image once instead of thousands of times.

Sliding Window

The Big Shift: Single-Shot Detectors

Modern object detection became much faster when models stopped treating detection as thousands of separate classification problems.

What is a Single-Shot Detector?

A single-shot detector predicts object locations and class labels in one forward pass through the network.

That means the model looks at the image once and directly outputs:

Bounding boxes
Confidence scores
Class predictions

This is very different from the sliding windows idea.

Why “Single-Shot” Matters

The phrase “single-shot” means detection happens in one main pass rather than through many repeated searches.

This creates major advantages:

Much faster inference
Better fit for real-time applications
More efficient use of shared image features

Instead of separately analyzing every patch, the network learns the whole detection task end to end.

Why This Leads to YOLO

YOLO is one of the best-known single-shot detectors.

Its name stands for:

You Only Look Once

That name captures the core idea perfectly:

Do not scan the image piece by piece
Do not run thousands of mini-classifications
Look once, predict everything

What is the YOLO Algorithm?

YOLO is an object detection algorithm that predicts bounding boxes and class probabilities directly from the image in one pass.

The Core Idea

YOLO treats object detection as a single learning problem.

Instead of splitting the task into many separate steps, it trains one model to learn:

Where objects are
Whether an object exists
What class each object belongs to

This makes YOLO simple in concept and powerful in practice.

The Grid Idea

A beginner-friendly way to understand classic YOLO is this:

Divide the image into a grid
Each grid cell becomes responsible for detecting objects whose center falls inside it
Each cell predicts one or more bounding boxes
Each predicted box includes a confidence score and class information

This helps the model organize predictions across the image.
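The "responsible cell" rule comes down to an integer division. The sketch below assumes a square grid laid over the whole image; the function name `responsible_cell` and the 448x448 image with a 7x7 grid are example choices for illustration.

```python
def responsible_cell(center_x, center_y, image_w, image_h, grid_size):
    """Return (row, col) of the grid cell that contains the object's center."""
    col = int(center_x / image_w * grid_size)
    row = int(center_y / image_h * grid_size)
    # Clamp so a center exactly on the right or bottom edge stays in the grid.
    return (min(row, grid_size - 1), min(col, grid_size - 1))


# A 448x448 image split into a 7x7 grid: each cell covers 64x64 pixels.
print(responsible_cell(200, 100, 448, 448, grid_size=7))  # (1, 3)
```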

Grid YOLO

What YOLO Predicts

For each candidate object, YOLO typically predicts:

Box position
Box size
Objectness score
Class probabilities

Let’s break that down.

1. Box Position

The model predicts where the object is located.

2. Box Size

The model predicts how wide and tall the object is.

3. Objectness Score

This answers:

“Is there really an object here?”
“How confident is the model that this box contains something meaningful?”

4. Class Probabilities

If an object is present, the model predicts what it is:

person
dog
car
bicycle
and so on

Why YOLO Is Fast

YOLO is fast because the network shares computation across the entire image.

Instead of repeatedly analyzing overlapping regions, it processes the image once and reuses learned visual features to make all predictions together.

That makes YOLO especially useful in applications like:

Self-driving systems
Security cameras
Robotics
Sports analytics
Mobile vision apps

The Main Tradeoff

YOLO gains speed by predicting every box directly in a single pass, with no separate proposal or refinement stage.

Historically, this sometimes meant a tradeoff:

Extremely fast detection
But more difficulty with very small objects or crowded scenes

That tradeoff has improved a lot across newer YOLO versions, but the core learning idea stays the same.

How YOLO Makes Predictions

Prediction Workflow

Step 1: The Image Enters the Network

The full image is sent into a convolutional neural network.

The network extracts visual features such as:

Edges
Shapes
Textures
Patterns

These features help the model recognize objects.

Step 2: The Model Predicts Candidate Boxes

The detector predicts many possible boxes across the image.

Each candidate box usually includes:

Position
Size
Objectness
Class scores

At this stage, there may be many candidate boxes for the same object.

Step 3: Low-Confidence Predictions Are Removed

Predictions with very low confidence are usually filtered out first.

This reduces noise and keeps only boxes that seem meaningful.
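As a rough illustration, confidence filtering is a single comparison per candidate. In the sketch below the dictionary layout, the function name `filter_by_confidence`, and the threshold of 0.25 are assumptions; real detectors expose this threshold as a tunable setting.

```python
def filter_by_confidence(detections, threshold=0.25):
    """Keep only detections whose score is at least the threshold."""
    return [d for d in detections if d["score"] >= threshold]


candidates = [
    {"label": "person", "score": 0.96},
    {"label": "dog", "score": 0.91},
    {"label": "dog", "score": 0.08},
    {"label": "bicycle", "score": 0.03},
]

kept = filter_by_confidence(candidates)
print([d["label"] for d in kept])  # ['person', 'dog']
```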

Step 4: Overlapping Predictions Are Cleaned Up

Many boxes may point to the same object.

So the detector applies non-max suppression to remove duplicates.

Step 5: Final Detections Remain

The final result is a smaller set of predictions:

One box per object, ideally
A class label
A confidence score

This makes the output readable and useful.

IoU: Intersection over Union

IoU is one of the most important ideas in object detection.

What Is IoU?

IoU stands for Intersection over Union.

It measures how much two bounding boxes overlap.

Usually, those two boxes are:

The predicted box
The ground-truth box (the correct answer from the labeled dataset)

IoU

Why IoU Matters

A predicted box may classify the right object but still be badly placed.

IoU helps answer:

How well does the predicted box match the true box?

A higher IoU means better overlap.

How Do You Calculate IoU?

IoU is defined as:

\[ IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}} \]

Where:

Overlap = the shared area between the predicted box and the true box
Union = total area covered by both boxes together

Intuition

If two boxes do not overlap at all, IoU = 0
If they overlap perfectly, IoU = 1
If they overlap partly, IoU is somewhere between 0 and 1

Simple Example

Suppose:

Predicted box area = 100
Ground-truth box area = 120
Overlap area = 80

Then:

\[ IoU = \frac{80}{100 + 120 - 80} = \frac{80}{140} \approx 0.57 \]

So the boxes overlap fairly well, but not perfectly.
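The same calculation can be done from box coordinates. The sketch below assumes corner format \((x_1, y_1, x_2, y_2)\) with \(x_1 < x_2\) and \(y_1 < y_2\); the function names are illustrative. The two example boxes were chosen so that their areas (100 and 120) and overlap (80) match the numbers above.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; empty or inverted boxes count as 0."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)


def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    # The overlap region is bounded by the innermost edges of the two boxes.
    overlap = (
        max(box_a[0], box_b[0]),
        max(box_a[1], box_b[1]),
        min(box_a[2], box_b[2]),
        min(box_a[3], box_b[3]),
    )
    intersection = box_area(overlap)
    union = box_area(box_a) + box_area(box_b) - intersection
    return intersection / union if union > 0 else 0.0


predicted = (0, 0, 10, 10)     # area 100
ground_truth = (2, 0, 14, 10)  # area 120, overlap with predicted is 80
print(round(iou(predicted, ground_truth), 2))  # 0.57
```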

Why IoU Is Used

IoU is used in at least two important ways:

To decide whether a predicted box counts as correct
To help remove duplicate predictions during non-max suppression

In many evaluation settings, a prediction is considered correct only if its IoU is above some threshold, such as 0.5.


What is Non-Max Suppression?

When a detector looks at an object, it often predicts several similar boxes around the same thing.

That creates duplicates.

The Problem

Imagine one dog in the image, but the model predicts:

Box A with confidence 0.95
Box B with confidence 0.91
Box C with confidence 0.87

All three boxes overlap heavily and describe the same dog.

Without cleanup, the detector would appear to find three dogs instead of one.

What Non-Max Suppression Does

Non-max suppression, often called NMS, removes redundant boxes and keeps the best one.

The name describes the rule: within a group of overlapping boxes, every box that is not the maximum-confidence one is suppressed.

How NMS Works

Start with all predicted boxes
Keep the box with the highest confidence
Compare it with the remaining boxes
Remove boxes whose IoU with it is above the NMS threshold
Repeat with the highest-confidence box that has not yet been kept or removed, until no boxes remain

Why IoU Appears Again

NMS uses IoU to decide whether two boxes are “too similar.”

If two boxes overlap a lot, they probably refer to the same object.

So NMS suppresses the weaker one.

Example

Suppose the top-scoring box has confidence 0.95.
If another box overlaps it with IoU 0.80, and the NMS threshold is 0.5, the second box is usually removed.

Why?

Because it is probably just a duplicate prediction of the same object.
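The procedure can be written as a short greedy loop. This is a minimal sketch, not a production implementation: it assumes corner-format boxes, treats all boxes as belonging to one class (detectors often run NMS separately per class), and uses hypothetical coordinates for the three overlapping dog boxes plus one unrelated box.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes that survive, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-confidence box still under consideration
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [
    (100, 100, 200, 200),  # Box A
    (105, 105, 205, 205),  # Box B, overlaps A heavily
    (110, 95, 210, 195),   # Box C, overlaps A heavily
    (300, 300, 380, 380),  # a separate object elsewhere in the image
]
scores = [0.95, 0.91, 0.87, 0.80]
print(non_max_suppression(boxes, scores))  # [0, 3]
```

Boxes B and C are suppressed because their IoU with Box A exceeds 0.5, while the distant box survives because it does not overlap any kept box.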

Important Clarification

NMS does not improve a bad box.
It only filters overlapping predictions.

That distinction is important.

You may think NMS “fixes” detection quality.
It does not.

It simply chooses which predictions survive.

What Are Anchor Boxes?

Anchor boxes are a way to help the detector predict objects of different shapes and sizes.

Why They Are Needed

A single location in an image may need to describe different possible objects:

A tall person
A wide car
A small dog
A narrow bottle

If the model predicted only one generic box shape at each location, it would struggle to represent all these possibilities.

The Main Idea

Anchor boxes are predefined reference boxes.

They come in different:

Widths
Heights
Aspect ratios

The model does not always predict a box completely from scratch.
Instead, it learns how to adjust these reference boxes to better match actual objects.

Intuition

Think of anchor boxes as starting templates.

For example, at one image location the model may consider:

A tall narrow box
A medium square box
A wide short box

Then it learns which one best fits the object and how to shift or resize it.
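One common way to express "shift or resize" is the parameterization used in YOLOv2 and YOLOv3, where the network outputs four raw offsets per anchor. The sketch below follows that scheme as an assumption; other detectors and newer YOLO versions use different formulas, and the function name `decode_anchor` is illustrative. Coordinates are measured in grid-cell units.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_anchor(tx, ty, tw, th, cell_col, cell_row, anchor_w, anchor_h):
    """Turn raw offsets into a box (x, y, w, h) measured in grid-cell units."""
    x = cell_col + sigmoid(tx)   # the sigmoid keeps the center inside its cell
    y = cell_row + sigmoid(ty)
    w = anchor_w * math.exp(tw)  # anchor width scaled by a learned factor
    h = anchor_h * math.exp(th)  # anchor height scaled by a learned factor
    return (x, y, w, h)


# Zero offsets leave the anchor's shape unchanged and center it in its cell.
print(decode_anchor(0, 0, 0, 0, cell_col=3, cell_row=1, anchor_w=2.0, anchor_h=4.0))
# (3.5, 1.5, 2.0, 4.0)
```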

Why Anchor Boxes Help

They make it easier to detect:

Multiple objects from the same region
Objects with different proportions
Objects that appear at different scales

Important Clarification

Anchor boxes are not actual detections.
They are starting shapes used to guide prediction.

That is an important distinction.

How YOLO Uses Confidence

In object detection, confidence is not just “how sure the class is.”

Practitioners often confuse several different scores, so it helps to separate them.

Objectness Score

This score answers:

Is there an object in this box?
Or is this mostly background?

Class Probability

This score answers:

If an object is present, what class is it?

Confidence in Practice

A final detection score often reflects both:

Whether an object exists
How likely the predicted class is

So a box with high class probability but low objectness may still not be trusted much.

This is why two boxes with the same label can still have different final scores.
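A small sketch makes the interaction concrete. It assumes the final score is the product of objectness and the best class probability, as in classic YOLO versions; some detectors combine or omit these scores differently. The function name `detection_score` and the numbers are made up for illustration.

```python
def detection_score(objectness, class_probs):
    """Return (best_class, objectness * probability of that class)."""
    best_class = max(class_probs, key=class_probs.get)
    return best_class, objectness * class_probs[best_class]


# Same predicted label, very different final scores.
label_a, score_a = detection_score(0.9, {"dog": 0.80, "cat": 0.20})
label_b, score_b = detection_score(0.3, {"dog": 0.95, "cat": 0.05})

print(label_a, round(score_a, 3))  # dog 0.72
print(label_b, round(score_b, 3))  # dog 0.285
```

The second box has the higher class probability, yet its low objectness pulls the final score well below the first.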

How to Evaluate Object Detection

A detector is not judged only by whether it names objects correctly.

It must also place boxes accurately.

That is why object detection evaluation is more complex than simple classification accuracy.

Precision and Recall

Before mAP, you should know two basic ideas.

Precision

Precision asks:

Of all the boxes the model predicted as objects, how many were correct?

High precision means:

Few false positives
The model does not make many bad detections

Recall

Recall asks:

Of all the true objects in the image, how many did the model find?

High recall means:

Few missed objects
The model finds most of what is actually there

These two ideas often trade off against each other.

A very cautious detector may have high precision but low recall
A very aggressive detector may have high recall but lower precision
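Both metrics follow directly from counts of true positives, false positives, and false negatives. The sketch below uses invented counts for a scene with 10 real objects to illustrate the cautious and aggressive cases; the function name is illustrative.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute (precision, recall) from detection counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Cautious detector: 4 detections, all correct, but 6 objects missed.
print(precision_recall(4, 0, 6))   # (1.0, 0.4)

# Aggressive detector: finds 9 of 10 objects, with 11 false alarms.
print(precision_recall(9, 11, 1))  # (0.45, 0.9)
```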

What is mAP?

mAP is the standard summary metric for object detection.

It is one of the most important evaluation concepts to understand.

What Does mAP Mean?

mAP stands for mean Average Precision.

Let’s separate the parts.

Precision tells how many predicted detections are correct
Average Precision (AP) summarizes precision across different recall levels for one class
mAP is the mean of AP values across classes

So mAP gives one overall number that reflects detection quality across the dataset.

Why mAP Is Better Than Plain Accuracy

Plain accuracy is not enough for object detection because:

There may be multiple objects per image
Bounding box quality matters
Confidence thresholds matter
Duplicate detections matter

mAP captures this more realistically.

High-Level Process for Calculating mAP

A beginner-friendly version is:

Choose a class, such as “dog”
Collect all predicted dog boxes from the dataset
Sort them by confidence score
Match predictions to ground-truth boxes
Decide whether each prediction is a true positive or false positive using an IoU threshold
Build a precision-recall curve
Compute the area under that curve = AP for that class
Repeat for all classes
Average the AP values = mAP
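The steps for a single class can be sketched as follows. This is a simplified illustration under several assumptions: each prediction is matched to the ground-truth box it overlaps most, each ground-truth box can be matched only once, and the area under the curve uses the common "precision envelope" interpolation. Benchmarks differ in these details, and the data layout and function names here are hypothetical.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(predictions, ground_truths, iou_threshold=0.5):
    """AP for one class.

    predictions: list of (image_id, score, box)
    ground_truths: dict mapping image_id to a list of boxes
    """
    total_gt = sum(len(boxes) for boxes in ground_truths.values())
    if total_gt == 0:
        return 0.0
    matched = {img: [False] * len(boxes) for img, boxes in ground_truths.items()}
    tp = fp = 0
    points = []  # (recall, precision) after each prediction, best score first
    for image_id, _, box in sorted(predictions, key=lambda p: p[1], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(ground_truths.get(image_id, [])):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= iou_threshold and not matched[image_id][best_j]:
            matched[image_id][best_j] = True  # true positive
            tp += 1
        else:
            fp += 1  # poor overlap, or a duplicate of an already matched object
        points.append((tp / total_gt, tp / (tp + fp)))
    # Area under the precision-recall curve, using the best precision
    # achievable at each recall level or beyond.
    ap, prev_recall = 0.0, 0.0
    for i, (recall, _) in enumerate(points):
        ap += (recall - prev_recall) * max(p for _, p in points[i:])
        prev_recall = recall
    return ap


dog_truth = {"img1": [(0, 0, 10, 10)], "img2": [(0, 0, 10, 10)]}
dog_preds = [
    ("img1", 0.9, (0, 0, 10, 10)),    # perfect match
    ("img2", 0.8, (50, 50, 60, 60)),  # no overlap: false positive
    ("img2", 0.7, (1, 1, 10, 10)),    # IoU 0.81: true positive
]
print(round(average_precision(dog_preds, dog_truth), 3))  # 0.833
```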

How IoU Is Used in mAP

A prediction is usually counted as correct only if:

The class is correct
The IoU with a ground-truth box is above a threshold

For example:

mAP@0.5 means a prediction counts as correct if IoU \(\geq 0.5\)
A stricter evaluation may require higher overlap

AP vs mAP

This is a common source of confusion.

AP = one class
mAP = average across classes

Example:

AP for person = 0.82
AP for dog = 0.76
AP for bicycle = 0.70

Then mAP is the average of those values: (0.82 + 0.76 + 0.70) / 3 = 0.76.

What About mAP@0.5:0.95?

Some benchmarks, such as COCO, report mAP averaged over multiple IoU thresholds, typically from 0.5 to 0.95 in steps of 0.05.

This is stricter than only using 0.5.

Why?

Because the detector must perform well not only at loose overlap thresholds, but also at tighter ones.

That rewards more accurate localization.
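As a rough sketch, the averaging looks like this. The ten thresholds follow the 0.5 to 0.95 range in steps of 0.05; the AP values are made up purely to show a score that falls as the IoU requirement tightens, and `mean_over_thresholds` is an illustrative name.

```python
def mean_over_thresholds(ap_at_threshold):
    """Average AP values that were computed at different IoU thresholds."""
    return sum(ap_at_threshold.values()) / len(ap_at_threshold)


thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
print(thresholds)  # [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]

# Made-up AP values: 0.80 at IoU 0.5, dropping steadily to 0.44 at IoU 0.95.
ap_at_threshold = {t: round(0.80 - 0.8 * (t - 0.5), 2) for t in thresholds}

print(round(mean_over_thresholds(ap_at_threshold), 2))  # 0.62
```

In this invented example the detector scores 0.80 at the loose threshold but only 0.62 once the tighter thresholds are included.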

How the Core Ideas Connect

At this point, the major concepts should fit together like one system.

Object detection finds objects and their locations
Sliding windows is the older, slower way to do this
Single-shot detectors make predictions in one pass
YOLO is a famous single-shot detector
IoU measures how well boxes overlap
Non-max suppression removes duplicate boxes
Anchor boxes help model different shapes and sizes
mAP measures overall detection performance

This is the big picture:

The model predicts many boxes
Confidence scores rank those boxes
IoU helps judge overlap quality
NMS removes duplicates
mAP evaluates how well the full system performs

Common Pitfalls

These are the mistakes practitioners make most often.

1. Confusing Classification with Detection

Classification predicts one label for an image.
Detection predicts many boxes plus labels.

2. Thinking IoU Measures Confidence

It does not.
IoU measures overlap between boxes.

3. Thinking NMS Improves Prediction Quality

It does not fix bad boxes.
It only removes redundant ones.

4. Thinking Anchor Boxes Are Final Objects

They are just reference shapes used during prediction.

5. Ignoring Box Location During Evaluation

A detector can guess the right class and still be wrong if the box is badly placed.

6. Confusing AP with mAP

AP is for one class.
mAP is averaged over classes.

7. Believing YOLO Is Only About Speed

YOLO is famous for speed, but the deeper idea is unified end-to-end detection in one model.

Intuition Recap

A good way to remember the topic is this:

Sliding windows: search everywhere, many times
Single-shot detection: predict everything together
YOLO: one network, one pass, many predictions
IoU: how well boxes overlap
NMS: remove duplicate boxes
Anchor boxes: starting shapes for flexible predictions
mAP: the main score for detection quality

If you understand these seven ideas clearly, you already have a strong foundation in object detection.

Final Learning Checklist

You should be able to explain each of these after reading:

What object detection is
Why it is harder than classification
How sliding windows works
Why sliding windows is slow
What a single-shot detector is
Why YOLO is fast
What YOLO predicts
How IoU is computed
Why non-max suppression is necessary
What anchor boxes do
What mAP measures

That is the conceptual foundation needed before moving on to training pipelines, loss functions, datasets, or implementation details.