Object Detection

YOLO-based real-time object localization in video streams

The detection stage uses YOLO (You Only Look Once) to locate objects in video frames with bounding boxes and confidence scores.

Overview

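YOLO processes each video frame in a single forward pass, identifying and localizing every object it recognizes. The resulting bounding box coordinates, confidence scores, and class labels are passed to the downstream matching stage.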

YOLO Architecture

Component	Description
Backbone	Feature extraction network (CSPDarknet)
Neck	Path aggregation network (PANet) that fuses features across scales
Head	Detection output layer (box coordinates and class scores)

Model Variants

Model	Weights Size	Speed	Accuracy	Use Case
YOLOv8n	6 MB	Fastest	Good	Edge devices, real-time
YOLOv8s	22 MB	Fast	Better	Balanced performance
YOLOv8m	52 MB	Medium	High	Server deployment
YOLOv8l	87 MB	Slower	Higher	High accuracy needs
YOLOv8x	137 MB	Slowest	Best	Maximum accuracy

Detection Pipeline

Step	Action	Output
1	Frame capture	Raw image (BGR)
2	Preprocessing	Resized, normalized tensor
3	YOLO inference	Raw predictions
4	Non-max suppression	Filtered detections
5	Output formatting	Bounding boxes + scores
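
Step 4 is the least obvious part of the pipeline, so here is a minimal sketch of greedy non-max suppression in plain Python. It assumes boxes in `[x1, y1, x2, y2]` form and ignores class labels. The function names `iou` and `nms` are illustrative and not part of the pipeline's API. A real deployment would normally rely on the vectorized NMS provided by the inference framework.

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45, max_detections=300):
    """Greedy NMS: return indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if len(keep) == max_detections:
                break
    return keep


# Two near-duplicate boxes and one separate box (made-up values).
boxes = [[10, 10, 110, 110], [12, 12, 112, 112], [200, 200, 260, 260]]
scores = [0.90, 0.80, 0.75]
kept = nms(boxes, scores)  # the second box is suppressed by the first
```

The greedy loop is quadratic in the number of candidates, which is usually acceptable because the confidence filter removes most raw predictions before NMS runs.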

Detection Output

Each detection includes:

Field	Type	Description
`bbox`	[x1, y1, x2, y2]	Bounding box corners: top-left (x1, y1) and bottom-right (x2, y2)
`confidence`	float	Detection confidence (0-1)
`class_id`	int	YOLO class index
`class_name`	string	Human-readable label
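
One way to represent a detection in code is a small dataclass, sketched below. The field names come from the table above. The `Detection` class name, the validation rules, the `area` helper, and the assumption that `bbox` holds pixel coordinates are illustrative choices, not something the pipeline is known to define.

```python
from dataclasses import dataclass, asdict
from typing import List


@dataclass(frozen=True)
class Detection:
    bbox: List[float]   # [x1, y1, x2, y2], assumed to be pixel coordinates
    confidence: float   # detection confidence in [0, 1]
    class_id: int       # YOLO class index
    class_name: str     # human-readable label

    def __post_init__(self):
        x1, y1, x2, y2 = self.bbox
        if x2 < x1 or y2 < y1:
            raise ValueError("bbox must satisfy x1 <= x2 and y1 <= y2")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")

    @property
    def area(self) -> float:
        """Box area in squared coordinate units."""
        x1, y1, x2, y2 = self.bbox
        return (x2 - x1) * (y2 - y1)


# Example values are made up; the label assumes a COCO-style class list.
det = Detection(bbox=[34.0, 50.0, 134.0, 250.0], confidence=0.87,
                class_id=0, class_name="person")
record = asdict(det)  # plain dict, ready for serialization
```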

Configuration

Parameter	Default	Description
`conf_threshold`	0.25	Minimum confidence for a detection to be kept
`iou_threshold`	0.45	IoU above which overlapping boxes are suppressed during NMS
`max_detections`	300	Maximum number of objects returned per frame
`img_size`	640	Input resolution (pixels)
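
The parameters above could be grouped into a single configuration object, as in the following sketch. The defaults match the table. The `DetectorConfig` and `apply_thresholds` names are hypothetical, and the multiple-of-32 check on `img_size` is an assumption based on the largest stride commonly used by YOLO models.

```python
from dataclasses import dataclass


@dataclass
class DetectorConfig:
    conf_threshold: float = 0.25
    iou_threshold: float = 0.45
    max_detections: int = 300
    img_size: int = 640

    def validate(self):
        """Raise ValueError if any parameter is out of range."""
        if not 0.0 <= self.conf_threshold <= 1.0:
            raise ValueError("conf_threshold must be in [0, 1]")
        if not 0.0 <= self.iou_threshold <= 1.0:
            raise ValueError("iou_threshold must be in [0, 1]")
        if self.max_detections < 1:
            raise ValueError("max_detections must be positive")
        # Assumption: the input size should divide evenly by the largest stride.
        if self.img_size < 32 or self.img_size % 32 != 0:
            raise ValueError("img_size should be a positive multiple of 32")


def apply_thresholds(scored, cfg):
    """Keep (score, item) pairs at or above conf_threshold, best first,
    capped at max_detections."""
    kept = [pair for pair in scored if pair[0] >= cfg.conf_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[: cfg.max_detections]


cfg = DetectorConfig()
cfg.validate()
filtered = apply_thresholds([(0.9, "a"), (0.1, "b"), (0.3, "c")], cfg)
```

Raising `conf_threshold` trades recall for fewer false positives, so it is worth tuning against the downstream matching stage rather than in isolation.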

Performance Optimization

GPU Acceleration

Backend	Description
CUDA	Inference on NVIDIA GPUs
TensorRT	Optimized inference engine for NVIDIA GPUs
CoreML	Optimized inference on Apple Silicon

Batching

Approach	Latency	Throughput
Single frame	Low	Lower
Batch processing	Higher	Higher
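
The trade-off in the table follows from how batches are formed: the first frame in a batch has to wait until the batch is full before inference starts. A minimal sketch of a frame batcher is shown below. The `batched` helper is illustrative and not part of the pipeline.

```python
from itertools import islice


def batched(frames, batch_size):
    """Yield lists of up to batch_size frames from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(frames)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Integers stand in for frames here; the last batch may be smaller.
batches = list(batched(range(10), 4))
```

With `batch_size=1` this reduces to single-frame processing, which keeps per-frame latency low at the cost of throughput.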

Frame Processing

Mode	Description	Use Case
Stream	Continuous video processing	Live camera feeds
Batch	Process image directory	Offline analysis
Single	One image at a time	API requests

Multi-scale Detection

Feature	Description
FPN	Feature pyramid that combines feature maps at several scales
Large objects	Detected on lower-resolution (coarser) feature maps
Small objects	Detected on higher-resolution (finer) feature maps
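
To make the scales concrete, the sketch below computes the feature map size at each detection scale. It assumes the strides 8, 16, and 32 typically used by YOLOv8 and a square input. The `feature_map_sizes` helper is illustrative.

```python
def feature_map_sizes(img_size=640, strides=(8, 16, 32)):
    """Map each stride to its (height, width) grid for a square input."""
    return {s: (img_size // s, img_size // s) for s in strides}


sizes = feature_map_sizes()
# Stride 8 gives the finest grid (small objects),
# stride 32 the coarsest grid (large objects).
total_cells = sum(h * w for h, w in sizes.values())
```

At a 640-pixel input this gives grids of 80x80, 40x40, and 20x20, or 8400 candidate locations per frame before confidence filtering and NMS.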

Output Format

Detection results per frame:

Key	Type	Description
`frame_id`	int	Sequential frame number
`timestamp`	float	Frame timestamp
`detections`	array	List of detection objects
`inference_time`	float	Processing time (ms)
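
A per-frame result with these keys could be assembled and serialized as sketched below. The `build_frame_result` helper is hypothetical, the example values are made up, and the sketch assumes `timestamp` is expressed in seconds, which the table does not specify.

```python
import json


def build_frame_result(frame_id, timestamp, detections, started_at, finished_at):
    """Assemble one per-frame result; start/finish times are in seconds."""
    return {
        "frame_id": frame_id,
        "timestamp": timestamp,
        "detections": detections,
        # Convert elapsed seconds to milliseconds.
        "inference_time": (finished_at - started_at) * 1000.0,
    }


result = build_frame_result(
    frame_id=42,
    timestamp=1.4,  # assumed: seconds since the start of the stream
    detections=[{
        "bbox": [34.0, 50.0, 134.0, 250.0],
        "confidence": 0.87,
        "class_id": 0,
        "class_name": "person",
    }],
    started_at=10.0,     # in practice, taken from time.perf_counter()
    finished_at=10.0125,
)
payload = json.dumps(result)  # JSON string for the downstream stage
```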
