Object Detection
YOLO-based real-time object localization in video streams
The detection stage runs YOLO (You Only Look Once) on each video frame to identify and localize objects. For every object it outputs bounding box coordinates and a confidence score, which are passed on to downstream matching.

| Component | Description |
|---|---|
| Backbone | Feature extraction network (CSPDarknet) |
| Neck | Feature pyramid network (PANet) |
| Head | Detection output layer |

| Model | Size | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| YOLOv8n | 6 MB | Fastest | Good | Edge devices, real-time |
| YOLOv8s | 22 MB | Fast | Better | Balanced performance |
| YOLOv8m | 52 MB | Medium | High | Server deployment |
| YOLOv8l | 87 MB | Slower | Higher | High accuracy needs |
| YOLOv8x | 137 MB | Slowest | Best | Maximum accuracy |

| Step | Action | Output |
|---|---|---|
| 1 | Frame capture | Raw image (BGR) |
| 2 | Preprocessing | Resized, normalized tensor |
| 3 | YOLO inference | Raw predictions |
| 4 | Non-max suppression | Filtered detections |
| 5 | Output formatting | Bounding boxes + scores |
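
Step 4 of the pipeline above can be illustrated with a minimal greedy non-max suppression in plain Python. This is a sketch under stated assumptions, not the routine the detector itself uses: boxes are taken to be in `[x1, y1, x2, y2]` corner format, suppression is class-agnostic, and the names `iou` and `non_max_suppression` are hypothetical helpers introduced only for this example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes: List[Box], scores: List[float],
                        iou_threshold: float = 0.45) -> List[int]:
    """Return indices of the boxes to keep, highest score first.

    A box is kept only if its IoU with every already-kept box
    does not exceed iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.68, which is above the 0.45 threshold, so only the first and third boxes survive. Production pipelines typically run NMS per class and on the GPU; the quadratic loop here is only meant to show the logic.
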
Each detection includes:

| Field | Type | Description |
|---|---|---|
| `bbox` | [x1, y1, x2, y2] | Bounding box coordinates |
| `confidence` | float | Detection confidence (0-1) |
| `class_id` | int | YOLO class index |
| `class_name` | string | Human-readable label |
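
One way to represent a single detection in code is a small dataclass that mirrors the four fields above. This is a sketch, not the project's actual type: the class name `Detection` and the validation rules are assumptions, and the sample values (including pixel coordinates and the `person` label for class 0) are made up for illustration.

```python
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class Detection:
    bbox: List[float]     # [x1, y1, x2, y2]
    confidence: float     # expected to lie in [0, 1]
    class_id: int         # YOLO class index
    class_name: str       # human-readable label

    def __post_init__(self) -> None:
        # Reject malformed records early instead of failing downstream.
        if len(self.bbox) != 4:
            raise ValueError("bbox must be [x1, y1, x2, y2]")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


# Illustrative values only.
det = Detection(bbox=[34.0, 50.5, 210.0, 300.0],
                confidence=0.87, class_id=0, class_name="person")
print(asdict(det))
```

`asdict` turns the record into a plain dictionary, which is convenient when detections are serialized for downstream consumers.
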

| Parameter | Default | Description |
|---|---|---|
| `conf_threshold` | 0.25 | Minimum confidence to keep |
| `iou_threshold` | 0.45 | NMS IoU threshold |
| `max_detections` | 300 | Maximum objects per frame |
| `img_size` | 640 | Input resolution |
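
The sketch below shows how `conf_threshold` and `max_detections` might act on a list of candidate scores. The `DetectorConfig` class and `select_indices` function are hypothetical names, and treating the threshold as inclusive (`>=`) is an assumption. In a full pipeline, NMS (governed by `iou_threshold`) would run between the confidence filter and the cap; it is left out here to keep the example short.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DetectorConfig:
    # Defaults taken from the parameter table above.
    conf_threshold: float = 0.25
    iou_threshold: float = 0.45
    max_detections: int = 300
    img_size: int = 640


def select_indices(scores: List[float], cfg: DetectorConfig) -> List[int]:
    """Indices of candidates that pass the confidence filter,
    sorted by descending score and capped at max_detections."""
    passing = [i for i, s in enumerate(scores) if s >= cfg.conf_threshold]
    passing.sort(key=lambda i: scores[i], reverse=True)
    return passing[:cfg.max_detections]


cfg = DetectorConfig(max_detections=2)
chosen = select_indices([0.10, 0.90, 0.30, 0.50], cfg)
```

With these inputs the 0.10 candidate is dropped by the confidence filter, and the cap of 2 keeps only the two highest-scoring survivors (indices 1 and 3).
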

| Setting | Description |
|---|---|
| CUDA | NVIDIA GPU inference |
| TensorRT | Optimized NVIDIA inference |
| CoreML | Apple Silicon optimization |

| Approach | Latency | Throughput |
|---|---|---|
| Single frame | Low | Lower |
| Batch processing | Higher | Higher |
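
The trade-off above comes from grouping: a batch cannot be submitted until enough frames have arrived, which adds latency, but the model then processes several frames per call. A minimal grouping helper might look like the following. The function name `batched` is a hypothetical choice, and the integers stand in for frames.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(frames: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size frames; the last batch may be shorter."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for frame in frames:
        batch.append(frame)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the remainder
        yield batch


# Integers stand in for video frames.
groups = list(batched(range(5), batch_size=2))
```

Setting `batch_size=1` reproduces the single-frame approach, where each frame is forwarded as soon as it is captured.
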

| Mode | Description | Use Case |
|---|---|---|
| Stream | Continuous video processing | Live camera feeds |
| Batch | Process image directory | Offline analysis |
| Single | One image at a time | API requests |
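
| Feature | Description |
|---|---|
| FPN | Feature pyramid that combines feature maps at several resolutions, which helps with small objects |
| Large objects | Detected on lower-resolution (coarser) feature maps |
| Small objects | Detected on higher-resolution (finer) feature maps |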
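
To make the resolution difference concrete, the sketch below computes the grid size of each detection scale for a given input size. It assumes the strides 8, 16, and 32 commonly used by YOLOv8-style heads; the source does not state them, and `feature_grid_sizes` is a hypothetical helper.

```python
from typing import Dict, Tuple


def feature_grid_sizes(img_size: int = 640,
                       strides: Tuple[int, ...] = (8, 16, 32)) -> Dict[int, int]:
    """Map each stride to the side length of its square feature grid."""
    if img_size % max(strides) != 0:
        raise ValueError("img_size should be a multiple of the largest stride")
    return {s: img_size // s for s in strides}


grids = feature_grid_sizes(640)
# Total number of grid cells across all scales.
total_cells = sum(g * g for g in grids.values())
```

At 640 pixels the stride-8 grid is 80 x 80 cells (fine, suited to small objects) and the stride-32 grid is 20 x 20 cells (coarse, suited to large objects).
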
Detection results per frame:

| Key | Type | Description |
|---|---|---|
| `frame_id` | int | Sequential frame number |
| `timestamp` | float | Frame timestamp |
| `detections` | array | List of detection objects |
| `inference_time` | float | Processing time (ms) |
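
As an illustration of the per-frame structure above, the following sketch assembles one result and serializes it to JSON. The helper `build_frame_result` is hypothetical, the sample values are invented, and expressing `timestamp` in seconds is an assumption (the table does not give its unit).

```python
import json
from typing import Any, Dict, List


def build_frame_result(frame_id: int, timestamp: float,
                       detections: List[Dict[str, Any]],
                       inference_time: float) -> Dict[str, Any]:
    """Assemble the per-frame result; inference_time is in milliseconds."""
    return {
        "frame_id": frame_id,
        "timestamp": timestamp,
        "detections": detections,
        "inference_time": inference_time,
    }


# Illustrative values only.
result = build_frame_result(
    frame_id=42,
    timestamp=1.4,
    detections=[{
        "bbox": [34.0, 50.5, 210.0, 300.0],
        "confidence": 0.87,
        "class_id": 0,
        "class_name": "person",
    }],
    inference_time=12.5,
)
payload = json.dumps(result, indent=2)
print(payload)
```

A frame with no objects would simply carry an empty `detections` list, so consumers can rely on the same four keys for every frame.
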