Layers 2-4: Detection & Tracking
Person detection with YOLOv8 and face detection with InsightFace SCRFD
The detection layers form a short pipeline: YOLOv8 detects people, ByteTrack assigns each person a persistent ID, and InsightFace SCRFD detects faces in the same frame. Together they provide robust person tracking and accurate face detection as input to recognition.
| Layer | Component | Technology | Purpose |
|---|---|---|---|
| 2 | Person Detection | YOLOv8 | Detect all people in frame |
| 3 | Tracking | ByteTrack | Assign persistent IDs |
| 4 | Face Detection | InsightFace SCRFD | Locate faces in full frame |
YOLOv8 from Ultralytics provides fast and accurate person detection.
| Feature | Description |
|---|---|
| Real-time Detection | 20-30+ FPS on modern hardware |
| High Accuracy | State-of-the-art object detection |
| Person Only | Filters to class 0 (person) |
| Confidence Scores | Detection confidence per person |
- Frame is passed to YOLOv8 model
- Model detects all objects in frame
- Filter to class 0 (person) only
- Return bounding boxes with confidence scores
- Pass to ByteTrack for tracking
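The filtering step above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the `Detection` class and `filter_person_detections` function are hypothetical names, and the 0.8 threshold is taken from the YOLO Confidence default in the settings table. In practice, Ultralytics accepts `classes` and `conf` arguments on the model call that perform the same filtering inside the detector.

```python
from dataclasses import dataclass
from typing import List, Tuple

PERSON_CLASS_ID = 0  # class 0 is "person" in the YOLO Classes setting


@dataclass
class Detection:
    """Hypothetical container for one raw detector output."""
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    class_id: int
    confidence: float


def filter_person_detections(
    detections: List[Detection], min_confidence: float = 0.8
) -> List[Detection]:
    """Keep only person detections at or above the confidence threshold."""
    return [
        d for d in detections
        if d.class_id == PERSON_CLASS_ID and d.confidence >= min_confidence
    ]


raw = [
    Detection((10, 20, 110, 320), class_id=0, confidence=0.92),     # person, kept
    Detection((200, 40, 260, 300), class_id=0, confidence=0.55),    # person, below threshold
    Detection((300, 300, 380, 360), class_id=16, confidence=0.97),  # not a person
]
people = filter_person_detections(raw)
```

Only the first detection survives; the remaining boxes would then be handed to the tracker.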
| Model | Speed | Accuracy | Size | Use Case |
|---|---|---|---|---|
| yolov8n.pt | Fastest | Good | ~6MB | Real-time (default) |
| yolov8s.pt | Fast | Better | ~22MB | Balanced |
| yolov8m.pt | Medium | High | ~52MB | Higher accuracy |
| Setting | Description | Default |
|---|---|---|
| YOLO Model | Detection model | yolov8n.pt |
| YOLO Confidence | Minimum confidence | 0.8 |
| YOLO Classes | Object classes | [0] (person) |
ByteTrack provides persistent tracking IDs across video frames.
| Feature | Description |
|---|---|
| Persistent IDs | Stable track IDs across frames |
| Motion Prediction | Handles temporary occlusions |
| Re-identification | Recovers the ID when a person reappears shortly after being lost |
| Multi-object | Tracks multiple people simultaneously |
- Receive person detections from YOLOv8
- Match detections to existing tracks using IoU and motion
- Assign new track IDs to unmatched detections
- Predict locations for temporarily lost tracks
- Return TrackedPerson objects with stable IDs
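To make the matching step concrete, here is a deliberately simplified sketch that associates detections with existing tracks by IoU alone. It is not ByteTrack: it omits motion prediction, low-confidence second-pass association, and the buffer that keeps lost tracks alive, so an unmatched track is dropped immediately. `GreedyIoUTracker` and its 0.3 threshold are hypothetical choices for illustration.

```python
from itertools import count
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


class GreedyIoUTracker:
    """Toy tracker: greedy IoU matching, new IDs for unmatched detections."""

    def __init__(self, iou_threshold: float = 0.3) -> None:
        self.iou_threshold = iou_threshold
        self.tracks: Dict[int, Box] = {}
        self._next_id = count(1)

    def update(self, detections: List[Box]) -> Dict[int, Box]:
        # Score every (track, detection) pair and visit the best pairs first.
        pairs = sorted(
            (
                (iou(box, det), track_id, index)
                for track_id, box in self.tracks.items()
                for index, det in enumerate(detections)
            ),
            reverse=True,
        )
        assigned: Dict[int, Box] = {}
        used = set()
        for score, track_id, index in pairs:
            if score < self.iou_threshold:
                break  # remaining pairs overlap too little to be the same person
            if track_id in assigned or index in used:
                continue
            assigned[track_id] = detections[index]
            used.add(index)
        # Unmatched detections start new tracks.
        for index, det in enumerate(detections):
            if index not in used:
                assigned[next(self._next_id)] = det
        self.tracks = assigned
        return assigned


tracker = GreedyIoUTracker()
frame1 = tracker.update([(0, 0, 100, 200), (300, 0, 400, 200)])
# Both people move 5 px right and the detector reports them in reverse order.
frame2 = tracker.update([(305, 0, 405, 200), (5, 0, 105, 200)])
```

Even though the detection order changes between frames, each person keeps the ID assigned in the first frame because matching is driven by overlap rather than list position.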
| Tracker | Description |
|---|---|
| bytetrack.yaml | Default, fast and accurate |
| botsort.yaml | Alternative with different motion model |
| Setting | Description | Default |
|---|---|---|
| YOLO Tracker | Tracker config file | bytetrack.yaml |
| Track Mode | all or known_only | all |
| Show Track ID | Display IDs in output | True |
InsightFace SCRFD detects faces in the full frame for maximum accuracy.
| Feature | Description |
|---|---|
| Multi-face Detection | Detect multiple faces in single frame |
| High Accuracy | Works with varied angles and lighting |
| Landmarks | 5-point facial landmarks included |
| Embeddings | 512-d embedding extracted per face |
| Confidence Score | Detection confidence (det_score) |
- Full frame is passed to InsightFace (not cropped to person boxes)
- SCRFD detects all faces in the image
- Each face includes bbox, det_score, landmarks, embedding
- Quality filtering removes low-quality detections
- Valid faces proceed to recognition and IoU matching
- Higher accuracy - SCRFD optimized for full-frame detection
- Better small face detection - Not constrained by person crop
- Faster - Single detection pass instead of per-person crops
- IoU matching - Links faces to tracked persons by spatial overlap
Not all detected faces are suitable for recognition. The system filters by:
| Filter | Threshold | Purpose |
|---|---|---|
| Min Face Size | 50 pixels | Ensure face detail |
| Quality Threshold | 0.5 | Detection confidence |
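The two filters in the table can be expressed as a single predicate. The sketch below is illustrative only: `passes_quality` is a hypothetical name, and it assumes "Min Face Size" refers to the shorter side of the face bounding box, which the table does not specify.

```python
from typing import Sequence

MIN_FACE_SIZE = 50       # pixels, from the table above
QUALITY_THRESHOLD = 0.5  # minimum det_score, from the table above


def passes_quality(
    bbox: Sequence[float],
    det_score: float,
    min_face_size: float = MIN_FACE_SIZE,
    quality_threshold: float = QUALITY_THRESHOLD,
) -> bool:
    """Return True if a face is large and confident enough for recognition."""
    x1, y1, x2, y2 = bbox
    # Assumption: the size check applies to the shorter side of the box.
    shorter_side = min(x2 - x1, y2 - y1)
    return shorter_side >= min_face_size and det_score >= quality_threshold


large_confident = passes_quality((100, 100, 160, 170), det_score=0.90)
too_small = passes_quality((100, 100, 140, 170), det_score=0.90)
low_score = passes_quality((100, 100, 160, 170), det_score=0.30)
```

Here only the first face passes; the second is 40 pixels wide and the third falls below the confidence threshold.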
| Setting | Description | Default |
|---|---|---|
| Model Name | InsightFace model | buffalo_l |
| Detection Size | Input size for detector | 640x640 |
| GPU Support | Enable CUDA | Off |
| Min Face Size | Minimum face pixels | 50 |
| Quality Threshold | Minimum detection confidence | 0.5 |
| Model | Detection Speed | Accuracy | Size |
|---|---|---|---|
| buffalo_l | Medium | Highest | ~400MB |
| buffalo_m | Fast | High | ~200MB |
| buffalo_s | Fastest | Good | ~100MB |
| Size | Speed | Small Face Detection |
|---|---|---|
| 320x320 | Fast | Poor |
| 640x640 | Medium | Good (default) |
| 1280x1280 | Slow | Excellent |
After person tracking and face detection, IoU (Intersection over Union) matching links faces to tracked persons.
- For each tracked person bounding box
- Calculate IoU with each detected face bounding box
- Assign face with highest IoU to person
- Link face identity to person track ID
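The steps above can be sketched as follows. This is a minimal reading of the list, not the project's code: `match_faces_to_persons` is a hypothetical name, persons are given as a mapping from track ID to box, and nothing here prevents one face from being assigned to two overlapping persons.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_faces_to_persons(
    persons: Dict[int, Box], faces: List[Box]
) -> Dict[int, int]:
    """Map each track ID to the index of the face with the highest IoU.

    Persons that overlap no face are left out of the result.
    """
    matches: Dict[int, int] = {}
    for track_id, person_box in persons.items():
        best_index, best_iou = None, 0.0
        for index, face_box in enumerate(faces):
            score = iou(person_box, face_box)
            if score > best_iou:
                best_index, best_iou = index, score
        if best_index is not None:
            matches[track_id] = best_index
    return matches


persons = {
    1: (0, 0, 100, 300),
    2: (200, 0, 300, 300),
    3: (400, 0, 500, 300),  # facing away: no face detected for this person
}
faces = [(220, 10, 280, 80), (20, 10, 80, 80)]
matches = match_faces_to_persons(persons, faces)
```

Because a face box is much smaller than a person box, the IoU values are low in absolute terms (0.14 in this example), so the comparison is relative: the best-overlapping face wins.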
- Stable identity - Face identity linked to tracked person
- Identity caching - Last known identity preserved per track
- Handles occlusions - Identity survives temporary face loss
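Identity caching can be pictured as a small per-track lookup that is only overwritten when a new recognition result arrives. The `IdentityCache` class below is a hypothetical sketch of that idea; the source does not describe how the cache is actually stored or expired.

```python
from typing import Dict, Optional


class IdentityCache:
    """Remember the last known identity for each track ID."""

    def __init__(self) -> None:
        self._names: Dict[int, str] = {}

    def update(self, track_id: int, name: Optional[str]) -> Optional[str]:
        """Store a fresh identity if one is given, then return the cached one.

        Pass name=None for frames where the person's face was not
        detected or not recognized.
        """
        if name is not None:
            self._names[track_id] = name
        return self._names.get(track_id)


cache = IdentityCache()
seen = cache.update(1, "Alice")       # face recognized in this frame
occluded = cache.update(1, None)      # face hidden, identity comes from the cache
never_seen = cache.update(2, None)    # no face has ever been linked to track 2
```

Track 1 keeps the name "Alice" (an example value) while the face is hidden, and track 2 stays unidentified until a face is matched to it.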
Each detected face object contains:
| Property | Type | Description |
|---|---|---|
| Bounding Box | Array | [x1, y1, x2, y2] coordinates |
| Detection Score | Float | Detection confidence (0-1) |
| Embedding | Array | 512-d face embedding |
| Landmarks (5) | Array | 5-point facial landmarks |
5-point landmarks include:
- Left eye center
- Right eye center
- Nose tip
- Left mouth corner
- Right mouth corner
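For reference, the face object can be modeled as a small data class. `DetectedFace` is a hypothetical stand-in written for illustration; InsightFace's own face objects expose comparable fields (`bbox`, `det_score`, `kps`, `embedding`) as NumPy arrays, whereas plain Python lists are used here to keep the sketch dependency-free. The landmark order follows the list above.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Landmark order as listed in this section.
LANDMARK_NAMES = (
    "left_eye",
    "right_eye",
    "nose_tip",
    "left_mouth_corner",
    "right_mouth_corner",
)
EMBEDDING_SIZE = 512


@dataclass
class DetectedFace:
    """Hypothetical container mirroring the properties table."""
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    det_score: float                         # detection confidence, 0-1
    embedding: List[float]                   # 512-d face embedding
    landmarks: List[Tuple[float, float]]     # five (x, y) points

    def __post_init__(self) -> None:
        if len(self.landmarks) != len(LANDMARK_NAMES):
            raise ValueError("expected 5 landmarks")
        if len(self.embedding) != EMBEDDING_SIZE:
            raise ValueError("expected a 512-d embedding")

    def landmark(self, name: str) -> Tuple[float, float]:
        """Look up a landmark by name, e.g. 'nose_tip'."""
        return self.landmarks[LANDMARK_NAMES.index(name)]


face = DetectedFace(
    bbox=(100, 80, 180, 190),
    det_score=0.93,
    embedding=[0.0] * EMBEDDING_SIZE,  # placeholder values
    landmarks=[(120, 110), (160, 110), (140, 135), (125, 160), (155, 160)],
)
nose = face.landmark("nose_tip")
```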
Enable GPU for faster detection on both YOLOv8 and InsightFace.
| Hardware | YOLOv8 | InsightFace | Combined |
|---|---|---|---|
| CPU (i7) | ~50ms | ~100ms | ~150ms per frame |
| GPU (RTX 3060) | ~10ms | ~20ms | ~30ms per frame |
| GPU (RTX 4090) | ~3ms | ~5ms | ~8ms per frame |
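The combined latencies translate into a rough upper bound on frame rate. The calculation below assumes the two detectors run back to back on every frame and ignores capture, tracking, recognition, and drawing time, so real throughput will be lower.

```python
def frames_per_second(latency_ms: float) -> float:
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms


# Combined per-frame latencies from the table above.
combined_latency_ms = {
    "CPU (i7)": 150,
    "GPU (RTX 3060)": 30,
    "GPU (RTX 4090)": 8,
}

throughput = {
    hardware: round(frames_per_second(ms), 1)
    for hardware, ms in combined_latency_ms.items()
}
```

This gives roughly 6.7 FPS on the CPU, 33.3 FPS on the RTX 3060, and 125 FPS on the RTX 4090.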
| Status | Color |
|---|---|
| Known person | Green |
| Unknown | Red |
- Person bounding box (YOLOv8)
- Track ID (e.g., "#1")
- Person name
- Confidence score
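As an illustration of how these overlay elements might be assembled, the sketch below builds the label text and picks the box color for one tracked person. `overlay_style` is a hypothetical helper; the label layout, the "Unknown" placeholder, and the BGR color order (as used by OpenCV drawing functions) are assumptions, and the source does not say which confidence value is displayed.

```python
from typing import Optional, Tuple

Color = Tuple[int, int, int]

GREEN_BGR: Color = (0, 255, 0)  # known person
RED_BGR: Color = (0, 0, 255)    # unknown person


def overlay_style(
    track_id: int,
    name: Optional[str],
    confidence: float,
    show_track_id: bool = True,
) -> Tuple[str, Color]:
    """Return (label text, box color) for one tracked person."""
    color = GREEN_BGR if name else RED_BGR
    parts = []
    if show_track_id:
        parts.append(f"#{track_id}")
    parts.append(name or "Unknown")
    parts.append(f"{confidence:.2f}")
    return " ".join(parts), color


known_label, known_color = overlay_style(1, "Alice", 0.87)
unknown_label, unknown_color = overlay_style(2, None, 0.42, show_track_id=False)
```

The `show_track_id` flag mirrors the Show Track ID setting: when it is off, the label omits the "#1"-style prefix.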
- Minimum resolution: 640x480
- Face should be at least 50 pixels
- Even lighting preferred
- Avoid heavy shadows
- System handles multiple people per frame
- Each person tracked independently
- Each face matched to the person with the highest IoU
- Profile faces may not be detected reliably
- Occluded faces (masks) may have lower scores
- Very small faces filtered out
- Person without visible face still tracked (identity cached)