
VisionLog Overview

High-performance modular AI platform for vision-based monitoring and analytics

VisionLog is a modular AI platform designed for advanced computer vision monitoring, forensic analysis, and real-time behavioral intelligence. By leveraging state-of-the-art deep learning models, VisionLog provides a flexible ecosystem for various specialized operations, including automated attendance tracking, security monitoring, and identity search.

AI Attendance Tracking

AI Attendance is a primary module within the VisionLog platform. It automates attendance logging using CCTV cameras, webcams, RTSP streams, or video files with AI-powered face recognition. The system uses a dual-stage pipeline: YOLOv8 for person detection and tracking, combined with InsightFace (buffalo_l model) for face detection and recognition.

Block Diagram

Live Vision Operation

The Live Vision module offers real-time situational awareness by processing live streams from CCTV, IP cameras, or webcams. It displays an annotated video feed with bounding boxes, track IDs, and an identification sidebar showcasing recognized individuals alongside real-time analytics like total person count and active track count.

Target Search

This forensic tool lets operators locate specific individuals across entire video archives or live feeds. Given an uploaded reference photo, the system uses InsightFace ArcFace embeddings to scan for matches, triggers real-time alerts for high-similarity detections, and maintains detailed track logs with timestamps.

Real-time Person Tracking

A unified pipeline combining person detection (YOLOv8) and robust tracking (ByteTrack) with facial recognition. The system maintains persistent "Track IDs" even when faces are temporarily obscured or turned away, using an advanced Identity Cache to ensure identity continuity across the entire tracking duration.
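One way to picture the Identity Cache is as a mapping from track ID to the best identity seen so far on that track. The sketch below illustrates that idea only; it is an assumption about the behaviour, not VisionLog's actual code. The names `IdentityCache` and `CachedIdentity` and the 0.5 confidence threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CachedIdentity:
    name: str
    confidence: float


class IdentityCache:
    """Hypothetical cache: remembers the best-scoring name seen for each track ID."""

    def __init__(self, min_confidence: float = 0.5) -> None:
        self.min_confidence = min_confidence  # assumed threshold, not from the docs
        self._by_track: Dict[int, CachedIdentity] = {}

    def update(self, track_id: int, name: Optional[str],
               confidence: float = 0.0) -> Optional[str]:
        """Record one recognition result (name=None when no face was matched).

        Returns the identity currently associated with the track, if any.
        """
        cached = self._by_track.get(track_id)
        if name is not None and confidence >= self.min_confidence:
            if cached is None or confidence > cached.confidence:
                cached = CachedIdentity(name, confidence)
                self._by_track[track_id] = cached
        return cached.name if cached else None

    def drop(self, track_id: int) -> None:
        """Forget a track once the tracker reports it as lost."""
        self._by_track.pop(track_id, None)


cache = IdentityCache()
cache.update(7, "Alice_Sharma", 0.91)  # face visible and recognised
label = cache.update(7, None)          # face turned away: cached name is reused
print(label)
```

Keying the cache on the track ID rather than on the face means the label survives for as long as the tracker keeps the person box alive, which matches the continuity behaviour described above.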

Key Features

  • Multi-Source Input - Supports live camera streams, webcams, RTSP streams, and offline video files (MP4, AVI, MKV)
  • YOLOv8 Person Tracking - Persistent track IDs across frames with ByteTrack
  • InsightFace Engine - Uses SCRFD for face detection and ArcFace for recognition (buffalo_l model)
  • Folder-Based Enrollment - Organize person images in folders named by person for easy batch enrollment
  • Real-time Recognition - Process camera feed or video files with live face detection
  • Attendance Tracking - Automatic logging with identity caching and duplicate prevention
  • Zero-Lag RTSP - Multi-threaded IP camera processing with auto-reconnect
  • Comprehensive Logging - CSV logs with frame, timestamp, track ID, name, and confidence

How It Works

The dual-stage pipeline (person tracking, then face recognition) breaks down into the following steps:

| Step | Component | Technology | Description |
|------|-----------|------------|-------------|
| 1 | Enrollment | InsightFace | Create face database by enrolling persons from folders |
| 2 | Video Input | OpenCV | Capture from camera, video files, or RTSP streams |
| 3 | Person Detection | YOLOv8 | Detect all people in frame with bounding boxes |
| 4 | Tracking | ByteTrack | Assign persistent track IDs across frames |
| 5 | Face Detection | InsightFace SCRFD | Detect faces in full frame |
| 6 | Face Recognition | InsightFace ArcFace | Generate 512-d embeddings, match against database |
| 7 | IoU Matching | NumPy | Link detected faces to tracked persons |
| 8 | Attendance Logging | CSV | Log recognized persons with timestamps and track IDs |
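The IoU matching step can be sketched as follows. This is a minimal illustration in plain Python (the platform lists NumPy for this step), and it assumes boxes in `(x1, y1, x2, y2)` pixel form. The function names and the rule "pick the person box with the highest non-zero IoU" are assumptions, not the documented implementation.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_faces_to_tracks(faces: List[Box],
                         persons: Dict[int, Box]) -> List[Optional[int]]:
    """For each face box, return the track ID of the best-overlapping person box."""
    links: List[Optional[int]] = []
    for face in faces:
        best_id, best_score = None, 0.0
        for track_id, person in persons.items():
            score = iou(face, person)
            if score > best_score:
                best_id, best_score = track_id, score
        links.append(best_id)  # None when the face overlaps no tracked person
    return links


persons = {7: (100, 50, 200, 400), 12: (300, 60, 420, 410)}
faces = [(130, 60, 170, 110), (340, 70, 385, 125), (600, 80, 640, 130)]
links = link_faces_to_tracks(faces, persons)
print(links)  # [7, 12, None]
```

Because a face box is much smaller than a person box, the raw IoU values are small; only their relative order matters here. A containment ratio (intersection divided by face area) is a common alternative when a fixed threshold is needed.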

System Architecture

The architecture consists of modular components for different functions:

| Module | Purpose |
|--------|---------|
| Face Tracker | YOLOv8 person detection with ByteTrack |
| Face Engine | InsightFace detection & recognition |
| Enrollment | Folder-based batch face enrollment |
| Recognition | Real-time face recognition |
| Attendance | Attendance tracking with identity caching |
| Video Processor | Video/webcam/RTSP processing |
| Configuration | Centralized settings |

Technology Stack

| Component | Technology |
|-----------|------------|
| Person Detection | YOLOv8 (Ultralytics) |
| Person Tracking | ByteTrack |
| Face Detection | InsightFace SCRFD |
| Face Recognition | InsightFace ArcFace (buffalo_l model) |
| Embedding Storage | Pickle (embeddings.pkl) |
| Image Processing | OpenCV |
| Logging | CSV format |
| GPU Support | CUDA (optional) |

Experimental Lab

Beyond the core attendance and surveillance modules, VisionLog includes an Experimental Lab — a dedicated section for advanced computer vision features under active development. These use cases explore capabilities that extend the platform into broader AI-vision domains.

Access the Experimental Lab via the Lab button in the main sidebar, or navigate directly to /experimental.


Face Draw (Randomizer)

An AI-powered participant randomizer that detects all faces in an uploaded group photo and uses them as entries in an animated draw. The system analyzes the image via the recognition engine, extracts individual face crops, and lets operators run a fair random selection using one of five animated draw methods.

| Draw Method | Description |
|-------------|-------------|
| Spin Wheel | Classic spinning wheel with each face as a segment |
| Spotlight | Sweeping spotlight that slows to land on the winner |
| Bubble Pop | Face bubbles eliminated one-by-one |
| Black Hole | Vortex pulls participants in until one remains |
| Cyber Lock | Futuristic targeting lock-on mechanism |

External participants (not in the photo) can also be added manually.
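Whatever the animation, the selection itself reduces to a uniform random pick over the entries. The sketch below shows one way to do that with Python's `secrets` module; the function names and the entry labels are hypothetical, and the documentation does not state which random source VisionLog uses.

```python
import secrets
from typing import List, Sequence


def draw_winner(participants: Sequence[str]) -> str:
    """Pick one participant uniformly at random using the OS entropy source."""
    if not participants:
        raise ValueError("at least one participant is required")
    return secrets.choice(participants)


def elimination_order(participants: Sequence[str]) -> List[str]:
    """Return a random elimination order; the last entry is the one that remains."""
    pool = list(participants)
    secrets.SystemRandom().shuffle(pool)
    return pool


# Face crops detected in the photo plus one manually added external participant.
entries = ["face_0", "face_1", "face_2", "Guest (manual)"]
winner = draw_winner(entries)
order = elimination_order(entries)
print(winner, order)
```

An elimination order suits the draw styles that remove entries one at a time, while a single pick is enough for the wheel or spotlight styles.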


Cross-Camera Tracker

Tracks a target individual across multiple simultaneous camera feeds using biometric identity matching. A reference image is used to scan live or recorded footage from different sources at once, enabling persistent identity tracking even when the subject moves between camera zones.

Key capabilities:

  • Multi-feed simultaneous identity search
  • Reference-image based biometric matching
  • Cross-zone continuity tracking

Sign to Text

A real-time American Sign Language (ASL) recognition system using the device webcam and MediaPipe hand landmark detection. The system captures hand gestures frame-by-frame and converts them into text output with a stability filter to prevent false commits.

Supports:

  • 16 ASL letters — A, B, C, D, F, I, K, L, O, R, S, U, V, W, X, Y
  • 7 word gestures — HELLO, GOOD, BAD, I LOVE YOU, STOP, ROCK ON, OK
  • Live detection history and clipboard export
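A stability filter of the kind described above can be sketched as a small state machine that commits a gesture only after it has been predicted on several consecutive frames. The class name `StabilityFilter` and the `hold_frames` value are assumptions for illustration; the per-frame labels would come from the MediaPipe-based classifier, which is not shown.

```python
from typing import List, Optional


class StabilityFilter:
    """Commit a gesture only after `hold_frames` consecutive identical predictions."""

    def __init__(self, hold_frames: int = 5) -> None:
        self.hold_frames = hold_frames  # assumed value; tune for the frame rate
        self._candidate: Optional[str] = None
        self._streak = 0
        self._committed = False

    def push(self, label: Optional[str]) -> Optional[str]:
        """Feed one per-frame prediction; return the label once when it becomes stable."""
        if label != self._candidate:
            # A different prediction restarts the count.
            self._candidate, self._streak, self._committed = label, 0, False
        self._streak += 1
        if (label is not None and not self._committed
                and self._streak >= self.hold_frames):
            self._committed = True
            return label
        return None


# None stands for "no hand detected" on that frame.
stream = ["A", "A", "B", "A", "A", "A", None, "L", "L", "L"]
flt = StabilityFilter(hold_frames=3)
text: List[str] = [out for frame in stream if (out := flt.push(frame)) is not None]
print("".join(text))  # AL
```

The single stray "B" never reaches three frames, so it is dropped, and a held gesture is committed only once until the prediction changes.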

Folder Person Search

Scans a server-side folder of images for a specific individual using a reference photo. The InsightFace ArcFace engine generates 512-dimensional embeddings to compare faces across all images in the dataset, returning matches ranked by similarity score.

Key capabilities:

  • Reference-photo based biometric search across an image archive
  • Configurable similarity threshold
  • Displays matched image paths and confidence scores
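The ranking step can be illustrated with cosine similarity over precomputed embeddings. This is a sketch under assumptions: the 3-dimensional vectors stand in for 512-dimensional ArcFace embeddings, the file paths are made up, and `rank_matches` with its default threshold is a hypothetical helper rather than the platform's API.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_matches(reference: Sequence[float],
                 gallery: Dict[str, Sequence[float]],
                 threshold: float = 0.4) -> List[Tuple[str, float]]:
    """Return (path, score) pairs at or above the threshold, best match first."""
    scored = [(path, cosine_similarity(reference, emb))
              for path, emb in gallery.items()]
    kept = [m for m in scored if m[1] >= threshold]
    return sorted(kept, key=lambda m: m[1], reverse=True)


reference = [1.0, 0.0, 0.0]  # toy stand-in for the reference-photo embedding
gallery = {
    "archive/img_001.jpg": [0.9, 0.1, 0.0],
    "archive/img_002.jpg": [0.0, 1.0, 0.0],
    "archive/img_003.jpg": [0.7, 0.7, 0.0],
}
matches = rank_matches(reference, gallery, threshold=0.5)
for path, score in matches:
    print(f"{path}  {score:.3f}")
```

Raising the threshold trades recall for precision: fewer false matches are listed, but genuine matches taken at difficult angles may drop out.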

An extension of Folder Person Search where the reference face is captured live from the webcam instead of an uploaded image. The operator takes a snapshot, which is then used as the search query against a pre-indexed server-side folder.

Key capabilities:

  • Live webcam capture as the identity reference
  • Same InsightFace ArcFace biometric engine as Folder Person Search
  • Useful for on-site identification without needing a stored reference image

Photo Batch Analysis (Folder Identity Lab)

Processes a folder of images to discover and cluster all unique individuals present across the entire collection. The system runs face detection and clustering on every image and produces a person-first view — showing every photo in which each unique face appears.

Supports two sources:

  • Local Folder — Upload images directly from the browser
  • Server Folder — Point to a directory already indexed on the backend

Output includes: Unique person count, total face count, known-vs-unknown breakdown, per-person photo gallery.


A natural-language image search engine powered by Google Gemini AI. Instead of face-matching, operators describe what they're looking for in plain text (e.g. "photo with the most people", "image with happiest faces") and the system scores every image in a server folder against the query.

Key capabilities:

  • Free-text semantic search over image archives
  • Google Gemini Vision API integration (requires API key)
  • Results ranked with a match score (0–10) and AI-generated reasoning per image

Automatic Focus Detection

Evaluates the sharpness of uploaded face images using a three-metric ensemble (Tenengrad, Crété-Roffet blur score, FFT high-frequency ratio). It outputs a normalized 0–100 focus quality score designed to be robust to content and scale biases, automatically rejecting blurry enrollment photos and prioritizing eye-level sharpness.
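To give a feel for one of the three metrics, the sketch below computes a Tenengrad score (mean squared Sobel gradient magnitude) on a tiny grayscale image stored as nested lists. It covers only this one metric: the blur score, the FFT ratio, and the 0–100 normalization are not documented here and are not reproduced. The helper `box_blur` exists only to produce a softer test image.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values


def tenengrad(img: Image) -> float:
    """Mean squared Sobel gradient magnitude over the interior pixels."""
    h, w = len(img), len(img[0])
    total = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            total += gx * gx + gy * gy
    return total / ((h - 2) * (w - 2))


def box_blur(img: Image) -> Image:
    """3x3 mean filter (neighbourhood truncated at the borders), used as a test blur."""
    h, w = len(img), len(img[0])
    out: Image = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


# A sharp vertical edge: left half black, right half white.
sharp = [[0.0] * 4 + [255.0] * 4 for _ in range(8)]
blurred = box_blur(sharp)
sharp_score, blurred_score = tenengrad(sharp), tenengrad(blurred)
print(sharp_score > blurred_score)  # True
```

Blurring spreads the edge over more pixels with smaller gradients, so the squared-gradient average falls; that drop is what a focus score picks up.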


AI Image Editor

An AI-assisted image editing tool for performing operations like cropping, annotation, and face-aware adjustments on images. Integrated into the VisionLog pipeline to enable lightweight in-browser editing before analysis or enrollment.



Emotion Detection

Evaluates facial expressions in real-time or within static images using the DeepFace model. Integrating face detection with expression inference, it categorizes emotions (Happiness, Sadness, Anger, Surprise, Neutral) to append contextual emotional metadata to standard biometric identity logs.


Duplicate Selector

An operational tool designed to refine biometric datasets by surfacing potential duplicate images. It identifies image clusters with exceptionally high cosine similarity (>0.95), allowing human operators to manually resolve identity collisions (e.g., merge identities, delete redundant files, or explicitly differentiate subjects like twins).
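The surfacing step can be sketched as an all-pairs comparison that keeps pairs above the 0.95 cosine-similarity threshold. The toy 3-dimensional vectors, the file names, and the `duplicate_candidates` helper are assumptions for illustration; grouping pairs into larger clusters and the operator's merge/delete workflow are left out.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def normalise(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def duplicate_candidates(embeddings: Dict[str, Sequence[float]],
                         threshold: float = 0.95) -> List[Tuple[str, str, float]]:
    """Return (image_a, image_b, similarity) for every pair above the threshold."""
    unit = {name: normalise(vec) for name, vec in embeddings.items()}
    pairs = []
    for a, b in combinations(sorted(unit), 2):
        sim = sum(x * y for x, y in zip(unit[a], unit[b]))
        if sim > threshold:
            pairs.append((a, b, sim))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


embeddings = {
    "alice/front.jpg": [0.98, 0.10, 0.05],
    "alice/front_copy.jpg": [0.97, 0.12, 0.05],
    "bob/office.jpg": [0.10, 0.95, 0.20],
}
pairs = duplicate_candidates(embeddings)
for a, b, sim in pairs:
    print(f"{a} <-> {b}  similarity={sim:.3f}")
```

The comparison is quadratic in the number of images, which is acceptable for a review tool over a modest dataset; the final decision stays with the operator, since near-identical embeddings can also come from different people such as twins.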


Quick Start

Scenario: An organisation wants to automate daily attendance for 50 employees using a single CCTV camera at the main entrance, and also be able to search footage when a visitor is reported missing.


Step 1 — Build the Face Database

Before any recognition can happen, every employee's face must be enrolled. Capture 3–5 clear photos of each person and organise them into a folder structure where each sub-folder is named after the person:

enrollment_photos/
├── Alice_Sharma/
│   ├── front.jpg
│   └── profile.jpg
├── Bob_Menon/
│   └── bob_office.jpg
└── ...

Navigate to Training & Database → Enroll Faces and point the system at this root folder. The enrollment module extracts 512-dimensional ArcFace embeddings for each detected face and stores them in embeddings.pkl. This only needs to be done once per person (or whenever their appearance changes significantly).
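The folder convention above can be read with a few lines of standard-library Python. This sketch only collects the image paths per person; embedding extraction and writing embeddings.pkl are not shown. The function name `collect_enrollment_images` and the accepted file extensions are assumptions.

```python
import tempfile
from pathlib import Path
from typing import Dict, List

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of accepted formats


def collect_enrollment_images(root: Path) -> Dict[str, List[Path]]:
    """Map each person (sub-folder name) to the image files found inside it."""
    people: Dict[str, List[Path]] = {}
    for person_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        images = sorted(f for f in person_dir.iterdir()
                        if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES)
        if images:  # skip folders with no usable photos
            people[person_dir.name] = images
    return people


# Demo: recreate the example layout with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "enrollment_photos"
    for rel in ("Alice_Sharma/front.jpg", "Alice_Sharma/profile.jpg",
                "Bob_Menon/bob_office.jpg"):
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    found = collect_enrollment_images(root)
    summary = {name: [p.name for p in paths] for name, paths in found.items()}

print(summary)
```

Using the folder name as the identity label keeps enrollment free of a separate metadata file: renaming a folder renames the person at the next enrollment run.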

See the Enrollment guide for field configuration and tips on photo quality.


Step 2 — Start the Live Attendance Feed

With the face database ready, go to Live Vision → Live Stream and connect the entrance CCTV (either via webcam device index or an RTSP URL):

rtsp://admin:password@192.168.1.64:554/stream1
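OpenCV's `cv2.VideoCapture` accepts either an integer device index or a string (file path or stream URL), so a source field typed by the operator has to be interpreted before the capture is opened. The helpers below are a hypothetical sketch of that interpretation using only the standard library; they do not open any stream, and defaulting to port 554 when none is given is an assumption.

```python
from typing import Union
from urllib.parse import urlsplit


def parse_source(source: str) -> Union[int, str]:
    """Return an int for a webcam device index, otherwise the string unchanged."""
    text = source.strip()
    return int(text) if text.isdigit() else text


def describe_source(source: str) -> str:
    """Human-readable summary of the source (credentials are left out)."""
    parsed = parse_source(source)
    if isinstance(parsed, int):
        return f"webcam device {parsed}"
    parts = urlsplit(parsed)
    if parts.scheme == "rtsp":
        return f"RTSP stream at {parts.hostname}:{parts.port or 554}{parts.path}"
    return f"video file {parsed}"


for raw in ("0", "rtsp://admin:password@192.168.1.64:554/stream1",
            "recordings/entrance.mp4"):
    print(describe_source(raw))
```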

The system begins the dual-stage pipeline immediately — YOLOv8 locates every person in each frame, ByteTrack assigns persistent track IDs, and InsightFace matches faces against the enrolled database. Recognised employees appear in the identification sidebar with their name, confidence score, and first-seen timestamp.

Attendance is logged automatically. Each recognised person is written to a CSV file (frame number, timestamp, track ID, name, confidence) with duplicate prevention — the same person won't be logged again until they leave and re-enter the frame.
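The duplicate-prevention rule can be sketched as a set of names currently considered present: a row is written only when a name is not in the set, and the name is removed again when the person leaves the frame. The class `AttendanceLogger`, its method names, and the lower-case column names are assumptions; how VisionLog decides that someone has left is not specified here.

```python
import csv
import io
from typing import Set, TextIO

FIELDS = ["frame", "timestamp", "track_id", "name", "confidence"]


class AttendanceLogger:
    """Hypothetical CSV logger that writes one row per entry into the frame."""

    def __init__(self, stream: TextIO) -> None:
        self._writer = csv.DictWriter(stream, fieldnames=FIELDS)
        self._writer.writeheader()
        self._present: Set[str] = set()  # names currently considered in frame

    def sighting(self, frame: int, timestamp: str, track_id: int,
                 name: str, confidence: float) -> bool:
        """Log the person unless already marked present; True if a row was written."""
        if name in self._present:
            return False
        self._present.add(name)
        self._writer.writerow({"frame": frame, "timestamp": timestamp,
                               "track_id": track_id, "name": name,
                               "confidence": confidence})
        return True

    def left_frame(self, name: str) -> None:
        """Allow the person to be logged again on their next appearance."""
        self._present.discard(name)


buffer = io.StringIO()  # stands in for the attendance CSV file
log = AttendanceLogger(buffer)
log.sighting(1482, "09:03:14", 7, "Alice_Sharma", 0.91)   # written
log.sighting(1483, "09:03:14", 7, "Alice_Sharma", 0.92)   # suppressed duplicate
log.left_frame("Alice_Sharma")
log.sighting(9000, "12:30:02", 31, "Alice_Sharma", 0.89)  # re-entry, written
rows = buffer.getvalue().splitlines()
print(rows)
```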


Step 3 — Review Attendance Logs

At the end of the day, open the attendance CSV from the configured output directory. Each row represents a unique sighting:

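| Frame | Timestamp | Track ID | Name | Confidence |
|-------|-----------|----------|------|------------|
| 1482 | 09:03:14 | 7 | Alice_Sharma | 0.91 |
| 2031 | 09:07:52 | 12 | Bob_Menon | 0.88 |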

Cross-reference this with your HR system to generate a daily attendance report. Employees who appear zero times in the log were absent.
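The cross-reference is a set difference between the employee roster and the names that appear in the log. The sketch below assumes a CSV with a `name` column and uses an in-memory string in place of the real file; the third roster entry, `Chitra_Rao`, is a made-up employee added so that the example has an absentee.

```python
import csv
import io
from typing import Iterable, Set, TextIO


def absentees(roster: Iterable[str], log_file: TextIO) -> Set[str]:
    """Return roster names that never appear in the attendance log."""
    seen = {row["name"] for row in csv.DictReader(log_file)}
    return set(roster) - seen


# Stand-in for the day's attendance CSV (column names are assumed).
attendance_csv = io.StringIO(
    "frame,timestamp,track_id,name,confidence\n"
    "1482,09:03:14,7,Alice_Sharma,0.91\n"
    "2031,09:07:52,12,Bob_Menon,0.88\n"
)
roster = ["Alice_Sharma", "Bob_Menon", "Chitra_Rao"]
absent = sorted(absentees(roster, attendance_csv))
print(absent)  # ['Chitra_Rao']
```

A name missing from the log means the person was not recognised on camera that day, so a manual check may be worthwhile before marking someone absent in the HR system.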


Step 4 — Search for a Specific Individual (Forensic Use)

Suppose security needs to confirm when a particular visitor was on-site. Go to Video Search → Target Search, upload a reference photo of the individual, and point the system at the recorded footage file. VisionLog will scan every frame, surface all timestamps where that face appears, and display a track log with confidence scores.

For offline batch use across a folder of images, use ⚗ Lab → Folder Person Search — upload the reference and select the image archive. Results are ranked by similarity score.


For detailed configuration of models, thresholds, and RTSP settings, see the Architecture and Enrollment pages.
