
Multimodal Sentiment Detector

Emotion analysis system that combines facial expressions, vocal tone, and spoken content to detect human emotions with 89% accuracy.

Project Overview

A comprehensive multimodal sentiment detection system that analyzes uploaded videos to identify human emotions by combining computer vision, audio processing, and natural language understanding.

Key Capabilities:

  • Real-time facial expression recognition
  • Voice tone and prosody analysis
  • Contextual understanding from speech
  • Multi-modal fusion for improved accuracy
  • Support for multiple video formats
  • Confidence scoring and uncertainty quantification

Technical Architecture

Computer Vision Pipeline

Uses MediaPipe for face detection and landmark extraction, followed by a CNN trained on facial expression datasets. The pipeline processes video at 30 FPS, fast enough for real-time use.
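As an illustration, here is a minimal sketch of the detection stage using MediaPipe's face-detection solution; the helper name, the 48×48 crop size, and the frame cap are illustrative assumptions, not the project's actual code:

```python
import cv2
import mediapipe as mp
import numpy as np

mp_face_detection = mp.solutions.face_detection

def extract_face_crops(video_path, max_frames=300):
    """Detect one face per frame and return cropped faces for the CNN.
    (Hypothetical helper; crop size and frame cap are illustrative.)"""
    crops = []
    cap = cv2.VideoCapture(video_path)
    with mp_face_detection.FaceDetection(model_selection=0,
                                         min_detection_confidence=0.5) as detector:
        while cap.isOpened() and len(crops) < max_frames:
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV decodes frames as BGR
            results = detector.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if not results.detections:
                continue
            h, w, _ = frame.shape
            box = results.detections[0].location_data.relative_bounding_box
            x, y = int(box.xmin * w), int(box.ymin * h)
            bw, bh = int(box.width * w), int(box.height * h)
            crop = frame[max(y, 0):y + bh, max(x, 0):x + bw]
            if crop.size:
                crops.append(cv2.resize(crop, (48, 48)))  # common FER input size
    cap.release()
    return np.array(crops)
```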

Audio Processing

Extracts MFCC features, spectrograms, and prosodic features such as pitch and energy using Librosa. These acoustic features are combined with a speech-to-text transcript, which is encoded with BERT embeddings for contextual analysis.
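A minimal sketch of the acoustic feature extraction with Librosa might look like the following; the helper name and parameter choices (16 kHz sample rate, 40 MFCCs) are assumptions for illustration:

```python
import librosa
import numpy as np

def extract_audio_features(wav_path, sr=16000, n_mfcc=40):
    """Compute MFCCs, a log-mel spectrogram, and simple prosodic features.
    (Hypothetical helper; sample rate and MFCC count are illustrative.)"""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)          # (n_mels, frames)
    log_mel = librosa.power_to_db(mel)
    # Prosody: fundamental frequency (pitch) and frame-level energy
    f0, voiced_flag, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                      fmax=librosa.note_to_hz("C7"))
    rms = librosa.feature.rms(y=y)[0]
    return {
        "mfcc": mfcc.mean(axis=1),
        "log_mel": log_mel,
        "pitch_mean": np.nanmean(f0),   # f0 is NaN in unvoiced frames
        "energy_mean": rms.mean(),
    }
```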

Fusion Strategy

A late-fusion approach combines predictions from the visual, audio, and text modalities using a weighted ensemble with a learned attention mechanism.
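The project's exact fusion head isn't shown here; below is one plausible Keras realization of learned-attention late fusion over per-modality class probabilities. The layer names are hypothetical, and the seven-class assumption comes from the metrics section:

```python
import tensorflow as tf

NUM_CLASSES = 7  # matches the seven emotion classes reported below

def build_fusion_head():
    """Learned-attention late fusion over per-modality class probabilities.
    (Sketch; layer and input names are illustrative.)"""
    vis = tf.keras.Input(shape=(NUM_CLASSES,), name="visual_probs")
    aud = tf.keras.Input(shape=(NUM_CLASSES,), name="audio_probs")
    txt = tf.keras.Input(shape=(NUM_CLASSES,), name="text_probs")

    # Attention scores for the three streams, conditioned on all predictions
    concat = tf.keras.layers.Concatenate()([vis, aud, txt])
    attn = tf.keras.layers.Dense(3, activation="softmax",
                                 name="modality_attention")(concat)
    attn = tf.keras.layers.Reshape((3, 1))(attn)             # (batch, 3, 1)

    stacked = tf.keras.layers.Lambda(
        lambda t: tf.stack(t, axis=1),
        name="stack_modalities")([vis, aud, txt])            # (batch, 3, classes)

    # A convex combination of probability vectors is itself a valid
    # distribution, so no extra softmax is needed on the fused output.
    fused = tf.keras.layers.Lambda(
        lambda t: tf.reduce_sum(t[0] * t[1], axis=1),
        name="weighted_sum")([stacked, attn])
    return tf.keras.Model([vis, aud, txt], fused)
```

At inference, each modality model would emit a 7-way probability vector, and this head produces the final fused prediction.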

Performance Metrics

  • Overall Accuracy: 89%
  • F1 Score: 0.92
  • Inference Time: 30 ms
  • Emotion Classes: 7

Technologies Used

OpenCV, MediaPipe, TensorFlow, Librosa, BERT, FastAPI, Python, NumPy