
Multimodal Sentiment Detector

Emotion analysis system that combines facial expressions, vocal tone, and spoken content to detect human emotions with 89% accuracy.

Project Overview

A comprehensive multimodal sentiment detection system that analyzes uploaded videos to identify human emotions by combining computer vision, audio processing, and natural language understanding.

Key Capabilities:

  • Real-time facial expression recognition
  • Voice tone and prosody analysis
  • Contextual understanding from speech
  • Multi-modal fusion for improved accuracy
  • Support for multiple video formats
  • Confidence scoring and uncertainty quantification

Technical Architecture

Computer Vision Pipeline

Uses MediaPipe for face detection and landmark extraction, followed by a CNN trained on facial expression datasets. The pipeline processes video at 30 FPS, fast enough for real-time use.
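As an illustration, here is a minimal sketch of the detection stage using MediaPipe's face-detection solution; the helper name, the 48×48 crop size, and the frame cap are illustrative assumptions, not the project's actual code:

```python
import cv2
import mediapipe as mp
import numpy as np

mp_face_detection = mp.solutions.face_detection

def extract_face_crops(video_path, max_frames=300):
    """Detect one face per frame and return cropped faces for the CNN.
    (Hypothetical helper; crop size and frame cap are illustrative.)"""
    crops = []
    cap = cv2.VideoCapture(video_path)
    with mp_face_detection.FaceDetection(model_selection=0,
                                         min_detection_confidence=0.5) as detector:
        while cap.isOpened() and len(crops) < max_frames:
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV decodes frames as BGR
            results = detector.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if not results.detections:
                continue
            h, w, _ = frame.shape
            box = results.detections[0].location_data.relative_bounding_box
            x, y = int(box.xmin * w), int(box.ymin * h)
            bw, bh = int(box.width * w), int(box.height * h)
            crop = frame[max(y, 0):y + bh, max(x, 0):x + bw]
            if crop.size:
                crops.append(cv2.resize(crop, (48, 48)))  # common FER input size
    cap.release()
    return np.array(crops)
```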

Audio Processing

Extracts MFCC features, spectrograms, and prosodic features such as pitch and energy using Librosa. These acoustic features are combined with a speech-to-text transcript, which is encoded with BERT embeddings for contextual analysis.
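A minimal sketch of the acoustic feature extraction with Librosa might look like the following; the helper name and parameter choices (16 kHz sample rate, 40 MFCCs) are assumptions for illustration:

```python
import librosa
import numpy as np

def extract_audio_features(wav_path, sr=16000, n_mfcc=40):
    """Compute MFCCs, a log-mel spectrogram, and simple prosodic features.
    (Hypothetical helper; sample rate and MFCC count are illustrative.)"""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)          # (n_mels, frames)
    log_mel = librosa.power_to_db(mel)
    # Prosody: fundamental frequency (pitch) and frame-level energy
    f0, voiced_flag, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                      fmax=librosa.note_to_hz("C7"))
    rms = librosa.feature.rms(y=y)[0]
    return {
        "mfcc": mfcc.mean(axis=1),
        "log_mel": log_mel,
        "pitch_mean": np.nanmean(f0),   # f0 is NaN in unvoiced frames
        "energy_mean": rms.mean(),
    }
```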

Fusion Strategy

A late-fusion approach combines predictions from the visual, audio, and text modalities using a weighted ensemble with a learned attention mechanism.
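The project's exact fusion head isn't shown here; below is one plausible Keras realization of learned-attention late fusion over per-modality class probabilities. The layer names are hypothetical, and the seven-class assumption comes from the metrics section:

```python
import tensorflow as tf

NUM_CLASSES = 7  # matches the seven emotion classes reported below

def build_fusion_head():
    """Learned-attention late fusion over per-modality class probabilities.
    (Sketch; layer and input names are illustrative.)"""
    vis = tf.keras.Input(shape=(NUM_CLASSES,), name="visual_probs")
    aud = tf.keras.Input(shape=(NUM_CLASSES,), name="audio_probs")
    txt = tf.keras.Input(shape=(NUM_CLASSES,), name="text_probs")

    # Attention scores for the three streams, conditioned on all predictions
    concat = tf.keras.layers.Concatenate()([vis, aud, txt])
    attn = tf.keras.layers.Dense(3, activation="softmax",
                                 name="modality_attention")(concat)
    attn = tf.keras.layers.Reshape((3, 1))(attn)             # (batch, 3, 1)

    stacked = tf.keras.layers.Lambda(
        lambda t: tf.stack(t, axis=1),
        name="stack_modalities")([vis, aud, txt])            # (batch, 3, classes)

    # A convex combination of probability vectors is itself a valid
    # distribution, so no extra softmax is needed on the fused output.
    fused = tf.keras.layers.Lambda(
        lambda t: tf.reduce_sum(t[0] * t[1], axis=1),
        name="weighted_sum")([stacked, attn])
    return tf.keras.Model([vis, aud, txt], fused)
```

At inference, each modality model would emit a 7-way probability vector, and this head produces the final fused prediction.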

Performance Metrics

  • Overall Accuracy: 89%
  • F1 Score: 0.92
  • Inference Time: 30 ms
  • Emotion Classes: 7

Technologies Used

OpenCV, MediaPipe, TensorFlow, Librosa, BERT, FastAPI, Python, NumPy