SIMO Avatar
Vision-aware interactive AI avatar with voice, emotion detection, and gesture recognition. Built for Dubai AI Summit.
About the Project
SIMO is a real-time interactive AI avatar that sees, hears, thinks, and speaks. Built as a live demo for the Dubai AI Summit, it uses GPT-4 Vision and MediaPipe to detect faces, read emotions, and recognize hand gestures, and it automatically greets visitors as they approach. It holds natural voice conversations using Whisper for speech-to-text, GPT-4o-mini for responses, and ElevenLabs for voice synthesis, all synchronized with HeyGen avatar lip movements.
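The conversation loop described above can be sketched as a single turn through the pipeline. This is a minimal illustration, not SIMO's actual code: the four stages are injected as callables so that the real providers (Whisper, GPT-4o-mini, ElevenLabs, HeyGen) can be swapped in or stubbed out, and all names here are hypothetical.

```python
import asyncio

async def conversation_turn(audio_chunk, stt, llm, tts, avatar):
    """One voice-conversation turn: STT -> LLM -> TTS -> avatar playback.

    All four stages are injected callables, so providers (e.g. Whisper,
    GPT-4o-mini, ElevenLabs, HeyGen) can be swapped or stubbed in tests.
    """
    text = await stt(audio_chunk)    # speech-to-text (e.g. Whisper)
    reply = await llm(text)          # generate a response (e.g. GPT-4o-mini)
    speech = await tts(reply)        # synthesize voice (e.g. ElevenLabs)
    await avatar(speech)             # play with lip sync (e.g. HeyGen)
    return reply
```

Keeping each stage behind a plain async callable is what lets the same loop run over a WebSocket stream in production and against in-memory fakes in tests.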
Key Features
- GPT-4 Vision for real-time scene understanding, face detection, and badge reading
- MediaPipe + face-api.js for gesture recognition and emotion detection
- Real-time voice conversation via Whisper STT + ElevenLabs TTS + HeyGen lip sync
- Auto-greets approaching visitors and reacts to hand waves and thumbs-up gestures
- Multi-language support: English, Arabic, French
- WebSocket-based real-time communication pipeline
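As one concrete example of the gesture recognition above, a thumbs-up can be classified with a simple geometric heuristic over MediaPipe's 21 hand landmarks (normalized image coordinates, with y growing downward). The landmark indices follow MediaPipe's hand model; the heuristic itself is a hypothetical sketch, not SIMO's production classifier.

```python
# MediaPipe hand-landmark indices: 4 = thumb tip,
# 8/12/16/20 = fingertips, 6/10/14/18 = the matching PIP joints.
THUMB_TIP = 4
FINGER_TIPS = (8, 12, 16, 20)
FINGER_PIPS = (6, 10, 14, 18)

def is_thumbs_up(landmarks):
    """landmarks: sequence of 21 (x, y) pairs in normalized image coords.

    Heuristic: the thumb tip is the highest point of the hand (smallest y)
    and the other four fingers are curled (each tip below its PIP joint).
    """
    thumb_y = landmarks[THUMB_TIP][1]
    if any(thumb_y >= landmarks[i][1] for i in FINGER_TIPS):
        return False
    return all(landmarks[tip][1] > landmarks[pip][1]
               for tip, pip in zip(FINGER_TIPS, FINGER_PIPS))
```

A pure function over landmark coordinates like this is easy to unit-test with synthetic hands, independent of the camera and MediaPipe runtime.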
Impact
Demonstrated live at the Dubai AI Summit, SIMO showcases real-time multi-modal AI interaction, combining vision, voice, and natural language in a single coherent experience.
Tech Stack
- GPT-4 Vision and GPT-4o-mini (OpenAI)
- Whisper (speech-to-text)
- ElevenLabs (voice synthesis)
- HeyGen (avatar lip sync)
- MediaPipe and face-api.js (gesture and emotion detection)
- WebSockets (real-time communication)