title: VOICE SEMENTLE
emoji: 🎙️
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: 6.0.0
app_file: client/app.py
pinned: false
tags:
  - mcp-in-action-track-creative

🎙️ Voice Sementle

Daily voice puzzle game — guess the meme, song, or movie quote, but you have to SAY IT RIGHT!

It's not just what you say, it's how you say it. Your pitch, rhythm, energy, and pronunciation all matter.

🗓️ New puzzle every day • 🎭 3 genres (memes, songs, movies) • 🧠 AI hints that get smarter

📋 Submission Info


Track	MCP in Action — Creative
MCP Used	VoiceKit MCP
LLM	Google Gemini 2.5 Flash
Voice AI	ElevenLabs (Voice Cloning + TTS)
Framework	Gradio 6.0

📢 Social Post: View on LinkedIn
📢 Social Post: View on X
🎬 Demo Video: Watch (1-5 min)
👥 Team: @LisaVLee, @SabaPivot, @daheepk, @tchoi911, @Lucian25

✅ Track 2 Requirements

Requirement	How We Fulfill It
Autonomous Agent	Two agents: MCP Advisor (voice analysis) + Chatbot (text + audio hints)
MCP as Tools	VoiceKit MCP (`voicekit_analyze_voice_similarity`) for voice analysis
Gradio App	Built with Gradio 6.0
Tool Calling	Chatbot autonomously calls `generate_audio_hint` → ElevenLabs TTS

🎮 How It Works

1. 🎯 Daily puzzle loads (meme / song / movie quote)
2. 🎤 You record your voice guess
3. 🔊 VoiceKit MCP analyzes your recording: pitch, rhythm, energy, pronunciation, transcript
4. 🧠 Gemini agent generates progressive hints (vague → specific)
5. 🔊 Ask for audio hint → Agent calls ElevenLabs TTS with voice cloning
6. 🏆 Score > 85 = WIN!
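
One way step 1 could serve the same puzzle to every player on a given day is to index a fixed pool by the calendar date. This is only a sketch: the `PUZZLES` pool, its entries, and `puzzle_for` are hypothetical names, and the real app may load its daily puzzle differently.

```python
from datetime import date

# Hypothetical puzzle pool; the game's real content is not shown in this README.
PUZZLES = [
    {"genre": "meme", "answer": "example meme line"},
    {"genre": "song", "answer": "example song lyric"},
    {"genre": "movie", "answer": "example movie quote"},
]


def puzzle_for(day: date) -> dict:
    """Cycle through the pool so a given date always maps to the same puzzle."""
    return PUZZLES[day.toordinal() % len(PUZZLES)]


today_puzzle = puzzle_for(date.today())
```

Because the index depends only on the date, no per-user state is needed to keep everyone on the same puzzle.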

🤖 Agentic Architecture (Two Agents)

┌─────────────────────────────────────────────────────────────────────┐
│                         VOICE SEMENTLE                              │
└─────────────────────────────────────────────────────────────────────┘
                                  │
          ┌───────────────────────┴───────────────────────┐
          ▼                                               ▼
┌─────────────────────────────┐             ┌─────────────────────────────┐
│    AGENT 1: MCP Advisor     │             │ AGENT 2: Chatbot + Tools    │
├─────────────────────────────┤             ├─────────────────────────────┤
│                             │             │                             │
│    🎤 User Voice            │             │     💬 User Chat             │
│          │                  │             │          │                  │
│          ▼                  │             │          ▼                  │
│  ┌───────────────┐          │             │  ┌───────────────┐          │
│  │  VoiceKit MCP │          │             │  │ Gemini 2.5    │          │
│  │  (SSE Server) │          │             │  │    Flash      │          │
│  └───────┬───────┘          │             │  └───────┬───────┘          │
│          │                  │             │          │                  │
│          ▼                  │             │    ┌─────┴─────┐            │
│    6 Voice Scores           │             │    ▼           ▼            │
│    (pitch, rhythm,          │             │  Text      Tool Call        │
│     energy, etc.)           │             │  Response  (autonomous)     │
│          │                  │             │                │            │
│          ▼                  │             │                ▼            │
│  ┌───────────────┐          │             │  ┌───────────────────────┐  │
│  │ Gemini 2.5    │          │             │  │  generate_audio_hint  │  │
│  │    Flash      │          │             │  └───────────┬───────────┘  │
│  └───────┬───────┘          │             │              │              │
│          │                  │             │              ▼              │
│          ▼                  │             │  ┌───────────────────────┐  │
│   Progressive Advice        │             │  │     ElevenLabs        │  │
│   (based on attempt #)      │             │  │  IVC + TTS Engine     │  │
│                             │             │  └───────────┬───────────┘  │
└─────────────────────────────┘             │              │              │
                                            │              ▼              │
                                            │       🔊 Audio Hint         │
                                            └─────────────────────────────┘

Agent 1: MCP Advisor

Analyzes voice via VoiceKit MCP and generates advice with Gemini 2.5 Flash.

Connects to the VoiceKit MCP server and calls `voicekit_analyze_voice_similarity`
Returns 6 scores: pitch, rhythm, energy, pronunciation, transcript, overall
Gemini 2.5 Flash generates progressive advice based on scores & attempt count

Progressive Advice Strategy:

Attempt 1: Extremely vague (no category revealed)
Attempt 2: Vague hint + category mentioned
Attempts 3-4: More specific context
Attempts 5-6: Quite specific (era, usage)
Attempts 7-10: Very specific (syllables, first letter, rhymes)
Attempt 11+: Pronunciation coaching mode
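
The attempt tiers above can be sketched as a small lookup that feeds the LLM prompt. The tier labels, `hint_level`, and `build_advice_prompt` are hypothetical names chosen for illustration; only the attempt ranges come from the list above.

```python
def hint_level(attempt: int) -> str:
    """Map a 1-based attempt number to one of the tiers listed above."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    if attempt == 1:
        return "extremely_vague"          # no category revealed
    if attempt == 2:
        return "vague_with_category"
    if attempt <= 4:
        return "more_specific"
    if attempt <= 6:
        return "quite_specific"           # era, usage
    if attempt <= 10:
        return "very_specific"            # syllables, first letter, rhymes
    return "pronunciation_coaching"


def build_advice_prompt(attempt: int, scores: dict[str, float]) -> str:
    """Assemble a prompt fragment from the tier and the voice scores (assumed format)."""
    score_text = ", ".join(f"{name}={value:.0f}" for name, value in sorted(scores.items()))
    return f"Hint level: {hint_level(attempt)}. Attempt: {attempt}. Scores: {score_text}."


prompt = build_advice_prompt(3, {"pitch": 70.4, "rhythm": 55.0})
```

Keeping the tier logic outside the prompt makes the progression deterministic, so the model only has to phrase the hint rather than decide how much to reveal.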

Agent 2: Chatbot (with Tool Calling)

Conversational chatbot powered by Gemini 2.5 Flash that provides text hints AND can autonomously call tools.

Answers user questions about the game
Provides additional hints on request
Tool calling: autonomously decides when to call `generate_audio_hint`, which produces audio via ElevenLabs TTS
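
A minimal sketch of how a chatbot turn might be routed to either plain text or a tool call. The `turn` dictionary shape, the `tool` decorator, and `handle_model_turn` are assumptions made for illustration; they are not the Gemini SDK's API, and the `generate_audio_hint` body here is a placeholder.

```python
from typing import Any, Callable

# Registry of tools the chatbot is allowed to call, keyed by function name.
TOOLS: dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function so it can be looked up by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def generate_audio_hint(hint_type: str = "syllable") -> str:
    # Placeholder: the real tool would call ElevenLabs and return audio.
    return f"<audio hint: {hint_type}>"


def handle_model_turn(turn: dict) -> str:
    """Return the model's text, or run the tool it asked for.

    Assumed shape: {"text": "..."} or {"tool_call": {"name": ..., "args": {...}}}.
    """
    call = turn.get("tool_call")
    if call is None:
        return turn.get("text", "")
    fn = TOOLS.get(call["name"])
    if fn is None:
        return f"Unknown tool: {call['name']}"
    return fn(**call.get("args", {}))
```

The model, not the user interface, chooses between the two branches, which is what makes the call autonomous.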

🔊 Audio Hints with ElevenLabs

The agent has access to `generate_audio_hint` and autonomously decides when to use it:

# User: "Can I hear how it sounds?"
# The agent decides to call the tool:
generate_audio_hint(hint_type="syllable")
#   → Clones the voice from reference audio (ElevenLabs IVC)
#   → Generates TTS with eleven_multilingual_v2
#   → Returns the audio to the user
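
The clone-then-speak flow can be sketched with the two ElevenLabs steps passed in as callables. `make_audio_hint`, `fake_clone`, and `fake_tts` are hypothetical stand-ins that only show the data flow; they do not reproduce the ElevenLabs SDK, and the project's actual `generate_audio_hint` may be structured differently.

```python
from typing import Callable


def make_audio_hint(
    hint_text: str,
    reference_audio: bytes,
    clone_voice: Callable[[bytes], str],
    synthesize: Callable[[str, str, str], bytes],
    model_id: str = "eleven_multilingual_v2",
) -> bytes:
    """Clone a voice from reference audio, then speak the hint with it."""
    voice_id = clone_voice(reference_audio)            # step 1: voice cloning
    return synthesize(hint_text, voice_id, model_id)   # steps 2-3: TTS, return audio


# Stand-ins for the real service calls, used only to exercise the flow.
def fake_clone(audio: bytes) -> str:
    return f"voice-{len(audio)}"


def fake_tts(text: str, voice_id: str, model_id: str) -> bytes:
    return f"{model_id}|{voice_id}|{text}".encode()


audio = make_audio_hint("da-da-DUM", b"\x00" * 4, fake_clone, fake_tts)
```

Injecting the two service calls keeps the pipeline testable without network access or API keys.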

ElevenLabs Features Used:

🎭 Instant Voice Cloning (IVC) — Clone voice from reference audio
🗣️ eleven_multilingual_v2 — High-quality multilingual TTS
🔊 Voice Library — Consistent character voices for hints

🛠️ Tech Stack

Component	Technology
Frontend	Gradio 6.0
Voice Analysis	VoiceKit MCP (SSE)
LLM Agent	Google Gemini 2.5 Flash
Audio Hints	ElevenLabs IVC + TTS
Database	PostgreSQL

📊 Scoring (6 Metrics)

Metric	What It Measures
🎵 Pitch	Tone accuracy
🥁 Rhythm	Timing & cadence
⚡ Energy	Intensity level
🗣️ Pronunciation	Clarity
📝 Transcript	Correct words (STT)
🏆 Overall	Combined (>85 = win)
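
A sketch of how the win rule could be applied to the scores. VoiceKit MCP returns its own overall score, and this README does not say how it is combined, so the unweighted mean below is purely an assumption; `overall_score` and `is_win` are hypothetical names. Only the threshold (overall > 85) comes from the table above.

```python
METRICS = ("pitch", "rhythm", "energy", "pronunciation", "transcript")
WIN_THRESHOLD = 85  # from the table above: overall > 85 wins


def overall_score(scores: dict[str, float]) -> float:
    """Assumed combination: unweighted mean of the five component scores."""
    missing = [m for m in METRICS if m not in scores]
    if missing:
        raise KeyError(f"missing metrics: {missing}")
    return sum(scores[m] for m in METRICS) / len(METRICS)


def is_win(overall: float) -> bool:
    """A guess wins only when the overall score is strictly above the threshold."""
    return overall > WIN_THRESHOLD


example = {m: 90.0 for m in METRICS}
won = is_win(overall_score(example))
```

Note that the comparison is strict, so a score of exactly 85 does not win.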

🎯 Why Voice Sementle?

Judging Criteria	Our Approach
UI/UX	Polished Gradio 6 interface, intuitive game flow
Functionality	VoiceKit MCP voice analysis + Gemini agentic chatbot + ElevenLabs tool calling
Creativity	First voice-based guessing game with performance scoring
Documentation	Clear README, architecture diagrams
Real-world Impact	Fun consumer app; language learning potential

🎮 Try It Now!

👆 Click the interface above to start playing!

Allow microphone access
Record your voice guess
Get scored on pitch, rhythm, energy & pronunciation
Ask for hints or audio examples
Keep trying until you win!

Built for MCP's 1st Birthday Hackathon 🎂

Celebrating one year of Model Context Protocol!