Sungjoon Lee
[STYLE] Architecture fix
4a8de28

metadata
title: VOICE SEMENTLE
emoji: 🎙️
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: 6.0.0
app_file: client/app.py
pinned: false
tags:
  - mcp-in-action-track-creative

🎙️ Voice Sementle

Daily voice puzzle game: guess the meme, song, or movie quote, but you have to SAY IT RIGHT!

It's not just what you say, it's how you say it. Your pitch, rhythm, energy, and pronunciation all matter.

🗓️ New puzzle every day • 🎭 3 genres (memes, songs, movies) • 🧠 AI hints that get smarter


📋 Submission Info

| Track | MCP in Action (Creative) |
|---|---|
| MCP Used | VoiceKit MCP |
| LLM | Google Gemini 2.5 Flash |
| Voice AI | ElevenLabs (Voice Cloning + TTS) |
| Framework | Gradio 6.0 |

📢 Social Post: View on LinkedIn
📢 Social Post: View on X
🎬 Demo Video: Watch (1-5 min)
👥 Team: @LisaVLee, @SabaPivot, @daheepk, @tchoi911, @Lucian25


✅ Track 2 Requirements

| Requirement | How We Fulfill It |
|---|---|
| Autonomous Agent | Two agents: MCP Advisor (voice analysis) + Chatbot (text + audio hints) |
| MCP as Tools | VoiceKit MCP (voicekit_analyze_voice_similarity) for voice analysis |
| Gradio App | Built with Gradio 6.0 |
| Tool Calling | Chatbot autonomously calls generate_audio_hint → ElevenLabs TTS |

🎮 How It Works

1. 🎯 The daily puzzle loads (meme, song, or movie quote)
2. 🎤 You record your voice guess
3. 🔊 VoiceKit MCP analyzes pitch, rhythm, energy, pronunciation, and transcript
4. 🧠 The Gemini agent generates progressive hints (vague → specific)
5. 🔊 Ask for an audio hint → the agent calls ElevenLabs TTS with voice cloning
6. 🏆 Overall score above 85 = WIN!
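The flow above can be sketched as a simple turn loop. This is a minimal sketch, not the app's actual code: `play_puzzle`, `analyze`, and `advise` are hypothetical names standing in for the VoiceKit MCP analysis and the Gemini advisor, and the overall score is assumed to be on a 0-100 scale.

```python
from typing import Callable, Iterable, Optional

WIN_THRESHOLD = 85  # from the README: an overall score above 85 wins


def play_puzzle(
    recordings: Iterable[bytes],
    analyze: Callable[[bytes], dict],
    advise: Callable[[dict, int], str],
) -> Optional[int]:
    """Return the attempt number on which the player won, or None.

    `analyze` and `advise` are hypothetical placeholders for the MCP
    voice analysis and the Gemini advice step.
    """
    for attempt, audio in enumerate(recordings, start=1):
        scores = analyze(audio)  # step 3: pitch, rhythm, energy, ...
        if scores["overall"] > WIN_THRESHOLD:  # step 6: strictly above 85
            return attempt
        hint = advise(scores, attempt)  # step 4: hint depends on the attempt number
        print(f"Attempt {attempt}: {hint}")
    return None
```

Passing the attempt number into the advisor is what lets the hints get progressively more specific.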

🤖 Agentic Architecture (Two Agents)

┌─────────────────────────────────────────────────────────────────────────┐
│                             VOICE SEMENTLE                              │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
               ┌─────────────────────┴─────────────────────┐
               ▼                                           ▼
┌─────────────────────────────┐             ┌─────────────────────────────┐
│    AGENT 1: MCP Advisor     │             │ AGENT 2: Chatbot + Tools    │
├─────────────────────────────┤             ├─────────────────────────────┤
│                             │             │                             │
│    🎤 User Voice            │             │    💬 User Chat             │
│          │                  │             │          │                  │
│          ▼                  │             │          ▼                  │
│  ┌───────────────┐          │             │  ┌───────────────┐          │
│  │  VoiceKit MCP │          │             │  │ Gemini 2.5    │          │
│  │  (SSE Server) │          │             │  │    Flash      │          │
│  └───────┬───────┘          │             │  └───────┬───────┘          │
│          │                  │             │          │                  │
│          ▼                  │             │    ┌─────┴─────┐            │
│    6 Voice Scores           │             │    ▼           ▼            │
│    (pitch, rhythm,          │             │  Text      Tool Call        │
│     energy, etc.)           │             │  Response  (autonomous)     │
│          │                  │             │                │            │
│          ▼                  │             │                ▼            │
│  ┌───────────────┐          │             │  ┌───────────────────────┐  │
│  │ Gemini 2.5    │          │             │  │  generate_audio_hint  │  │
│  │    Flash      │          │             │  └───────────┬───────────┘  │
│  └───────┬───────┘          │             │              │              │
│          │                  │             │              ▼              │
│          ▼                  │             │  ┌───────────────────────┐  │
│   Progressive Advice        │             │  │     ElevenLabs        │  │
│   (based on attempt #)      │             │  │  IVC + TTS Engine     │  │
│                             │             │  └───────────┬───────────┘  │
└─────────────────────────────┘             │              │              │
                                            │              ▼              │
                                            │       🔊 Audio Hint         │
                                            └─────────────────────────────┘

Agent 1: MCP Advisor

Analyzes voice via VoiceKit MCP and generates advice with Gemini 2.5 Flash.

  • Connects to the VoiceKit MCP server and calls voicekit_analyze_voice_similarity
  • Returns 6 scores: pitch, rhythm, energy, pronunciation, transcript, overall
  • Gemini 2.5 Flash generates progressive advice based on the scores and the attempt count
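A minimal sketch of turning the MCP tool result into the six scores. It assumes the tool returns its scores as a JSON object in a text payload, with keys named after the metrics; both the response shape of voicekit_analyze_voice_similarity and the key names, as well as `parse_voice_scores` itself, are assumptions for illustration.

```python
import json

# Assumed key names; the real VoiceKit response schema may differ.
METRICS = ("pitch", "rhythm", "energy", "pronunciation", "transcript", "overall")


def parse_voice_scores(payload: str) -> dict[str, float]:
    """Parse a JSON text payload into the six metric scores."""
    data = json.loads(payload)
    missing = [m for m in METRICS if m not in data]
    if missing:
        raise ValueError(f"missing metrics: {missing}")
    # Keep only the six metrics and coerce them to floats.
    return {m: float(data[m]) for m in METRICS}
```

Failing loudly on a missing metric keeps the advisor from prompting Gemini with partial data.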

Progressive Advice Strategy:

  • Attempt 1: Extremely vague (no category revealed)
  • Attempt 2: Vague hint + category mentioned
  • Attempts 3-4: More specific context
  • Attempts 5-6: Quite specific (era, usage)
  • Attempts 7-10: Very specific (syllables, first letter, rhymes)
  • Attempt 11+: Pronunciation coaching mode
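One way to encode this schedule is a plain lookup that the advisor's prompt builder could call with the current attempt count. The function name and tier labels are illustrative, not taken from the project's code.

```python
def hint_level(attempt: int) -> str:
    """Map a 1-based attempt number to the advice tier listed above."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    if attempt == 1:
        return "extremely vague, no category"
    if attempt == 2:
        return "vague, category mentioned"
    if attempt <= 4:
        return "more specific context"
    if attempt <= 6:
        return "quite specific (era, usage)"
    if attempt <= 10:
        return "very specific (syllables, first letter, rhymes)"
    return "pronunciation coaching"
```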

Agent 2: Chatbot (with Tool Calling)

Conversational chatbot powered by Gemini 2.5 Flash that provides text hints AND can autonomously call tools.

  • Answers user questions about the game
  • Provides additional hints on request
  • Tool calling: autonomously decides when to call generate_audio_hint → ElevenLabs TTS

🔊 Audio Hints with ElevenLabs

The agent has access to generate_audio_hint and autonomously decides when to use it:

# User: "Can I hear how it sounds?"
# The agent decides on its own to call the tool:
generate_audio_hint(hint_type="syllable")
#   → clones the voice from the reference audio (ElevenLabs IVC)
#   → generates TTS with eleven_multilingual_v2
#   → returns the audio to the user
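The tool-calling step boils down to declaring the tool to the model and routing whatever call it emits to a local function. The sketch below is framework-agnostic and hypothetical: `TOOL_DECLARATION`, `TOOLS`, `dispatch`, the `{"name": ..., "args": ...}` call shape, and the stubbed body of `generate_audio_hint` are illustrative; only the tool name and the `hint_type="syllable"` argument come from this README. The parameter block uses the JSON-Schema-style format that function-calling LLM APIs commonly accept.

```python
from typing import Any, Callable

# Hypothetical declaration handed to the LLM; only the name and the
# hint_type parameter appear in the README.
TOOL_DECLARATION = {
    "name": "generate_audio_hint",
    "description": "Speak an audio hint in a cloned voice.",
    "parameters": {
        "type": "object",
        "properties": {"hint_type": {"type": "string"}},
        "required": ["hint_type"],
    },
}


def generate_audio_hint(hint_type: str) -> str:
    # Stub: the real tool clones the reference voice with ElevenLabs IVC
    # and synthesizes speech with eleven_multilingual_v2.
    return f"hint_{hint_type}.mp3"


# Registry mapping tool names the model may emit to local functions.
TOOLS: dict[str, Callable[..., Any]] = {"generate_audio_hint": generate_audio_hint}


def dispatch(call: dict) -> Any:
    """Run a model-issued call of the form {"name": ..., "args": {...}}."""
    func = TOOLS.get(call["name"])
    if func is None:
        raise KeyError(f"unknown tool: {call['name']}")
    return func(**call.get("args", {}))
```

Keeping the registry explicit means the model can only trigger functions that were deliberately exposed to it.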

ElevenLabs Features Used:

  • 🎭 Instant Voice Cloning (IVC): clones a voice from reference audio
  • 🗣️ eleven_multilingual_v2: high-quality multilingual TTS
  • 🔊 Voice Library: consistent character voices for hints
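For reference, a text-to-speech call against ElevenLabs' REST API can be assembled with the standard library alone. This sketch only builds the request and does not send it. The endpoint path, `xi-api-key` header, and `model_id` field follow ElevenLabs' public API as we understand it and should be checked against the current docs. `build_tts_request` is a hypothetical helper, and the `voice_id` is assumed to come from the Instant Voice Cloning step.

```python
import json
import urllib.request

API_BASE = "https://api.elevenlabs.io/v1"


def build_tts_request(voice_id: str, text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for eleven_multilingual_v2 TTS."""
    body = json.dumps({"text": text, "model_id": "eleven_multilingual_v2"}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/text-to-speech/{voice_id}",
        data=body,
        headers={
            "xi-api-key": api_key,
            "Content-Type": "application/json",
            "Accept": "audio/mpeg",
        },
        method="POST",
    )


# Sending it would return the audio bytes, e.g.:
# with urllib.request.urlopen(build_tts_request(voice_id, text, key)) as resp:
#     audio = resp.read()
```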

🛠️ Tech Stack

| Component | Technology |
|---|---|
| Frontend | Gradio 6.0 |
| Voice Analysis | VoiceKit MCP (SSE) |
| LLM Agent | Google Gemini 2.5 Flash |
| Audio Hints | ElevenLabs IVC + TTS |
| Database | PostgreSQL |

📊 Scoring (6 Metrics)

| Metric | What It Measures |
|---|---|
| 🎵 Pitch | Tone accuracy |
| 🥁 Rhythm | Timing & cadence |
| ⚡ Energy | Intensity level |
| 🗣️ Pronunciation | Clarity |
| 📝 Transcript | Correct words (STT) |
| 🏆 Overall | Combined score (> 85 = win) |

🎯 Why Voice Sementle?

| Judging Criteria | Our Approach |
|---|---|
| UI/UX | Polished Gradio 6 interface, intuitive game flow |
| Functionality | VoiceKit MCP analysis, agentic Gemini chatbot, ElevenLabs tool calling |
| Creativity | First voice-based guessing game with performance scoring |
| Documentation | Clear README, architecture diagrams |
| Real-world Impact | Fun consumer app; language-learning potential |

🎮 Try It Now!

👆 Click the interface above to start playing!

  1. Allow microphone access
  2. Record your voice guess
  3. Get scored on pitch, rhythm, energy & pronunciation
  4. Ask for hints or audio examples
  5. Keep trying until you win!

Built for MCP's 1st Birthday Hackathon 🎂

Celebrating one year of Model Context Protocol!