Building and Deploying a Real-Time Voice-to-Voice Chatbot: The Journey of VoiceBuddy

Generative AI is revolutionizing the tech landscape, enabling us to develop and deploy applications that were once unimaginable, prohibitively expensive, or extremely time-consuming. Today, I'm excited to share the journey of how I developed and deployed my real-time voice-to-voice chat application, VoiceBuddy. This article covers the development and deployment process, focusing on the tools, challenges, and solutions involved.

Here is my app link

https://huggingface.co/spaces/mfahadkhan/VoiceBuddy

Features of the App

  1. Real-Time Voice-to-Text Conversion

  2. Customizable Voice Output

  3. LLM Role Customization

  4. Interactive Chat History

  5. Real-Time Processing Feedback

  6. Gradio UI Components

  7. Real-Time Response Generation

Development Process

1. The Idea: VoiceBuddy

The journey began with a simple yet ambitious idea: to build a real-time voice-to-voice chat application. The concept was to create an AI-powered bot that could take spoken input, process it using a large language model (LLM), and then generate a spoken response—all in real-time.

2. Breaking Down the Problem

To turn this idea into reality, I needed to break down the problem into manageable parts:

  • Voice to Text: Convert spoken input into text.

  • Text to LLM: Pass the text input through a large language model.

  • LLM to Response Text: Generate a text-based response from the LLM.

  • Text to Voice: Convert the response text back into speech.

3. Choosing the Right Tools and Models

The following are some key sites I relied on:

  1. ChatGPT

  2. HuggingFace

  3. Groq

To implement this pipeline, I explored various tools and models:

  • Speech-to-Text: OpenAI's Whisper model, available on Hugging Face, was ideal for transcribing speech into text.

  • Large Language Model (LLM): I selected the LLaMA model from Groq for generating the AI's responses.

  • Text-to-Speech: Google's gTTS (Google Text-to-Speech) was chosen to convert the generated text back into speech.

  • Development and Testing: I chose Google Colab, as it provides free virtual resources (GPU and RAM) that are powerful and fast.
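Before wiring up a full interface, it helped me to confirm that these three pieces work together. Here is a minimal sketch of one pass through the pipeline (the input file name is a placeholder, and it assumes the GROQ_API_KEY environment variable is already set; the full application code appears later in this article):

import os
import whisper
from groq import Groq
from gtts import gTTS

# One pass through the pipeline: speech -> text -> LLM -> speech
# (assumes GROQ_API_KEY is set and an audio file named input.wav exists)
stt_model = whisper.load_model("base")
text = stt_model.transcribe("input.wav")["text"]

client = Groq(api_key=os.environ.get("GROQ_API_KEY"))
reply = client.chat.completions.create(
    messages=[{"role": "user", "content": text}],
    model="llama3-8b-8192",
).choices[0].message.content

gTTS(text=reply, lang="en").save("reply.mp3")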

4. Setting Up Groq for LLM Access

To use the LLaMA model through Groq, I needed to create an API key. This key would allow me to access the LLM and integrate it into my application.

Go to the following link to generate a key and read the documentation:

https://console.groq.com/keys
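Once the key is generated, it should be kept out of the code itself. One simple way to do that in Colab (the GROQ_API_KEY variable name is the one the client code in this article expects) is to prompt for the key interactively and store it as an environment variable:

import os
from getpass import getpass

# Prompt for the Groq API key and expose it to the rest of the notebook
# without hard-coding it in any cell
os.environ["GROQ_API_KEY"] = getpass("Enter your Groq API key: ")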

5. Coding with ChatGPT

The next step was to bring all the components together. I used Google Colab for development, as it provides virtual resources like GPUs and RAM, which are essential for running models efficiently.
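Before running any of the generated code, the required packages have to be installed in the Colab session. A typical install cell, assuming the usual PyPI package names for the libraries used here, looks like this:

# Install Whisper, the Groq client, gTTS, and Gradio into the Colab runtime
!pip install -q openai-whisper groq gTTS gradio

Whisper also depends on ffmpeg, which Colab images typically include already.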

To streamline the coding process, I leveraged ChatGPT. I provided it with a detailed prompt specifying my task, tools, and models, along with references to the necessary documentation. Here’s an example of how I framed my prompt:

" I am working on Google Colab and building voice to voice chat bot realtime application. Use openai whisper model for my input transcpritio , llama from groq as llm , gtts for text to speech . this application should be in real time . here is documentation from Groq "

import os
from groq import Groq

client = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Explain the importance of fast language models",
        }
    ],
    model="llama3-8b-8192",
)

print(chat_completion.choices[0].message.content)

6. Iterative Development and Troubleshooting

The real challenge began once I started coding. ChatGPT provided me with the initial code, which I executed in Google Colab. As expected, I encountered various issues, such as missing libraries, incompatible formats, and unexpected outputs.

The process became iterative: running the code, identifying errors, seeking help from ChatGPT, adjusting the code, and repeating this cycle until the application was fully functional. This trial-and-error approach was crucial in refining the bot to meet my specific needs.

7. Enhancements and Customization

Once the basic functionality was in place, I began experimenting with additional features. This included improving the user interface with Gradio, adding customization options for tone and role selection, and making the interaction more engaging and relatable.

The final code for VoiceBuddy included these enhancements, making it not just functional but also interactive and user-friendly.

import whisper
from groq import Groq
from gtts import gTTS
import os
import gradio as gr

# Initialize Groq client
client = Groq(api_key=os.environ.get("YOUR API KEY "))

# Load Whisper model
model = whisper.load_model("base")

# Function to modify the voice style based on user selection or default
def modify_voice_style(response, voice_style="default"):
    if voice_style == "male_serious":
        return gTTS(text=response, lang='en', tld='co.uk', slow=True)  # Male serious voice
    elif voice_style == "female_serious":
        return gTTS(text=response, lang='en', tld='com.au', slow=True)  # Female serious voice
    elif voice_style == "male_funny":
        return gTTS(text=response, lang='en', tld='co.in', slow=False)  # Male funny voice
    elif voice_style == "female_funny":
        return gTTS(text=response, lang='en', tld='com', slow=False)  # Female funny voice
    elif voice_style == "male_friendly":
        return gTTS(text=response, lang='en', tld='co.uk', slow=False)  # Male friendly voice
    elif voice_style == "female_friendly":
        return gTTS(text=response, lang='en', tld='ca', slow=False)  # Female friendly voice
    else:
        return gTTS(text=response, lang='en')  # Default voice

# Function to set the role of the LLM based on user selection or default
def set_llm_role(transcription, role="default"):
    if role == "friend":
        return f"Hey, buddy (you are my friend)! {transcription}"
    elif role == "junior":
        return f"Hey, listen up, kid! (you are my junior)! {transcription}"
    elif role == "senior":
        return f"Hello there, You are my senior. {transcription}"
    elif role == "parent":
        return f"You are like my parent: {transcription}"
    elif role == "teacher":
        return f"You are my teacher: {transcription}"
    else:
        return transcription  # Default behavior

# Function to process the audio and generate a response
def chatbot_pipeline(audio, voice_style="default", role="default", chat_history=None):
    # Avoid a shared mutable default; Gradio passes the session history in via gr.State
    if chat_history is None:
        chat_history = []

    # Transcribe audio using Whisper
    result = model.transcribe(audio)
    transcription = result['text']

    # Set the role of the LLM based on user selection or default
    modified_transcription = set_llm_role(transcription, role)

    # Update chat history with user's input
    chat_history.append({"role": "user", "content": modified_transcription})

    # Generate response using LLaMA
    chat_completion = client.chat.completions.create(
        messages=chat_history,
        model="llama3-8b-8192",
    )
    response = chat_completion.choices[0].message.content

    # Update chat history with the assistant's response
    chat_history.append({"role": "assistant", "content": response})

    # Convert response to speech using the selected or default voice style
    tts = modify_voice_style(response, voice_style)
    output_audio_path = "response.mp3"
    tts.save(output_audio_path)

    return transcription, response, output_audio_path, chat_history, ""

# Gradio UI
def gradio_interface(audio, voice_style, role, chat_history):
    # Show processing message
    transcription, response, output_audio_path, chat_history, processing_message = chatbot_pipeline(audio, voice_style, role, chat_history)
    return transcription, response, output_audio_path, chat_history, processing_message

# Create Gradio interface with enhanced design
interface = gr.Blocks()

with interface:
    with gr.Row():
        gr.Markdown("# VoiceBuddy")

    # Short description about usage
    gr.Markdown("Record or upload your audio file to chat with this buddy.")

    with gr.Row():
        # Sidebar for Transcription and LLaMA Response
        with gr.Column(scale=1):
            transcription_output = gr.Textbox(label="Transcription", interactive=False)
            response_output = gr.Textbox(label="LLaMA Response", interactive=False)

        # Main area for Audio Input and Output
        with gr.Column(scale=2):
            audio_input = gr.Audio(type="filepath", label="Record or Upload Audio", autoplay=False)

            # Options for voice style and role, with defaults
            voice_style = gr.Dropdown(
                choices=["default", "male_serious", "female_serious", "male_funny", "female_funny", "male_friendly", "female_friendly"],
                label="Select Voice Style",
                value="default"  # Set the default option
            )
            role = gr.Dropdown(
                choices=["default", "friend", "junior", "senior", "parent", "teacher"],
                label="Select LLM Role",
                value="default"  # Set the default option
            )

            audio_output = gr.Audio(label="Response Audio", autoplay=True)
            processing_message = gr.Textbox(value="Processing... Please wait.", label="Status", interactive=False)

    # Initialize chat history
    chat_history = gr.State([])

    # Update outputs when audio input changes
    audio_input.change(
        fn=gradio_interface,
        inputs=[audio_input, voice_style, role, chat_history],
        outputs=[transcription_output, response_output, audio_output, chat_history, processing_message],
        queue=True
    )

# Launch Gradio interface
interface.launch()
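While testing in Colab, it can be convenient to pass a couple of optional arguments to launch: share=True requests a temporary public URL and debug=True surfaces errors directly in the notebook output. This is just a testing convenience, not part of the deployed app:

# Optional while testing in Colab: temporary public link plus verbose errors
interface.launch(share=True, debug=True)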

Deployment Process

After successfully developing the VoiceBuddy chatbot, the next step was to deploy it so that others could interact with it in real-time. For deployment, I chose Hugging Face Spaces, which offers a free and user-friendly platform to host models and applications, ensuring they are accessible around the clock.

1. Setting Up the Deployment Environment

Deploying the application on Hugging Face Spaces involves a few key steps:

  • Main Application File (app.py): First, I created a main app.py file containing all the code necessary to run VoiceBuddy. This file serves as the heart of the application, managing everything from input processing to generating responses.

  • Dependencies File (requirements.txt): Next, I created a requirements.txt file. This file lists all the packages VoiceBuddy relies on, such as Whisper for transcription, the Groq client, gTTS for text-to-speech, and Gradio for the UI (a sample is shown after this list).

  • Environment Variables: I then set up the necessary environment variables, such as API keys and configuration settings, ensuring the application could securely access the required resources.
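For reference, a plausible requirements.txt for this application, based on the imports in app.py (the exact package names and any version pins are assumptions and may need adjusting), could look like this:

openai-whisper
groq
gTTS
gradio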

2. Deployment on Hugging Face Spaces

With these files in place, I uploaded them to my Hugging Face Space. The platform automatically recognized the structure of the application and began the build process. Hugging Face Spaces handles the heavy lifting by:

  1. Installing all dependencies listed in the requirements.txt file.

  2. Setting up the environment as specified.

  3. Building and launching the application.
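Hugging Face Spaces knows how to run the app from the Space's configuration, which for a Gradio Space lives in a small YAML block at the top of the README.md. A minimal example (the values shown here are illustrative assumptions) looks like this:

---
title: VoiceBuddy
sdk: gradio
app_file: app.py
---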

3. Going Live

Once the build process was complete, my VoiceBuddy chatbot was live and accessible to anyone with the link. I shared the link with friends and colleagues, who could now interact with the bot in real-time, experiencing the different tones and roles I had programmed.

Here is my app link

https://huggingface.co/spaces/mfahadkhan/VoiceBuddy

4. Sharing and Enjoying

The ease of deployment on Hugging Face Spaces made it possible to share VoiceBuddy widely. The platform’s reliability ensures that the application remains available and responsive, allowing users to enjoy a seamless experience.