shinse.ai

COGS108 Final Project by Kliment Ho, Jenny Xu, and Wendy Du.
Exploring the viability of synthetic audio to bridge the gap in low-resource voice cloning.


Project Overview

Abstract

The cost of collecting high-quality training data is a significant barrier in AI development. Our project explores using synthetic audio to supplement real voice data. We investigate whether this approach can create personalized, high-quality AI voices for teams with limited resources, ultimately making the technology more accessible.

Research Question & Hypothesis

Question: To what extent can synthetically generated audio improve the realism and similarity of an AI voice when trained on limited real-world data?

Hypothesis: We predict that a model trained on a small real dataset supplemented with synthetic audio will produce a voice that is significantly more realistic and closer to the target speaker than a model trained on the small dataset alone.

Exploratory Data Analysis

Before training our models, we analyzed our source datasets to understand their core characteristics. This essential step involved examining clip durations, visualizing audio as spectrograms, and testing simple data augmentations.
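
As a minimal sketch of this step (the helper names and directory layout below are illustrative, not our exact notebook code), clip durations and mel spectrograms can be computed with librosa and matplotlib:

import glob
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def plot_duration_histogram(clip_dir):
    # Load each clip at its native sample rate and measure its length in seconds
    durations = []
    for path in glob.glob(f"{clip_dir}/*.wav"):
        y, sr = librosa.load(path, sr=None)
        durations.append(len(y) / sr)
    plt.hist(durations, bins=30)
    plt.xlabel("Clip duration (s)")
    plt.ylabel("Number of clips")
    plt.show()

def plot_mel_spectrogram(wav_path):
    # Visualize one clip as a log-scaled mel spectrogram
    y, sr = librosa.load(wav_path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    librosa.display.specshow(librosa.power_to_db(mel, ref=np.max),
                             sr=sr, x_axis="time", y_axis="mel")
    plt.colorbar(format="%+2.0f dB")
    plt.show()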

Self-Collected Data Analysis

Figure: Duration distribution for self-collected data. The distribution of clip durations from our self-recorded data confirms that most clips are between two and ten seconds long, which is ideal for training.

Public Dataset Analysis

We performed the same analysis on the public LibriTTS and LJSpeech datasets to benchmark their properties and demonstrate augmentation techniques.

LibriTTS Dataset

Figure: Clip duration and word count distributions for LibriTTS samples.

Audio Augmentation Demo

  • Original
  • Pitch Up (+2 Semitones)
  • Pitch Down (-2 Semitones)
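
A minimal sketch of how these ±2-semitone variants can be generated with librosa (the helper name and file paths are illustrative, not from our actual notebooks):

import librosa
import soundfile as sf

def pitch_shift_clip(input_path, output_path, n_steps):
    # Shift pitch by n_steps semitones while keeping duration and sample rate unchanged
    wav, sr = librosa.load(input_path, sr=None)
    shifted = librosa.effects.pitch_shift(wav, sr=sr, n_steps=n_steps)
    sf.write(output_path, shifted, sr)

# Example: the +2 and -2 semitone versions demonstrated above
# pitch_shift_clip("clip.wav", "clip_up2.wav", n_steps=2)
# pitch_shift_clip("clip.wav", "clip_down2.wav", n_steps=-2)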

LJSpeech Dataset

Figure: Clip duration and word count distributions for LJSpeech samples.

Audio Augmentation Demo

  • Original
  • Pitch Up (+2 Semitones)
  • Pitch Down (-2 Semitones)

The RVC Process

Our project uses a Retrieval-Based Voice Conversion (RVC) model. This technique, popularized by tools like Applio, separates speech content from vocal timbre.

Step 1: Data Preparation

All audio clips are processed to a consistent format, which includes resampling to a 40 kHz sample rate and normalizing volume. This step is critical for model stability.


import librosa
import soundfile as sf

# Prepare audio files for RVC training
def process_audio_for_rvc(input_path, output_path, target_sr=40000):
    # Load at the native sample rate, resample to the target rate, then peak-normalize
    wav, sr = librosa.load(input_path, sr=None)
    if sr != target_sr:
        wav = librosa.resample(wav, orig_sr=sr, target_sr=target_sr)
    wav = librosa.util.normalize(wav)
    sf.write(output_path, wav, target_sr)

Step 2: Feature Extraction

The model uses a Content Encoder like HuBERT to extract speaker-independent linguistic features. Simultaneously, a Timbre Encoder creates a unique vocal "fingerprint" using Mel-Frequency Cepstral Coefficients (MFCCs).


import librosa
import numpy as np

# Create a vocal fingerprint using MFCCs
def get_spectral_fingerprint(wav, n_mfcc=40):
    # wav is assumed to be a mono waveform already resampled to 16 kHz
    mfccs = librosa.feature.mfcc(y=wav, sr=16000, n_mfcc=n_mfcc)
    # The mean and std dev of coefficients form a vector of the voice's timbre.
    return np.concatenate((np.mean(mfccs, axis=1), np.std(mfccs, axis=1)))
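
As a usage illustration (a sketch, not necessarily the objective similarity metric used in our final analysis), two such fingerprints can be compared with cosine similarity to quantify how close a cloned voice's timbre is to its target:

import numpy as np

def fingerprint_similarity(wav_a, wav_b):
    # Cosine similarity between fingerprint vectors; values near 1.0 indicate similar timbre
    fa = get_spectral_fingerprint(wav_a)
    fb = get_spectral_fingerprint(wav_b)
    return float(np.dot(fa, fb) / (np.linalg.norm(fa) * np.linalg.norm(fb)))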

Step 3: Transformer Fusion

A Transformer architecture takes the generic content features and "attends" to the target speaker's timbre features. This process learns the rules for applying the target voice's style to any given content.


import torch.nn as nn

# Simplified RVC Transformer block in PyTorch
class RVCBlock(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim, num_heads=8)
        self.feed_forward = nn.Linear(embed_dim, embed_dim)
        # Layer normalization is crucial for stability
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, content, timbre):
        # Content features attend to the target speaker's timbre features (cross-attention)
        attended, _ = self.attention(content, timbre, timbre)
        x = self.norm1(content + attended)
        return self.norm2(x + self.feed_forward(x))
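
As a quick sanity check of how the block is wired (the tensor shapes below are illustrative), dummy content and timbre features can be passed through it; note that nn.MultiheadAttention defaults to a (sequence, batch, embedding) layout:

import torch

block = RVCBlock(embed_dim=256)
content = torch.randn(200, 1, 256)  # 200 frames of content features, batch of 1
timbre = torch.randn(50, 1, 256)    # 50 frames of target-speaker timbre features
fused = block(content, timbre)      # -> shape (200, 1, 256): content fused with the target timbre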

Final Analysis and Results

Result Visualizations

Figure: MOS comparison chart. Mean Opinion Scores show a clear quality gap between models trained on professional versus self-recorded data.

Figure: Objective metrics scatter plot. Objective metrics confirm the trend: high-quality data from speaker David clusters in the ideal top-left.

Figure: Spectrogram comparison. Spectrograms show that the 'Limited Real' model produces less spectral richness for speaker A1.

Audio Sample Comparison

Listen to the generated samples below; note the artifacts in the lower-quality models versus the clarity of the benchmarks.

Speaker A1

  • Benchmark: Extensive Real Data
  • Model 1: Limited Real Data Only
  • Model 2: Limited + Bootstrapping

Speaker David: LibriTTS Dataset

  • Ground Truth: Original Audio
  • Extensive Real Data (Generated)
  • Limited Real Data (Generated)

Conclusion

The success of AI voice cloning is critically dependent on the quality of the source audio. Pristine datasets yield excellent results, even with limited data. Self-recorded audio with minor imperfections leads to lower-quality clones because the model amplifies these flaws. While simple data augmentation like bootstrapping offers slight improvement, it cannot overcome poor source quality.

Our findings lead to a nuanced conclusion: synthetic augmentation can help, but it cannot fix fundamentally flawed source data. For teams with limited resources, the focus must be on capturing a small amount of extremely high-quality data. The adage "garbage in, garbage out" holds firmly in this domain.

Project Files & Team

Project Files

Access the core documents of our project on GitHub. Note: a GitHub account may be required to view notebook files.

Team Contributions

  • Kliment Ho: Research concept, experimental design, dataset curation, RVC model research, writing.
  • Jenny Xu: Data management and processing, EDA, objective metric calculation, writing.
  • Wendy Du: Experiment execution, synthetic audio generation, video editing, writing.