COGS108 Final Project by Kliment Ho, Jenny Xu, and Wendy Du.
Exploring the viability of synthetic audio to bridge the gap in low-resource voice cloning.
High-quality data collection for AI is a significant cost barrier. Our project explores using synthetic audio to supplement real voice data. We investigate if this approach can create personalized, high-quality AI voices for teams with limited resources, ultimately making the technology more accessible.
Question: To what extent can synthetically generated audio improve the realism and similarity of an AI voice when trained on limited real-world data?
Hypothesis: We predict a model trained on a small real dataset combined with synthetic audio will produce a voice significantly more realistic and closer to the target speaker than a model trained on the small dataset alone.
Before training our models, we analyzed our source datasets to understand their core characteristics. This essential step involved examining clip durations, visualizing audio as spectrograms, and testing simple data augmentations.
Figure: This distribution of clip durations from our self-recorded data confirms that most clips are between two and ten seconds long, which is ideal for training.
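Below is a minimal sketch of how such a duration histogram can be produced with librosa and matplotlib; the folder path is a placeholder, not our actual data location.

# Sketch: histogram of clip durations for a folder of recordings
import glob
import librosa
import matplotlib.pyplot as plt

durations = []
for path in glob.glob("recordings/*.wav"):  # placeholder path
    wav, sr = librosa.load(path, sr=None)
    durations.append(librosa.get_duration(y=wav, sr=sr))

plt.hist(durations, bins=30)
plt.xlabel("Clip duration (seconds)")
plt.ylabel("Number of clips")
plt.show()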
We performed the same analysis on the public LibriTTS and LJSpeech datasets to benchmark their properties and demonstrate augmentation techniques.
Figure: Clip duration and word count for LibriTTS samples.
Original
Pitch Up (+2 Semitones)
Pitch Down (-2 Semitones)
Figure: Clip duration and word count for LJSpeech samples.
Original
Pitch Up (+2 Semitones)
Pitch Down (-2 Semitones)
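The pitch-shifted variants above can be generated with librosa's pitch_shift. The helper below is a sketch written for this write-up, not code from our training pipeline.

# Sketch: create +/- 2 semitone pitch-shifted variants of a clip
import librosa
import soundfile as sf

def make_pitch_variants(input_path, n_steps=2):
    wav, sr = librosa.load(input_path, sr=None)
    up = librosa.effects.pitch_shift(wav, sr=sr, n_steps=n_steps)
    down = librosa.effects.pitch_shift(wav, sr=sr, n_steps=-n_steps)
    sf.write(input_path.replace(".wav", "_up.wav"), up, sr)
    sf.write(input_path.replace(".wav", "_down.wav"), down, sr)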
Our project uses a Retrieval-Based Voice Conversion (RVC) model. This technique, popularized by tools like Applio, separates what is said (the speech content) from how it sounds (the vocal timbre).
All audio clips are processed to a consistent format: each is resampled to 40 kHz and volume-normalized. This step is critical for model stability.
# Prepare audio files for RVC training
import librosa
import soundfile as sf

def process_audio_for_rvc(input_path, output_path, target_sr=40000):
    # Load at the native sample rate, then resample to the 40 kHz target
    wav, sr = librosa.load(input_path, sr=None)
    if sr != target_sr:
        wav = librosa.resample(wav, orig_sr=sr, target_sr=target_sr)
    # Peak-normalize the volume and write the processed clip to disk
    wav = librosa.util.normalize(wav)
    sf.write(output_path, wav, target_sr)
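As a usage sketch, each raw recording is passed through this step before training; the paths are placeholders.

# Example call with placeholder paths
process_audio_for_rvc("raw/clip_001.wav", "dataset/clip_001.wav")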
The model uses a Content Encoder such as HuBERT to extract speaker-independent linguistic features. Simultaneously, a Timbre Encoder creates a unique vocal "fingerprint" from Mel-Frequency Cepstral Coefficients (MFCCs).
# Create a vocal fingerprint using MFCCs
import numpy as np
import librosa

def get_spectral_fingerprint(wav, n_mfcc=40):
    # Extract MFCCs from audio sampled at 16 kHz
    mfccs = librosa.feature.mfcc(y=wav, sr=16000, n_mfcc=n_mfcc)
    # The mean and std dev of each coefficient form a compact vector describing the voice's timbre
    return np.concatenate((np.mean(mfccs, axis=1), np.std(mfccs, axis=1)))
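For example, a clip loaded at 16 kHz (matching the MFCC extraction above) yields an 80-dimensional fingerprint of 40 means and 40 standard deviations; the file name is a placeholder.

# Example call with a placeholder file name
wav, _ = librosa.load("clip.wav", sr=16000)
fingerprint = get_spectral_fingerprint(wav)  # shape: (80,)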
A Transformer architecture takes the generic content features and "attends" to the target speaker's timbre features. This process learns the rules for applying the target voice's style to any given content.
# Simplified RVC Transformer block in PyTorch
import torch.nn as nn

class RVCBlock(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.feed_forward = nn.Linear(embed_dim, embed_dim)
        # Layer normalization is crucial for stability
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, content, timbre):
        # Content features attend to the target speaker's timbre features
        attended, _ = self.attention(content, timbre, timbre)
        x = self.norm1(content + attended)
        return self.norm2(x + self.feed_forward(x))
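A quick usage sketch with illustrative dimensions: a batch of one, 256-dimensional features, and arbitrary sequence lengths standing in for encoder outputs.

# Example forward pass with random tensors standing in for encoder outputs
import torch

block = RVCBlock(embed_dim=256)
content = torch.randn(1, 120, 256)  # content frames from the Content Encoder
timbre = torch.randn(1, 40, 256)    # timbre frames for the target speaker
converted = block(content, timbre)  # shape: (1, 120, 256)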

Mean Opinion Score (MOS) ratings show a clear quality gap between models trained on professional recordings versus self-recorded data.
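MOS is the mean of listener ratings, conventionally on a one-to-five scale. A minimal sketch of the computation, assuming ratings are gathered as a simple list, with a 95% normal-approximation confidence interval:

# Sketch: mean opinion score with a 95% confidence interval from 1-5 listener ratings
import numpy as np

def mos_with_ci(ratings):
    r = np.asarray(ratings, dtype=float)
    half_width = 1.96 * r.std(ddof=1) / np.sqrt(len(r))
    return r.mean(), half_width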

Objective metrics confirm the trend: results from speaker David's high-quality data cluster in the ideal top-left of the plot.
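One simple objective measure of speaker similarity, shown here as an illustrative sketch rather than the exact metric behind the plot, is the cosine similarity between two of the MFCC fingerprints defined earlier.

# Sketch: cosine similarity between two spectral fingerprints (closer to 1 = more similar)
import numpy as np

def fingerprint_similarity(fp_a, fp_b):
    return float(np.dot(fp_a, fp_b) / (np.linalg.norm(fp_a) * np.linalg.norm(fp_b)))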

Spectrograms show the 'Limited Real' model produces less spectral richness for speaker A1.
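For reference, a mel spectrogram like the ones compared here can be rendered with librosa; the file name below is a placeholder.

# Sketch: plot the mel spectrogram of a generated clip
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

wav, sr = librosa.load("generated_sample.wav", sr=None)  # placeholder path
mel = librosa.feature.melspectrogram(y=wav, sr=sr)
librosa.display.specshow(librosa.power_to_db(mel, ref=np.max),
                         sr=sr, x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.title("Mel spectrogram")
plt.show()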
Listen to the generated samples below; note the artifacts in the lower-quality models compared with the clarity of the benchmarks.
Benchmark: Extensive Real Data
Model 1: Limited Real Data Only
Model 2: Limited + Bootstrapping
Ground Truth: Original Audio
Extensive Real Data (Generated)
Limited Real Data (Generated)
The success of AI voice cloning is critically dependent on the quality of the source audio. Pristine datasets yield excellent results, even with limited data. Self-recorded audio with minor imperfections leads to lower-quality clones because the model amplifies these flaws. While simple data augmentation like bootstrapping offers slight improvement, it cannot overcome poor source quality.
Our findings lead to a nuanced conclusion. Synthetic augmentation can help, but it cannot fix fundamentally flawed source data. For teams with limited resources, the focus must be on capturing a small amount of extremely high-quality data. The adage "garbage in, garbage out" remains paramount in this domain.
Access the core documents of our project on GitHub. Note: a GitHub account may be required to view notebook files.