
AI Speech-to-Text on Linux - Complete Guide

Comprehensive guide to setting up AI-powered speech-to-text on Linux using OpenAI Whisper, Vosk, and other tools for accurate audio transcription

Introduction

Speech-to-text technology has revolutionized how we interact with computers. With modern AI models like OpenAI Whisper, you can achieve near-human accuracy for transcription on your Linux machine. This guide covers multiple solutions from local AI models to cloud services, all running on Linux.

Why Use AI Speech-to-Text on Linux?

  • Privacy - Process audio locally without sending to cloud services
  • Offline capability - Work without internet connection
  • Cost-effective - No subscription fees for local models
  • Customization - Fine-tune models for specific domains
  • Integration - Easy integration with Linux workflows and scripts

Option 1: OpenAI Whisper

Whisper is OpenAI's open-source speech recognition model, with exceptional accuracy across multiple languages. It was trained on 680,000 hours of multilingual data, making it robust to accents, background noise, and technical language.

Key Features

  • Multilingual Support: 99 languages with automatic language detection
  • Multitask Model: Speech recognition, translation, and language identification
  • High Accuracy: Makes up to 50% fewer errors than many specialized models on out-of-distribution audio
  • Robust Performance: Works well with background noise and technical language
  • Translation: Can translate non-English speech to English

Installation

		# Install Python and pip if not already installed
sudo pacman -S python python-pip  # Arch
sudo apt install python3 python3-pip  # Ubuntu/Debian
 
# Install ffmpeg for audio processing
sudo pacman -S ffmpeg  # Arch
sudo apt install ffmpeg  # Ubuntu/Debian
 
# Install Whisper (on distros with PEP 668 "externally managed" Python,
# use a virtual environment or pipx instead of a bare pip install)
pip install -U openai-whisper
	

Available Models

Model    Parameters   English-only   Multilingual   Required VRAM   Relative Speed
tiny     39 M         tiny.en        tiny           ~1 GB           ~32x
base     74 M         base.en        base           ~1 GB           ~16x
small    244 M        small.en       small          ~2 GB           ~6x
medium   769 M        medium.en      medium         ~5 GB           ~2x
large    1550 M       N/A            large          ~10 GB          1x
turbo    809 M        N/A            turbo          ~6 GB           ~8x

The turbo model is an optimized version of large-v3 offering faster transcription with minimal accuracy loss.

Basic Usage

		# Transcribe an audio file
whisper audio.mp3
 
# Specify model size (tiny, base, small, medium, large, turbo)
whisper audio.mp3 --model medium
 
# Output to specific format
whisper audio.mp3 --output_format txt
 
# Transcribe with timestamps
whisper audio.mp3 --output_format srt
 
# Specify language for better accuracy
whisper audio.mp3 --language English
 
# Translate to English
whisper audio.mp3 --task translate
	

Python API

		import whisper
 
# Load model
model = whisper.load_model("turbo")
 
# Transcribe
result = model.transcribe("audio.mp3")
 
# Print result
print(result["text"])
 
# Get detailed segments with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")
	
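
The Python API also exposes Whisper's language identification, which is what powers the automatic language detection mentioned above. A minimal sketch following the pattern from the Whisper README (the file name is a placeholder):

		import whisper

model = whisper.load_model("base")

# Load 30 seconds of audio and compute the log-Mel spectrogram
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
	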

Hybrid Whisper-Vosk Real-Time Transcription

For applications requiring both speed and accuracy, consider a hybrid approach combining Whisper and Vosk. This method uses Vosk for fast real-time transcription with Whisper running in the background to correct errors.

How It Works

  1. Vosk provides real-time transcription via WebSocket for immediate feedback
  2. Whisper processes the same audio in the background with a short delay
  3. Compare outputs using Levenshtein distance to detect significant differences
  4. Automatically correct Vosk's output when Whisper disagrees

Implementation Example

		import asyncio
import json
import tempfile
import wave

import Levenshtein
import vosk
import whisper
from vosk import KaldiRecognizer

class HybridTranscriber:
    def __init__(self):
        # Vosk handles the fast, real-time pass
        self.vosk_model = vosk.Model("vosk-model-small-en-us")
        self.recognizer = KaldiRecognizer(self.vosk_model, 16000)

        # Whisper runs behind it as the accuracy pass
        self.whisper_model = whisper.load_model("base")

        # Raw 16-bit PCM chunks kept around for the Whisper pass
        self.audio_buffer = []
        self.correction_delay = 2.0  # seconds Whisper lags behind Vosk

    async def transcribe_with_corrections(self, audio_stream):
        vosk_text = ""

        while True:
            # Fast pass: wait for the next finalized Vosk phrase
            vosk_result = await self._vosk_transcribe(audio_stream)
            if vosk_result:
                vosk_text += vosk_result + " "
                print(f"VOSK: {vosk_result}")

            # Slow pass: re-transcribe the buffered audio with Whisper
            whisper_result = await self._whisper_correct()
            if whisper_result:
                # Compare the tail of the Vosk transcript with Whisper's output
                tail = vosk_text[-len(whisper_result):]
                distance = Levenshtein.distance(tail, whisper_result)
                if distance > len(whisper_result) * 0.3:  # 30% difference threshold
                    print(f"WHISPER CORRECTION: {tail!r} -> {whisper_result!r}")
                    vosk_text = vosk_text[:-len(tail)] + whisper_result

    async def _vosk_transcribe(self, audio_stream):
        # Feed PCM chunks to Vosk until it finalizes a phrase
        while True:
            data = await audio_stream.read(4000)
            self.audio_buffer.append(data)
            if self.recognizer.AcceptWaveform(data):
                result = json.loads(self.recognizer.Result())
                return result["text"]

    async def _whisper_correct(self):
        # Let Vosk stay ahead, then re-check the buffered audio with Whisper
        # (blocking call; a real implementation would off-load this to a thread)
        await asyncio.sleep(self.correction_delay)
        with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
            with wave.open(tmp.name, "wb") as wf:
                wf.setnchannels(1)
                wf.setsampwidth(2)      # 16-bit samples
                wf.setframerate(16000)
                wf.writeframes(b"".join(self.audio_buffer))
            result = self.whisper_model.transcribe(tmp.name)
        self.audio_buffer = []
        return result["text"]
	

This hybrid approach provides:

  • Immediate feedback from Vosk (real-time)
  • High accuracy corrections from Whisper (1-2 second delay)
  • Visual indicators when corrections are applied
  • Trust scoring to weigh model confidence (a simple sketch follows this list)
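
The trust-scoring idea is intentionally left simple here; a minimal sketch of one possible scheme, weighting Whisper more heavily when Vosk reports low word-level confidence (the 0.7 and 0.85 thresholds are arbitrary, and the word confidences assume KaldiRecognizer.SetWords(True) is enabled):

		import Levenshtein

def should_apply_correction(vosk_result, whisper_text):
    # vosk_result is the parsed JSON from KaldiRecognizer.Result() with
    # SetWords(True), so each word carries a "conf" score
    words = vosk_result.get("result", [])
    if not words:
        return bool(whisper_text)

    # Average Vosk word confidence: how much we trust the fast pass
    vosk_conf = sum(w["conf"] for w in words) / len(words)

    # Similarity between the two hypotheses (1.0 means identical)
    similarity = Levenshtein.ratio(vosk_result.get("text", ""), whisper_text)

    # Correct only when the outputs diverge and Vosk itself was not confident
    return similarity < 0.7 and vosk_conf < 0.85
	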

Model Comparison: Whisper vs Alternatives

Based on comprehensive benchmarks, here's how Whisper compares to other open-source transcription models:

Accuracy Comparison

Model            Word Error Rate   Strengths                                                   Limitations
Whisper Large    ~5-10%            State-of-the-art accuracy, multilingual, robust to noise   High resource requirements
Whisper Medium   ~10-15%           Good balance of accuracy/speed                              Still resource-intensive
Whisper Small    ~15-25%           Fast, good for most applications                            Lower accuracy on complex audio
Vosk             ~15-30%           Fast, lightweight, real-time capable                        Limited language support
Kaldi            ~10-20%           Highly customizable, accurate                               Complex setup, steep learning curve
Coqui STT        ~15-25%           Community-driven, multilingual                              Maintenance mode, limited updates

Setup Complexity

  • Whisper: Simple pip install, works out-of-the-box
  • Vosk: Easy download + pip install, minimal setup
  • Kaldi: Complex installation, requires technical expertise
  • Coqui STT: Moderate setup complexity

Performance Metrics

  • Whisper Turbo: 8x faster than Large with minimal accuracy loss
  • Vosk: Real-time performance on low-end hardware
  • Faster-Whisper: Up to 4x faster than original Whisper
  • Whisper.cpp: Optimized for CPU inference

Real-time Microphone Transcription

		import whisper
import pyaudio
import wave
import tempfile
import os
 
model = whisper.load_model("base")
 
# Audio recording parameters
CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
RECORD_SECONDS = 5
 
def record_audio():
    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
 
    print("Recording...")
    frames = []
 
    for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
        data = stream.read(CHUNK)
        frames.append(data)
 
    print("Finished recording.")
    stream.stop_stream()
    stream.close()
    p.terminate()
 
    return frames
 
def transcribe_audio(frames):
    # Save to temporary file
    with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as temp_audio:
        wf = wave.open(temp_audio.name, 'wb')
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(pyaudio.PyAudio().get_sample_size(FORMAT))
        wf.setframerate(RATE)
        wf.writeframes(b''.join(frames))
        wf.close()
 
        # Transcribe
        result = model.transcribe(temp_audio.name)
        os.unlink(temp_audio.name)
 
        return result["text"]
 
# Main loop
while True:
    input("Press Enter to start recording...")
    frames = record_audio()
    text = transcribe_audio(frames)
    print(f"Transcription: {text}\n")
	

Option 2: Faster Whisper

Faster implementation of Whisper with CTranslate2 - up to 4x faster with lower memory usage.

		# Install
pip install faster-whisper
 
# Usage
from faster_whisper import WhisperModel
 
model = WhisperModel("base", device="cpu", compute_type="int8")
 
segments, info = model.transcribe("audio.mp3", beam_size=5)
 
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
	

Option 3: Whisper.cpp

C++ implementation of Whisper for maximum performance.

		# Clone and build
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make
 
# Download models
bash ./models/download-ggml-model.sh base
 
# Transcribe
./main -m models/ggml-base.bin -f audio.wav
	

Option 4: Vosk (Lightweight Alternative)

Vosk is a lightweight offline speech recognition toolkit, great for resource-constrained systems.

		# Install
pip install vosk
 
# Download models from https://alphacephei.com/vosk/models
wget https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip
	
		from vosk import Model, KaldiRecognizer
import wave
import json
 
model = Model("vosk-model-small-en-us-0.15")
wf = wave.open("audio.wav", "rb")
rec = KaldiRecognizer(model, wf.getframerate())
 
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        result = json.loads(rec.Result())
        print(result["text"])
 
# Final result
print(json.loads(rec.FinalResult())["text"])
	

Option 5: Using Hugging Face Transformers

Direct access to Whisper models via Hugging Face.

		pip install transformers torch
	
		from transformers import pipeline
 
# Load Whisper model
transcriber = pipeline("automatic-speech-recognition",
                      model="openai/whisper-large-v3")
 
# Transcribe
result = transcriber("audio.mp3")
print(result["text"])
 
# With language specification
result = transcriber("audio.mp3",
                     generate_kwargs={"language": "english"})
	
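
For recordings longer than Whisper's 30-second window, the pipeline can process the audio in chunks and return timestamps. A short sketch reusing the transcriber defined above (the file name is illustrative):

		# Chunked long-form transcription with timestamps
result = transcriber("podcast.mp3",
                     chunk_length_s=30,
                     return_timestamps=True)

print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
	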

System-wide Integration

Creating a Global Transcribe Command

		#!/bin/bash
# Save this script as ~/bin/transcribe

if [ -z "$1" ]; then
    echo "Usage: transcribe <audio-file>"
    exit 1
fi

whisper "$1" --model base --output_format txt --output_dir "$(dirname "$1")"
	
		# Make executable
chmod +x ~/bin/transcribe
 
# Add to PATH (add to ~/.bashrc or ~/.zshrc)
export PATH="$HOME/bin:$PATH"
 
# Usage
transcribe recording.mp3
	

Dmenu/Rofi Integration for Quick Recording

		#!/bin/bash
# ~/bin/voice-note
 
RECORDINGS_DIR="$HOME/voice-notes"
mkdir -p "$RECORDINGS_DIR"
 
FILENAME="$RECORDINGS_DIR/note-$(date +%Y%m%d-%H%M%S).wav"
 
# Record audio
arecord -f cd -d 10 "$FILENAME"
 
# Transcribe (write the .txt next to the recording)
whisper "$FILENAME" --model base --output_format txt --output_dir "$RECORDINGS_DIR"
 
# Show notification
notify-send "Voice Note" "Transcription complete!"
	

Bind to a hotkey in your window manager for quick voice notes.
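
For example, an i3 binding might look like this (the $mod+v choice is arbitrary; adapt to your own window manager):

		# ~/.config/i3/config
bindsym $mod+v exec --no-startup-id ~/bin/voice-note
	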

GPU Acceleration

For NVIDIA GPUs, install CUDA support for significant speedup:

		# Install PyTorch with CUDA
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
 
# Whisper will automatically use GPU if available
whisper audio.mp3 --model large  # Will use GPU
	
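
To confirm that PyTorch actually sees the GPU before running a large model, a quick check:

		import torch

# True means Whisper can run on the GPU; otherwise it falls back to CPU
print(torch.cuda.is_available())
	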

Performance Analysis

Word Error Rates (WER) by Language

Whisper Large-v3 Performance:

  • English: ~5-8% WER
  • Spanish: ~8-12% WER
  • French: ~6-10% WER
  • German: ~8-15% WER
  • Japanese: ~10-20% WER
  • Chinese: ~15-25% WER

Comparative Analysis:

  • Whisper Large: ~95% accuracy (English), state-of-the-art multilingual
  • Whisper Medium: ~90% accuracy, good balance for most applications
  • Whisper Small: ~85% accuracy, fast and lightweight
  • Vosk: ~80% accuracy, excellent for real-time applications
  • Google Cloud Speech: ~95% accuracy (requires internet)
  • Azure Speech Services: ~90-95% accuracy (cloud-based)

Performance Benchmarks

Inference Speed (relative to Large model):

  • Tiny: ~32x faster (39M parameters)
  • Base: ~16x faster (74M parameters)
  • Small: ~6x faster (244M parameters)
  • Medium: ~2x faster (769M parameters)
  • Turbo: ~8x faster (809M parameters, optimized)

Memory Requirements:

  • Tiny/Base: ~1GB VRAM
  • Small: ~2GB VRAM
  • Medium: ~5GB VRAM
  • Large: ~10GB VRAM
  • Turbo: ~6GB VRAM

Multilingual Performance

Whisper's multilingual capabilities stem from training on 680,000 hours of data:

  • English: 65% of training data (highest accuracy)
  • Multilingual: 35% of training data (roughly 18% speech translation to English, 17% non-English transcription)
  • Supported Languages: 99 languages total
  • Translation: Direct translation from any supported language to English

Performance Optimization Tips

  1. Choose the right model - base or small for most use cases, turbo for speed
  2. Use Faster-Whisper - For production applications (up to 4x faster)
  3. Enable GPU - 10-20x speedup on NVIDIA GPUs with CUDA
  4. Batch processing - Process multiple files at once (see the sketch after this list)
  5. Use int8 quantization - With faster-whisper for lower memory usage
  6. Specify language - Explicit language setting improves accuracy
  7. Use turbo model - Optimized performance with minimal accuracy loss
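
As a small illustration of tips 2, 4, and 5 combined, the sketch below batch-transcribes a directory with faster-whisper and int8 quantization (the directory names and model size are placeholders; set device="cuda" if a GPU is available):

		from pathlib import Path
from faster_whisper import WhisperModel

# int8 quantization keeps memory usage low on CPU
model = WhisperModel("small", device="cpu", compute_type="int8")

Path("transcripts").mkdir(exist_ok=True)
for audio_file in sorted(Path("recordings").glob("*.mp3")):
    segments, info = model.transcribe(str(audio_file), language="en")
    text = " ".join(segment.text.strip() for segment in segments)
    (Path("transcripts") / f"{audio_file.stem}.txt").write_text(text)
    print(f"{audio_file.name}: done ({info.duration:.0f}s of audio)")
	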

Cost Analysis (Cloud Infrastructure)

For transcribing 1,000 hours of audio on GCP with an A100 GPU:

Model    Batch Size 1   Batch Size 4   Batch Size 16
Tiny     $15.60         $12.50         $11.70
Base     $23.40         $18.80         $17.50
Small    $54.70         $43.80         $40.90
Medium   $140.60        $112.50        $104.70
Large    $281.30        $225.00        $209.40
Turbo    $171.90        $137.50        $128.10

Costs are based on late 2022 GCP pricing and exclude headcount and infrastructure setup costs.

Use Cases

Meeting Transcription

		# Record meeting
arecord -f cd -d 3600 meeting.wav
 
# Transcribe with timestamps
whisper meeting.wav --output_format srt --model medium
	

YouTube Video Transcription

		# Download audio with yt-dlp
yt-dlp -x --audio-format mp3 "VIDEO_URL"
 
# Transcribe
whisper "video.mp3" --model base
	

Podcast Processing

		# Batch transcribe all podcast episodes
for file in podcasts/*.mp3; do
    whisper "$file" --model small --output_format txt
done
	

Live Captioning

Create a simple live captioning system for accessibility.
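
A minimal sketch of such a system, assuming the sounddevice library for microphone capture and faster-whisper for recognition (the chunk length and model size are arbitrary choices):

		import sounddevice as sd
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000
CHUNK_SECONDS = 5

model = WhisperModel("base", device="cpu", compute_type="int8")

print("Live captions (Ctrl+C to stop)")
while True:
    # Record one chunk of mono float32 audio from the default microphone
    audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()

    # Caption the chunk and print it
    segments, _ = model.transcribe(audio.flatten(), language="en", vad_filter=True)
    for segment in segments:
        print(segment.text.strip())
	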

Troubleshooting

Out of Memory Errors

		# Use smaller model
whisper audio.mp3 --model tiny
 
# Or use faster-whisper with int8
	

Poor Quality Transcription

		# Specify language
whisper audio.mp3 --language English --model medium
 
# Use larger model
whisper audio.mp3 --model large
	

Slow Performance

		# Use whisper.cpp or faster-whisper
# Enable GPU acceleration
# Use smaller model
	

Advanced Features and Techniques

Custom Model Fine-tuning

For domain-specific applications, fine-tune Whisper on your own dataset:

		# Install required packages
pip install datasets transformers accelerate
 
# Prepare your dataset (audio-text pairs)
# Then fine-tune
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from datasets import load_dataset
 
# Load pre-trained model
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
processor = WhisperProcessor.from_pretrained("openai/whisper-base")
 
# Fine-tune on custom dataset
# (Implementation details depend on your specific use case)
	
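
To make the placeholder above slightly more concrete, here is one way the data-preparation step might look, assuming an audiofolder-style dataset with a "transcription" column (the dataset layout, column names, and path are assumptions):

		from datasets import Audio, load_dataset
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-base")

# Hypothetical local dataset: audio files plus a metadata.csv with a
# "transcription" column, resampled to Whisper's expected 16 kHz
ds = load_dataset("audiofolder", data_dir="my_dataset")["train"]
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

def prepare(example):
    audio = example["audio"]
    example["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="np"
    ).input_features[0]
    example["labels"] = processor.tokenizer(example["transcription"]).input_ids
    return example

ds = ds.map(prepare, remove_columns=ds.column_names)
	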

Speaker Diarization

Combine Whisper with speaker identification for multi-speaker transcripts:

		# Install pyannote.audio for speaker diarization
pip install pyannote.audio
 
# Usage example
from pyannote.audio import Pipeline
from pyannote.core import Segment, Annotation
 
# Initialize speaker diarization pipeline (the pretrained pipeline is gated on
# Hugging Face; you may need to pass use_auth_token=<your HF token>)
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
 
# Process audio
diarization = pipeline("meeting.wav")
 
# Combine with Whisper transcription
# (Implementation would merge speaker segments with transcribed text)
	
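
One way the merge step could work is to transcribe with Whisper and label each segment with the speaker whose diarization turn contains its midpoint. A simplified sketch (overlapping speech is ignored, and the token argument is a placeholder):

		import whisper
from pyannote.audio import Pipeline

whisper_model = whisper.load_model("base")
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                    use_auth_token="YOUR_HF_TOKEN")

result = whisper_model.transcribe("meeting.wav")
diarization = pipeline("meeting.wav")

def speaker_at(t):
    # Return the speaker whose turn contains time t, if any
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return speaker
    return None

for segment in result["segments"]:
    midpoint = (segment["start"] + segment["end"]) / 2
    speaker = speaker_at(midpoint) or "UNKNOWN"
    print(f"[{speaker}] {segment['text'].strip()}")
	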

Real-time Streaming

For live audio streams, implement streaming transcription:

		import asyncio
import json

import numpy as np
import websockets
from faster_whisper import WhisperModel

class StreamingTranscriber:
    def __init__(self):
        self.model = WhisperModel("base", device="cpu", compute_type="int8")
        self.audio_buffer = []        # chunks of float32 samples at 16 kHz
        self.buffer_duration = 30     # seconds of audio to accumulate
        self.sample_rate = 16000

    def _buffer_to_audio(self):
        # Flatten the buffered chunks into one float32 array,
        # which faster-whisper accepts directly as input
        return np.concatenate(
            [np.asarray(chunk, dtype=np.float32) for chunk in self.audio_buffer]
        )

    async def transcribe_stream(self, websocket):
        async for message in websocket:
            # Each message is expected to carry a chunk of 16 kHz float samples
            audio_chunk = json.loads(message)["audio"]
            self.audio_buffer.append(audio_chunk)

            # Process once enough audio has accumulated
            buffered = sum(len(chunk) for chunk in self.audio_buffer)
            if buffered >= self.buffer_duration * self.sample_rate:
                segments, _ = self.model.transcribe(
                    self._buffer_to_audio(),
                    language="en",
                    beam_size=5,
                    vad_filter=True
                )

                # Send transcription back to the client
                for segment in segments:
                    await websocket.send(json.dumps({
                        "text": segment.text,
                        "start": segment.start,
                        "end": segment.end
                    }))

                self.audio_buffer = []  # Clear the buffer
	
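
A small usage sketch for the class above, serving on localhost (this assumes a recent websockets release where the handler receives only the connection object):

		async def main():
    transcriber = StreamingTranscriber()
    async with websockets.serve(transcriber.transcribe_stream, "localhost", 8765):
        await asyncio.Future()  # run until cancelled

asyncio.run(main())
	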

Integration with Linux Tools

Pipe Audio Through SoX for Preprocessing

		# Normalize audio levels before transcription
sox input.wav output.wav norm
 
# Remove silence
sox input.wav output.wav silence 1 0.1 1% reverse silence 1 0.1 1% reverse
 
# Convert sample rate
sox input.wav -r 16000 output.wav
 
# Chain preprocessing with Whisper via an intermediate file
sox input.wav -r 16000 -c 1 cleaned.wav norm
whisper cleaned.wav --model base
	

Cron Job for Automated Transcription

		# Add to crontab for daily transcription of recorded files
# crontab -e
# 0 2 * * * /home/user/transcribe_daily.sh
 
#!/bin/bash
# transcribe_daily.sh
RECORDINGS_DIR="/home/user/recordings"
OUTPUT_DIR="/home/user/transcripts"
mkdir -p "$OUTPUT_DIR" "${RECORDINGS_DIR}/processed"

for file in "$RECORDINGS_DIR"/*.wav; do
    if [ -f "$file" ]; then
        filename=$(basename "$file" .wav)
        whisper "$file" --model base --output_dir "$OUTPUT_DIR" --output_format txt
        mv "$file" "${RECORDINGS_DIR}/processed/"
    fi
done
	

Conclusion

AI speech-to-text on Linux has never been more accessible. With Whisper, you get state-of-the-art accuracy running completely offline on your machine. Whether you're transcribing meetings, processing podcasts, or building voice-controlled applications, these tools provide powerful capabilities with complete privacy and control.

Start with Whisper's base model for general use, and scale up to larger models or GPU acceleration as needed. The future of voice computing is open source and running on your Linux box.