
AI Speech-to-Text on Linux - Complete Guide

Comprehensive guide to setting up AI-powered speech-to-text on Linux using OpenAI Whisper, Vosk, and other tools for accurate audio transcription

Introduction

Speech-to-text technology has revolutionized how we interact with computers. With modern AI models like OpenAI Whisper, you can achieve near-human accuracy for transcription on your Linux machine. This guide covers multiple solutions from local AI models to cloud services, all running on Linux.

Why Use AI Speech-to-Text on Linux?

  • Privacy - Process audio locally without sending to cloud services
  • Offline capability - Work without internet connection
  • Cost-effective - No subscription fees for local models
  • Customization - Fine-tune models for specific domains
  • Integration - Easy integration with Linux workflows and scripts

Option 1: OpenAI Whisper

Whisper is OpenAI's open-source speech recognition model, with exceptional accuracy across multiple languages. It was trained on 680,000 hours of multilingual data, making it robust to accents, background noise, and technical language.

Key Features

  • Multilingual Support: 99 languages with automatic language detection
  • Multitask Model: Speech recognition, translation, and language identification
  • High Accuracy: Makes up to 50% fewer errors than many specialized models on out-of-distribution audio
  • Robust Performance: Works well with background noise and technical language
  • Translation: Can translate non-English speech to English

Installation

		# Install Python and pip if not already installed
sudo pacman -S python python-pip  # Arch
sudo apt install python3 python3-pip  # Ubuntu/Debian
 
# Install ffmpeg for audio processing
sudo pacman -S ffmpeg  # Arch
sudo apt install ffmpeg  # Ubuntu/Debian
 
# Install Whisper (on distros with PEP 668 "externally managed" Python,
# use a virtual environment or pipx instead of a bare pip install)
pip install -U openai-whisper
	

Available Models

Model    Parameters   English-only   Multilingual   Required VRAM   Relative Speed
tiny     39 M         tiny.en        tiny           ~1 GB           ~32x
base     74 M         base.en        base           ~1 GB           ~16x
small    244 M        small.en       small          ~2 GB           ~6x
medium   769 M        medium.en      medium         ~5 GB           ~2x
large    1550 M       N/A            large          ~10 GB          1x
turbo    809 M        N/A            turbo          ~6 GB           ~8x

The turbo model is an optimized version of large-v3 offering faster transcription with minimal accuracy loss.

Basic Usage

		# Transcribe an audio file
whisper audio.mp3
 
# Specify model size (tiny, base, small, medium, large, turbo)
whisper audio.mp3 --model medium
 
# Output to specific format
whisper audio.mp3 --output_format txt
 
# Transcribe with timestamps
whisper audio.mp3 --output_format srt
 
# Specify language for better accuracy
whisper audio.mp3 --language English
 
# Translate to English
whisper audio.mp3 --task translate
	

Python API

		import whisper
 
# Load model
model = whisper.load_model("turbo")
 
# Transcribe
result = model.transcribe("audio.mp3")
 
# Print result
print(result["text"])
 
# Get detailed segments with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")
	
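
The Python API also exposes Whisper's language identification, which is what powers the automatic language detection mentioned above. A minimal sketch following the pattern from the Whisper README (the file name is a placeholder):

		import whisper

model = whisper.load_model("base")

# Load 30 seconds of audio and compute the log-Mel spectrogram
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
	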

Hybrid Whisper-Vosk Real-Time Transcription

For applications requiring both speed and accuracy, consider a hybrid approach combining Whisper and Vosk. This method uses Vosk for fast real-time transcription with Whisper running in the background to correct errors.

How It Works

  1. Vosk provides real-time transcription via WebSocket for immediate feedback
  2. Whisper processes the same audio in the background with a short delay
  3. Compare outputs using Levenshtein distance to detect significant differences
  4. Automatically correct Vosk's output when Whisper disagrees

Implementation Example

		import asyncio
import json
import tempfile
import wave

import Levenshtein
import vosk
import whisper
from vosk import KaldiRecognizer

class HybridTranscriber:
    def __init__(self):
        # Vosk handles the fast, real-time pass
        self.vosk_model = vosk.Model("vosk-model-small-en-us")
        self.recognizer = KaldiRecognizer(self.vosk_model, 16000)

        # Whisper runs behind it as the accuracy pass
        self.whisper_model = whisper.load_model("base")

        # Raw 16-bit PCM chunks kept around for the Whisper pass
        self.audio_buffer = []
        self.correction_delay = 2.0  # seconds Whisper lags behind Vosk

    async def transcribe_with_corrections(self, audio_stream):
        vosk_text = ""

        while True:
            # Fast pass: wait for the next finalized Vosk phrase
            vosk_result = await self._vosk_transcribe(audio_stream)
            if vosk_result:
                vosk_text += vosk_result + " "
                print(f"VOSK: {vosk_result}")

            # Slow pass: re-transcribe the buffered audio with Whisper
            whisper_result = await self._whisper_correct()
            if whisper_result:
                # Compare the tail of the Vosk transcript with Whisper's output
                tail = vosk_text[-len(whisper_result):]
                distance = Levenshtein.distance(tail, whisper_result)
                if distance > len(whisper_result) * 0.3:  # 30% difference threshold
                    print(f"WHISPER CORRECTION: {tail!r} -> {whisper_result!r}")
                    vosk_text = vosk_text[:-len(tail)] + whisper_result

    async def _vosk_transcribe(self, audio_stream):
        # Feed PCM chunks to Vosk until it finalizes a phrase
        while True:
            data = await audio_stream.read(4000)
            self.audio_buffer.append(data)
            if self.recognizer.AcceptWaveform(data):
                result = json.loads(self.recognizer.Result())
                return result["text"]

    async def _whisper_correct(self):
        # Let Vosk stay ahead, then re-check the buffered audio with Whisper
        # (blocking call; a real implementation would off-load this to a thread)
        await asyncio.sleep(self.correction_delay)
        with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
            with wave.open(tmp.name, "wb") as wf:
                wf.setnchannels(1)
                wf.setsampwidth(2)      # 16-bit samples
                wf.setframerate(16000)
                wf.writeframes(b"".join(self.audio_buffer))
            result = self.whisper_model.transcribe(tmp.name)
        self.audio_buffer = []
        return result["text"]
	

This hybrid approach provides:

  • Immediate feedback from Vosk (real-time)
  • High accuracy corrections from Whisper (1-2 second delay)
  • Visual indicators when corrections are applied
  • Trust scoring to weigh model confidence (a simple sketch follows this list)
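
The trust-scoring idea is intentionally left simple here; a minimal sketch of one possible scheme, weighting Whisper more heavily when Vosk reports low word-level confidence (the 0.7 and 0.85 thresholds are arbitrary, and the word confidences assume KaldiRecognizer.SetWords(True) is enabled):

		import Levenshtein

def should_apply_correction(vosk_result, whisper_text):
    # vosk_result is the parsed JSON from KaldiRecognizer.Result() with
    # SetWords(True), so each word carries a "conf" score
    words = vosk_result.get("result", [])
    if not words:
        return bool(whisper_text)

    # Average Vosk word confidence: how much we trust the fast pass
    vosk_conf = sum(w["conf"] for w in words) / len(words)

    # Similarity between the two hypotheses (1.0 means identical)
    similarity = Levenshtein.ratio(vosk_result.get("text", ""), whisper_text)

    # Correct only when the outputs diverge and Vosk itself was not confident
    return similarity < 0.7 and vosk_conf < 0.85
	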

Model Comparison: Whisper vs Alternatives

Based on comprehensive benchmarks, here's how Whisper compares to other open-source transcription models:

Accuracy Comparison

Model            Word Error Rate   Strengths                                                   Limitations
Whisper Large    ~5-10%            State-of-the-art accuracy, multilingual, robust to noise   High resource requirements
Whisper Medium   ~10-15%           Good balance of accuracy/speed                              Still resource-intensive
Whisper Small    ~15-25%           Fast, good for most applications                            Lower accuracy on complex audio
Vosk             ~15-30%           Fast, lightweight, real-time capable                        Limited language support
Kaldi            ~10-20%           Highly customizable, accurate                               Complex setup, steep learning curve
Coqui STT        ~15-25%           Community-driven, multilingual                              Maintenance mode, limited updates

Setup Complexity

  • Whisper: Simple pip install, works out-of-the-box
  • Vosk: Easy download + pip install, minimal setup
  • Kaldi: Complex installation, requires technical expertise
  • Coqui STT: Moderate setup complexity

Performance Metrics

  • Whisper Turbo: 8x faster than Large with minimal accuracy loss
  • Vosk: Real-time performance on low-end hardware
  • Faster-Whisper: Up to 4x faster than original Whisper
  • Whisper.cpp: Optimized for CPU inference

Real-time Microphone Transcription

		import whisper
import pyaudio
import wave
import tempfile
import os
 
model = whisper.load_model("base")
 
# Audio recording parameters
CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
RECORD_SECONDS = 5
 
def record_audio():
    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
 
    print("Recording...")
    frames = []
 
    for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
        data = stream.read(CHUNK)
        frames.append(data)
 
    print("Finished recording.")
    stream.stop_stream()
    stream.close()
    p.terminate()
 
    return frames
 
def transcribe_audio(frames):
    # Save to temporary file
    with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as temp_audio:
        wf = wave.open(temp_audio.name, 'wb')
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(pyaudio.PyAudio().get_sample_size(FORMAT))
        wf.setframerate(RATE)
        wf.writeframes(b''.join(frames))
        wf.close()
 
        # Transcribe
        result = model.transcribe(temp_audio.name)
        os.unlink(temp_audio.name)
 
        return result["text"]
 
# Main loop
while True:
    input("Press Enter to start recording...")
    frames = record_audio()
    text = transcribe_audio(frames)
    print(f"Transcription: {text}\n")
	

Option 2: Faster Whisper

Faster implementation of Whisper with CTranslate2 - up to 4x faster with lower memory usage.

		# Install
pip install faster-whisper
 
# Usage
from faster_whisper import WhisperModel
 
model = WhisperModel("base", device="cpu", compute_type="int8")
 
segments, info = model.transcribe("audio.mp3", beam_size=5)
 
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
	

Option 3: Whisper.cpp

C++ implementation of Whisper for maximum performance.

		# Clone and build
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make
 
# Download models
bash ./models/download-ggml-model.sh base
 
# Transcribe
./main -m models/ggml-base.bin -f audio.wav
	

Option 4: Vosk (Lightweight Alternative)

Vosk is a lightweight offline speech recognition toolkit, great for resource-constrained systems.

		# Install
pip install vosk
 
# Download models from https://alphacephei.com/vosk/models
wget https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip
	
		from vosk import Model, KaldiRecognizer
import wave
import json
 
model = Model("vosk-model-small-en-us-0.15")
wf = wave.open("audio.wav", "rb")
rec = KaldiRecognizer(model, wf.getframerate())
 
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        result = json.loads(rec.Result())
        print(result["text"])
 
# Final result
print(json.loads(rec.FinalResult())["text"])
	

Option 5: Using Hugging Face Transformers

Direct access to Whisper models via Hugging Face.

		pip install transformers torch
	
		from transformers import pipeline
 
# Load Whisper model
transcriber = pipeline("automatic-speech-recognition",
                      model="openai/whisper-large-v3")
 
# Transcribe
result = transcriber("audio.mp3")
print(result["text"])
 
# With language specification
result = transcriber("audio.mp3",
                     generate_kwargs={"language": "english"})
	
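
For recordings longer than Whisper's 30-second window, the pipeline can process the audio in chunks and return timestamps. A short sketch reusing the transcriber defined above (the file name is illustrative):

		# Chunked long-form transcription with timestamps
result = transcriber("podcast.mp3",
                     chunk_length_s=30,
                     return_timestamps=True)

print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
	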

System-wide Integration

Creating a Global Transcribe Command

		#!/bin/bash
# Save this script as ~/bin/transcribe

if [ -z "$1" ]; then
    echo "Usage: transcribe <audio-file>"
    exit 1
fi

whisper "$1" --model base --output_format txt --output_dir "$(dirname "$1")"
	
		# Make executable
chmod +x ~/bin/transcribe
 
# Add to PATH (add to ~/.bashrc or ~/.zshrc)
export PATH="$HOME/bin:$PATH"
 
# Usage
transcribe recording.mp3
	

Dmenu/Rofi Integration for Quick Recording

		#!/bin/bash
# ~/bin/voice-note
 
RECORDINGS_DIR="$HOME/voice-notes"
mkdir -p "$RECORDINGS_DIR"
 
FILENAME="$RECORDINGS_DIR/note-$(date +%Y%m%d-%H%M%S).wav"
 
# Record audio
arecord -f cd -d 10 "$FILENAME"
 
# Transcribe (write the .txt next to the recording)
whisper "$FILENAME" --model base --output_format txt --output_dir "$RECORDINGS_DIR"
 
# Show notification
notify-send "Voice Note" "Transcription complete!"
	

Bind to a hotkey in your window manager for quick voice notes.
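
For example, an i3 binding might look like this (the $mod+v choice is arbitrary; adapt to your own window manager):

		# ~/.config/i3/config
bindsym $mod+v exec --no-startup-id ~/bin/voice-note
	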

GPU Acceleration

For NVIDIA GPUs, install CUDA support for significant speedup:

		# Install PyTorch with CUDA
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
 
# Whisper will automatically use GPU if available
whisper audio.mp3 --model large  # Will use GPU
	
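
To confirm that PyTorch actually sees the GPU before running a large model, a quick check:

		import torch

# True means Whisper can run on the GPU; otherwise it falls back to CPU
print(torch.cuda.is_available())
	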

Performance Analysis

Word Error Rates (WER) by Language

Whisper Large-v3 Performance:

  • English: ~5-8% WER
  • Spanish: ~8-12% WER
  • French: ~6-10% WER
  • German: ~8-15% WER
  • Japanese: ~10-20% WER
  • Chinese: ~15-25% WER

Comparative Analysis:

  • Whisper Large: ~95% accuracy (English), state-of-the-art multilingual
  • Whisper Medium: ~90% accuracy, good balance for most applications
  • Whisper Small: ~85% accuracy, fast and lightweight
  • Vosk: ~80% accuracy, excellent for real-time applications
  • Google Cloud Speech: ~95% accuracy (requires internet)
  • Azure Speech Services: ~90-95% accuracy (cloud-based)

Performance Benchmarks

Inference Speed (relative to Large model):

  • Tiny: ~32x faster (39M parameters)
  • Base: ~16x faster (74M parameters)
  • Small: ~6x faster (244M parameters)
  • Medium: ~2x faster (769M parameters)
  • Turbo: ~8x faster (809M parameters, optimized)

Memory Requirements:

  • Tiny/Base: ~1GB VRAM
  • Small: ~2GB VRAM
  • Medium: ~5GB VRAM
  • Large: ~10GB VRAM
  • Turbo: ~6GB VRAM

Multilingual Performance

Whisper's multilingual capabilities stem from training on 680,000 hours of data:

  • English: 65% of training data (highest accuracy)
  • Multilingual: 35% of training data (roughly 18% speech translation to English, 17% non-English transcription)
  • Supported Languages: 99 languages total
  • Translation: Direct translation from any supported language to English

Performance Optimization Tips

  1. Choose the right model - base or small for most use cases, turbo for speed
  2. Use Faster-Whisper - For production applications (up to 4x faster)
  3. Enable GPU - 10-20x speedup on NVIDIA GPUs with CUDA
  4. Batch processing - Process multiple files at once (see the sketch after this list)
  5. Use int8 quantization - With faster-whisper for lower memory usage
  6. Specify language - Explicit language setting improves accuracy
  7. Use turbo model - Optimized performance with minimal accuracy loss
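
As a small illustration of tips 2, 4, and 5 combined, the sketch below batch-transcribes a directory with faster-whisper and int8 quantization (the directory names and model size are placeholders; set device="cuda" if a GPU is available):

		from pathlib import Path
from faster_whisper import WhisperModel

# int8 quantization keeps memory usage low on CPU
model = WhisperModel("small", device="cpu", compute_type="int8")

Path("transcripts").mkdir(exist_ok=True)
for audio_file in sorted(Path("recordings").glob("*.mp3")):
    segments, info = model.transcribe(str(audio_file), language="en")
    text = " ".join(segment.text.strip() for segment in segments)
    (Path("transcripts") / f"{audio_file.stem}.txt").write_text(text)
    print(f"{audio_file.name}: done ({info.duration:.0f}s of audio)")
	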

Cost Analysis (Cloud Infrastructure)

For transcribing 1,000 hours of audio on GCP with an A100 GPU:

Model    Batch Size 1   Batch Size 4   Batch Size 16
Tiny     $15.60         $12.50         $11.70
Base     $23.40         $18.80         $17.50
Small    $54.70         $43.80         $40.90
Medium   $140.60        $112.50        $104.70
Large    $281.30        $225.00        $209.40
Turbo    $171.90        $137.50        $128.10

Costs are based on late 2022 GCP pricing and exclude headcount and infrastructure setup costs.

Use Cases

Meeting Transcription

		# Record meeting
arecord -f cd -d 3600 meeting.wav
 
# Transcribe with timestamps
whisper meeting.wav --output_format srt --model medium
	

YouTube Video Transcription

		# Download audio with yt-dlp
yt-dlp -x --audio-format mp3 "VIDEO_URL"
 
# Transcribe
whisper "video.mp3" --model base
	

Podcast Processing

		# Batch transcribe all podcast episodes
for file in podcasts/*.mp3; do
    whisper "$file" --model small --output_format txt
done
	

Live Captioning

Create a simple live captioning system for accessibility.
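
A minimal sketch of such a system, assuming the sounddevice library for microphone capture and faster-whisper for recognition (the chunk length and model size are arbitrary choices):

		import sounddevice as sd
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000
CHUNK_SECONDS = 5

model = WhisperModel("base", device="cpu", compute_type="int8")

print("Live captions (Ctrl+C to stop)")
while True:
    # Record one chunk of mono float32 audio from the default microphone
    audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()

    # Caption the chunk and print it
    segments, _ = model.transcribe(audio.flatten(), language="en", vad_filter=True)
    for segment in segments:
        print(segment.text.strip())
	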

Troubleshooting

Out of Memory Errors

		# Use smaller model
whisper audio.mp3 --model tiny
 
# Or use faster-whisper with int8
	

Poor Quality Transcription

		# Specify language
whisper audio.mp3 --language English --model medium
 
# Use larger model
whisper audio.mp3 --model large
	

Slow Performance

		# Use whisper.cpp or faster-whisper
# Enable GPU acceleration
# Use smaller model
	

Advanced Features and Techniques

Custom Model Fine-tuning

For domain-specific applications, fine-tune Whisper on your own dataset:

		# Install required packages
pip install datasets transformers accelerate
 
# Prepare your dataset (audio-text pairs)
# Then fine-tune
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from datasets import load_dataset
 
# Load pre-trained model
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
processor = WhisperProcessor.from_pretrained("openai/whisper-base")
 
# Fine-tune on custom dataset
# (Implementation details depend on your specific use case)
	
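
To make the placeholder above slightly more concrete, here is one way the data-preparation step might look, assuming an audiofolder-style dataset with a "transcription" column (the dataset layout, column names, and path are assumptions):

		from datasets import Audio, load_dataset
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-base")

# Hypothetical local dataset: audio files plus a metadata.csv with a
# "transcription" column, resampled to Whisper's expected 16 kHz
ds = load_dataset("audiofolder", data_dir="my_dataset")["train"]
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

def prepare(example):
    audio = example["audio"]
    example["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="np"
    ).input_features[0]
    example["labels"] = processor.tokenizer(example["transcription"]).input_ids
    return example

ds = ds.map(prepare, remove_columns=ds.column_names)
	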

Speaker Diarization

Combine Whisper with speaker identification for multi-speaker transcripts:

		# Install pyannote.audio for speaker diarization
pip install pyannote.audio
 
# Usage example
from pyannote.audio import Pipeline
from pyannote.core import Segment, Annotation
 
# Initialize speaker diarization pipeline (the pretrained pipeline is gated on
# Hugging Face; you may need to pass use_auth_token=<your HF token>)
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
 
# Process audio
diarization = pipeline("meeting.wav")
 
# Combine with Whisper transcription
# (Implementation would merge speaker segments with transcribed text)
	
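
One way the merge step could work is to transcribe with Whisper and label each segment with the speaker whose diarization turn contains its midpoint. A simplified sketch (overlapping speech is ignored, and the token argument is a placeholder):

		import whisper
from pyannote.audio import Pipeline

whisper_model = whisper.load_model("base")
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                    use_auth_token="YOUR_HF_TOKEN")

result = whisper_model.transcribe("meeting.wav")
diarization = pipeline("meeting.wav")

def speaker_at(t):
    # Return the speaker whose turn contains time t, if any
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return speaker
    return None

for segment in result["segments"]:
    midpoint = (segment["start"] + segment["end"]) / 2
    speaker = speaker_at(midpoint) or "UNKNOWN"
    print(f"[{speaker}] {segment['text'].strip()}")
	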

Real-time Streaming

For live audio streams, implement streaming transcription:

		import asyncio
import json

import numpy as np
import websockets
from faster_whisper import WhisperModel

class StreamingTranscriber:
    def __init__(self):
        self.model = WhisperModel("base", device="cpu", compute_type="int8")
        self.audio_buffer = []        # chunks of float32 samples at 16 kHz
        self.buffer_duration = 30     # seconds of audio to accumulate
        self.sample_rate = 16000

    def _buffer_to_audio(self):
        # Flatten the buffered chunks into one float32 array,
        # which faster-whisper accepts directly as input
        return np.concatenate(
            [np.asarray(chunk, dtype=np.float32) for chunk in self.audio_buffer]
        )

    async def transcribe_stream(self, websocket):
        async for message in websocket:
            # Each message is expected to carry a chunk of 16 kHz float samples
            audio_chunk = json.loads(message)["audio"]
            self.audio_buffer.append(audio_chunk)

            # Process once enough audio has accumulated
            buffered = sum(len(chunk) for chunk in self.audio_buffer)
            if buffered >= self.buffer_duration * self.sample_rate:
                segments, _ = self.model.transcribe(
                    self._buffer_to_audio(),
                    language="en",
                    beam_size=5,
                    vad_filter=True
                )

                # Send transcription back to the client
                for segment in segments:
                    await websocket.send(json.dumps({
                        "text": segment.text,
                        "start": segment.start,
                        "end": segment.end
                    }))

                self.audio_buffer = []  # Clear the buffer
	
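
A small usage sketch for the class above, serving on localhost (this assumes a recent websockets release where the handler receives only the connection object):

		async def main():
    transcriber = StreamingTranscriber()
    async with websockets.serve(transcriber.transcribe_stream, "localhost", 8765):
        await asyncio.Future()  # run until cancelled

asyncio.run(main())
	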

Integration with Linux Tools

Pipe Audio Through SoX for Preprocessing

		# Normalize audio levels before transcription
sox input.wav output.wav norm
 
# Remove silence
sox input.wav output.wav silence 1 0.1 1% reverse silence 1 0.1 1% reverse
 
# Convert sample rate
sox input.wav -r 16000 output.wav
 
# Chain preprocessing with Whisper via an intermediate file
sox input.wav -r 16000 -c 1 cleaned.wav norm
whisper cleaned.wav --model base
	

Cron Job for Automated Transcription

		# Add to crontab for daily transcription of recorded files
# crontab -e
# 0 2 * * * /home/user/transcribe_daily.sh
 
#!/bin/bash
# transcribe_daily.sh
RECORDINGS_DIR="/home/user/recordings"
OUTPUT_DIR="/home/user/transcripts"
mkdir -p "$OUTPUT_DIR" "${RECORDINGS_DIR}/processed"

for file in "$RECORDINGS_DIR"/*.wav; do
    if [ -f "$file" ]; then
        filename=$(basename "$file" .wav)
        whisper "$file" --model base --output_dir "$OUTPUT_DIR" --output_format txt
        mv "$file" "${RECORDINGS_DIR}/processed/"
    fi
done
	

Conclusion

AI speech-to-text on Linux has never been more accessible. With Whisper, you get state-of-the-art accuracy running completely offline on your machine. Whether you're transcribing meetings, processing podcasts, or building voice-controlled applications, these tools provide powerful capabilities with complete privacy and control.

Start with Whisper's base model for general use, and scale up to larger models or GPU acceleration as needed. The future of voice computing is open source and running on your Linux box.