Case Study
Voiux FSR API
Production-ready voice transcription for developers
Built a self-hosted voice transcription API powered by Whisper, offering developers a simple REST interface for converting audio to text with speaker diarization support.

Implementation timeline
API design
Week 1: Designed REST API endpoints, request/response schemas, and error handling patterns following OpenAPI best practices.
Core implementation
Week 2: Built FastAPI backend with Whisper integration, audio preprocessing pipeline, and speaker diarization module.
Deployment & docs
Week 3: Deployed to self-hosted server, added authentication, rate limiting, and comprehensive API documentation.
Introduction & Background
Voice transcription is a cornerstone feature for many modern applications—from meeting notes to content accessibility. But integrating speech-to-text often means choosing between expensive cloud APIs with privacy concerns or wrestling with complex ML pipelines. Voiux was built to give developers a simple, self-hosted alternative with enterprise-grade accuracy.
- Self-hosted means full data sovereignty
- Whisper delivers state-of-the-art transcription accuracy
Problem & Constraints
Developers need reliable voice transcription that works across different audio formats, handles multiple speakers, and doesn't require sending sensitive recordings to third-party cloud services. The solution had to be easy to integrate via standard REST API while supporting advanced features like speaker diarization.
- Must handle various audio formats (MP3, WAV, M4A, etc.)
- Speaker diarization required for meeting transcription use cases
Solution Overview
We built a FastAPI-based transcription service that wraps OpenAI's Whisper model with production-ready features. The API accepts audio uploads, processes them through a preprocessing pipeline (normalization, noise reduction), runs Whisper transcription, and optionally performs speaker diarization using Pyannote Audio.
- Clean REST API with JSON responses
- Speaker timestamps and confidence scores included in output
Implementation Journey
The system uses FFmpeg for audio preprocessing and format conversion. Whisper handles the core transcription with support for 99+ languages. For speaker diarization, we integrated Pyannote Audio to identify and label different speakers in the recording. FastAPI provides automatic OpenAPI documentation and async request handling.
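The FFmpeg step can be sketched as a thin subprocess wrapper. The function names are hypothetical, but the flags are standard FFmpeg options for producing the 16 kHz mono 16-bit PCM layout Whisper expects:

```python
import subprocess
from pathlib import Path


def normalization_command(src: Path, dst: Path) -> list[str]:
    """Build the ffmpeg invocation that converts any supported input
    to 16 kHz, mono, 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y", "-i", str(src),
        "-ac", "1",            # downmix to mono
        "-ar", "16000",        # resample to 16 kHz
        "-c:a", "pcm_s16le",   # encode as 16-bit PCM
        str(dst),
    ]


def normalize_audio(src: Path, dst: Path) -> Path:
    """Run the conversion; raises CalledProcessError on unreadable input."""
    subprocess.run(normalization_command(src, dst), check=True, capture_output=True)
    return dst
```

Keeping the command builder separate from the subprocess call makes the conversion easy to test without FFmpeg installed.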
- Audio preprocessing with FFmpeg for format normalization
- Whisper large-v3 model for maximum accuracy
- Speaker diarization with Pyannote Audio segmentation
- RESTful API with automatic FastAPI documentation
- Docker containerization for consistent deployment
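Combining the two model outputs comes down to aligning Whisper's timestamped segments with the diarization turns. A minimal sketch of one common approach, assuming Whisper-style segment dicts and `(start, end, speaker)` turn tuples (the helper name is ours, not pyannote's):

```python
def assign_speakers(segments, turns):
    """Label each transcription segment with the speaker whose
    diarization turn overlaps it the most.

    segments: list of dicts with "start", "end", "text" (Whisper-style).
    turns:    list of (start, end, speaker) tuples (diarization output).
    """
    labeled = []
    for seg in segments:
        best, best_overlap = "unknown", 0.0
        for start, end, speaker in turns:
            # Overlap is the length of the intersection of the two spans.
            overlap = min(seg["end"], end) - max(seg["start"], start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labeled.append({**seg, "speaker": best})
    return labeled
```

Maximal-overlap assignment tolerates the small boundary disagreements that inevitably occur between the two models.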
Results & Impact
The Voiux FSR API provides developers with a straightforward endpoint for voice transcription. Running on self-hosted infrastructure gives teams full control over their audio data while still benefiting from state-of-the-art ML models.
- Simple REST integration for any programming language
- Privacy-preserving self-hosted deployment
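The first point can be made concrete with nothing but the standard library. A minimal Python sketch, assuming the illustrative `/v1/transcriptions` path and bearer-token auth shown earlier (adapt both to your deployment):

```python
import urllib.request


def build_transcription_request(base_url: str, api_key: str,
                                audio: bytes, filename: str) -> urllib.request.Request:
    """Assemble a multipart/form-data upload for the transcription
    endpoint. The path and auth scheme here are illustrative."""
    boundary = "voiux-form-boundary"
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode() + audio + f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        f"{base_url}/v1/transcriptions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )

# Sending is a single call: urllib.request.urlopen(req) returns the JSON body.
```

Any HTTP client (curl, requests, fetch) works the same way, which is what makes the API usable from any language.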
Details we can still add
Answering these questions will let us enrich the case study with sharper narrative proof points and concrete outcomes.
- Scale: What is the maximum audio duration and concurrent request capacity?
- Languages: Which languages have been tested most extensively?
- Pricing: Is this offered as a service or only self-hosted deployment?
- Features: Are there plans for real-time streaming transcription?
Stack at a glance
- Python
- FastAPI
- OpenAI Whisper
- FFmpeg
- Pyannote Audio
- Docker
Add voice intelligence to your product
We help teams integrate speech-to-text, voice agents, and audio analytics. Bring us your use case—we'll design the pipeline.