Getting Started

Add streaming speech-to-text to your app in minutes using a single WebSocket endpoint.

Get Started

Replace <Your API key> with your router.audio API key or the API key of any provider you want to use. The example below uses your microphone as the audio source, but you can also stream pre-recorded audio or audio from any other source.

import asyncio
import websockets
import json
import sounddevice as sd
from urllib.parse import urlencode

async def transcribe():
    params = {
        "provider": "auto",
        "encoding": "pcm_s16le",
        "sample_rate": "16000",
    }
    async with websockets.connect(
        f"wss://api.router.audio/v1/listen?{urlencode(params)}",
        additional_headers={"x-api-key": "<Your API key>"}
    ) as ws:
        loop = asyncio.get_running_loop()
        # 16 kHz, mono, 16-bit PCM: matches the encoding and sample_rate parameters above
        mic = sd.RawInputStream(samplerate=16000, channels=1, dtype="int16")
        mic.start()

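        async def send_audio():
            while True:
                # mic.read blocks, so run it in a worker thread;
                # 4000 frames is 250 ms of audio at 16 kHz
                data, _ = await loop.run_in_executor(None, mic.read, 4000)
                await ws.send(bytes(data))  # binary frame of raw PCM

        async def recv_text():
            # Transcripts arrive as JSON text frames
            async for msg in ws:
                data = json.loads(msg)
                if data.get("type") == "transcript":
                    print(data["transcript"])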

        await asyncio.gather(send_audio(), recv_text())

asyncio.run(transcribe())
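
To stream pre-recorded audio instead of the microphone, one option is to split the raw PCM into fixed-size chunks and send each chunk as a binary frame. The following is a minimal sketch, assuming 16-bit mono PCM at 16 kHz as in the example above. `chunk_pcm` is a hypothetical helper written for illustration, not part of any router.audio SDK.

```python
def chunk_pcm(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 250,
              bytes_per_sample: int = 2):
    """Yield successive chunks of raw mono PCM, each covering chunk_ms of audio.

    Hypothetical helper: the final chunk may be shorter than the others.
    """
    frames_per_chunk = sample_rate * chunk_ms // 1000
    step = frames_per_chunk * bytes_per_sample
    for offset in range(0, len(pcm), step):
        yield pcm[offset:offset + step]


# One second of 16-bit silence at 16 kHz splits into four 250 ms chunks.
one_second = b"\x00" * 32000
chunks = list(chunk_pcm(one_second))
```

Inside `send_audio`, each chunk could be passed to `ws.send(chunk)`. Whether the service expects audio paced at real-time speed is not stated here, so sleeping for `chunk_ms` between sends is a conservative assumption rather than a documented requirement.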

Connecting

The connection is a standard WebSocket to wss://api.router.audio/v1/listen. You pass the provider, audio encoding, and sample rate as query parameters, and authenticate with your API key either as an api_key query parameter or an x-api-key header. router.audio then opens a WebSocket to the provider on your behalf.

See the API reference for the full list of parameters.
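
As an illustration of the two authentication options, the sketch below builds the connection URL from the query parameters described above and optionally appends the key as `api_key`. `build_listen_url` is a hypothetical helper, and the default values are simply the ones used in the example on this page.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.router.audio/v1/listen"


def build_listen_url(provider="auto", encoding="pcm_s16le",
                     sample_rate=16000, api_key=None):
    """Return the WebSocket URL with provider, encoding and sample rate set.

    Hypothetical helper: if api_key is given it is sent as a query parameter;
    otherwise the caller is expected to pass an x-api-key header instead.
    """
    params = {
        "provider": provider,
        "encoding": encoding,
        "sample_rate": str(sample_rate),
    }
    if api_key is not None:
        params["api_key"] = api_key
    return f"{BASE_URL}?{urlencode(params)}"


header_auth_url = build_listen_url()
query_auth_url = build_listen_url(api_key="<Your API key>")
```

Header authentication keeps the key out of the URL, which can matter if URLs end up in logs. The query parameter is useful for clients that cannot set custom headers on a WebSocket handshake.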

Streaming

Once connected, send raw audio as binary WebSocket frames continuously. Transcripts come back as JSON text frames in real time:

{
  "type": "transcript",
  "provider": "auto",
  "transcript": "Hello, how are you?",
  "start_time": 0.0,
  "end_time": 0.95,
  "is_partial": false,
  "words": [
    {
      "text": "hello",
      "start_time": 0.0,
      "end_time": 0.42,
      "speaker": null
    },
    {
      "text": "how",
      "start_time": 0.48,
      "end_time": 0.61,
      "speaker": null
    },
    {
      "text": "are",
      "start_time": 0.62,
      "end_time": 0.75,
      "speaker": null
    },
    {
      "text": "you",
      "start_time": 0.76,
      "end_time": 0.90,
      "speaker": null
    }
  ]
}

For the full response schema and optional features like partial results and diarization, see the API reference.
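
A receiver might separate partial from final transcripts and read the per-word timings. The sketch below relies only on the fields shown in the sample message above. `parse_transcript` is a hypothetical helper, and treating a missing `is_partial` field as final is an assumption.

```python
import json


def parse_transcript(raw: str):
    """Return (text, is_final, words) for a transcript frame, else None.

    Hypothetical helper: words is a list of (text, start_time, end_time) tuples.
    """
    msg = json.loads(raw)
    if msg.get("type") != "transcript":
        return None  # ignore any other message types
    words = [
        (w["text"], w["start_time"], w["end_time"])
        for w in msg.get("words", [])
    ]
    # Assumption: a frame without is_partial is treated as final.
    is_final = not msg.get("is_partial", False)
    return msg["transcript"], is_final, words


sample = json.dumps({
    "type": "transcript",
    "transcript": "Hello, how are you?",
    "is_partial": False,
    "words": [{"text": "hello", "start_time": 0.0, "end_time": 0.42, "speaker": None}],
})
result = parse_transcript(sample)
```

In `recv_text`, a caller could print only results where `is_final` is true, or overwrite the current line while partial results are still arriving.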