File Transcription
Upload an audio file and receive a text transcription. This endpoint is compatible with the OpenAI Audio Transcription API.
POST /v1/audio/transcriptions
View the full OpenAPI reference for file transcription.
Supported formats: wav, mp3, flac, ogg, m4a, webm, up to 15 MB per file.
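It can be worth validating a file against these limits client-side before spending an upload on it. A minimal sketch; the extension list and the 15 MB cap come from the constraint above, everything else is illustrative:

```python
import os

ALLOWED_EXTENSIONS = {"wav", "mp3", "flac", "ogg", "m4a", "webm"}
MAX_BYTES = 15 * 1024 * 1024  # 15 MB per-file upload cap

def check_upload(path: str) -> None:
    """Raise ValueError if the file would be rejected by the endpoint."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported format: {ext!r}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("file exceeds the 15 MB limit")
```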
Request
```bash
curl -X POST "https://stt.freyavoice.ai/v1/audio/transcriptions" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@audio.wav" \
  -F "model=freya-stt" \
  -F "response_format=verbose_json"
```
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `file` | binary | required | The audio file to transcribe. |
| `model` | string | `freya-stt` | Model identifier. |
| `response_format` | string | `json` | One of `json`, `text`, or `verbose_json`. |
| `temperature` | number | `0.0` | Sampling temperature (0–1). Lower is more deterministic. |
json (default):

```json
{
  "text": "Merhaba, bu bir test konuşmasıdır.",
  "inference_time_ms": 342.5
}
```

verbose_json — includes detected language and word-level confidence scores:

```json
{
  "text": "Merhaba, bu bir test konuşmasıdır.",
  "language": "Turkish",
  "inference_time_ms": 342.5,
  "words": [
    { "word": "Merhaba,", "confidence": 0.98 },
    { "word": "bu", "confidence": 0.95 },
    { "word": "bir", "confidence": 0.97 },
    { "word": "test", "confidence": 0.92 },
    { "word": "konuşmasıdır.", "confidence": 0.89 }
  ]
}
```

text — returns plain text with no JSON wrapper:

```
Merhaba, bu bir test konuşmasıdır.
```
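The verbose_json payload is straightforward to post-process. For example, flagging words below a confidence threshold for human review — a sketch, where the 0.9 cutoff is an arbitrary choice, not an API recommendation:

```python
import json

def low_confidence_words(payload: str, threshold: float = 0.9) -> list:
    """Return words from a verbose_json response whose confidence is below threshold."""
    data = json.loads(payload)
    return [w["word"] for w in data.get("words", []) if w["confidence"] < threshold]

# Example verbose_json response body
response = """{
  "text": "Merhaba, bu bir test konuşmasıdır.",
  "language": "Turkish",
  "inference_time_ms": 342.5,
  "words": [
    {"word": "Merhaba,", "confidence": 0.98},
    {"word": "test", "confidence": 0.92},
    {"word": "konuşmasıdır.", "confidence": 0.89}
  ]
}"""
print(low_confidence_words(response))  # ['konuşmasıdır.']
```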
Real-Time Streaming (WebSocket)
For real-time transcription, connect via WebSocket and stream audio frames. The server uses LocalAgreement to emit stable (confirmed) words as they are recognized, plus tentative partial results.
Real-time streaming is currently in beta. The WebSocket endpoint is available at wss://stt.freyavoice.ai/v1/audio/stream.
Connection
wss://stt.freyavoice.ai/v1/audio/stream?token=YOUR_API_KEY
Pass your API key as the token query parameter. Non-browser clients can alternatively use the Authorization: Bearer YOUR_API_KEY header during the WebSocket handshake.
Wire Protocol
The session follows a simple three-phase protocol:
1. Configure — Send a JSON text frame with session settings:
```json
{
  "type": "config",
  "sample_rate": 16000
}
```
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `sample_rate` | integer | `16000` | Sample rate of the audio you'll send (resampled to 16 kHz internally). |
2. Stream audio — Send raw PCM audio as binary WebSocket frames.
- Format: 16-bit signed integer, little-endian, mono
- Send frames continuously as audio is captured (e.g. every 100–500 ms)
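If your capture pipeline yields float samples, they must be converted to this 16-bit little-endian format before sending. A standard-library-only sketch of that conversion:

```python
import struct

def floats_to_pcm16(samples) -> bytes:
    """Convert float samples in [-1.0, 1.0] to 16-bit signed little-endian PCM."""
    ints = [max(-32768, min(32767, int(s * 32768))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)

# Each sample becomes 2 bytes; send the result as one binary WebSocket frame.
frame = floats_to_pcm16([0.0, 0.5, -1.0])
```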
3. End session — Send the JSON text frame `{"type": "eof"}` to signal end-of-audio.
Server Messages
The server sends JSON text frames throughout the session:
| Type | Description | Fields |
| --- | --- | --- |
| `partial` | Tentative (not yet confirmed) transcription of the latest audio | `text` |
| `final` | Newly confirmed words — append these to your transcript | `text` |
| `result` | Final result after `eof` — the complete transcription | `text`, `word_confidences`, `language` |
| `error` | An error occurred | `message` |
partial:

```json
{ "type": "partial", "text": "merhaba nasıl" }
```

final:

```json
{ "type": "final", "text": "merhaba" }
```

result:

```json
{
  "type": "result",
  "text": "Merhaba, nasılsınız?",
  "word_confidences": [
    { "word": "Merhaba,", "confidence": 0.97 },
    { "word": "nasılsınız?", "confidence": 0.94 }
  ],
  "language": "Turkish"
}
```

error:

```json
{ "type": "error", "message": "Invalid config frame." }
```
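Handling these messages usually amounts to appending confirmed text and replacing the tentative tail each tick. A minimal client-side accumulator, independent of any WebSocket library (the class name and structure are illustrative, not part of the API):

```python
import json

class TranscriptAccumulator:
    """Builds a running transcript from streaming server messages."""

    def __init__(self):
        self.confirmed = []  # text from "final" messages, in order
        self.partial = ""    # latest tentative tail, replaced each tick

    def handle(self, raw: str) -> None:
        msg = json.loads(raw)
        if msg["type"] == "final":
            self.confirmed.append(msg["text"])
            self.partial = ""
        elif msg["type"] == "partial":
            self.partial = msg["text"]
        elif msg["type"] == "result":
            # Authoritative full transcript replaces everything accumulated so far
            self.confirmed = [msg["text"]]
            self.partial = ""

    @property
    def text(self) -> str:
        tail = [self.partial] if self.partial else []
        return " ".join(self.confirmed + tail)

acc = TranscriptAccumulator()
acc.handle('{"type": "final", "text": "merhaba"}')
acc.handle('{"type": "partial", "text": "nasıl"}')
print(acc.text)  # merhaba nasıl
```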
How LocalAgreement Works
The server re-transcribes the accumulated audio buffer on every tick (~500 ms). It compares consecutive hypotheses and emits as final only the words on which both agree. The unstable tail is sent as partial and may change on the next tick. This gives you low-latency confirmed words without flicker from hallucinated or revised words.
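The agreement step can be sketched as a longest-common-prefix comparison between the previous and current hypotheses. This is an illustration of the idea, not the server's actual implementation:

```python
def local_agreement(prev_hyp, curr_hyp):
    """Split the current hypothesis into (confirmed, tentative) word lists.

    Words are confirmed only while both hypotheses agree, i.e. along their
    longest common prefix; the rest stays tentative and may change.
    """
    confirmed = []
    for a, b in zip(prev_hyp, curr_hyp):
        if a != b:
            break
        confirmed.append(a)
    return confirmed, curr_hyp[len(confirmed):]

# Tick N heard "merhaba nasıl"; tick N+1 heard "merhaba nasılsınız bugün"
final, partial = local_agreement(["merhaba", "nasıl"],
                                 ["merhaba", "nasılsınız", "bugün"])
print(final, partial)  # ['merhaba'] ['nasılsınız', 'bugün']
```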
JavaScript Example
```javascript
const token = "YOUR_API_KEY";
const ws = new WebSocket(`wss://stt.freyavoice.ai/v1/audio/stream?token=${token}`);

ws.onopen = () => {
  // 1. Send config
  ws.send(JSON.stringify({ type: "config", sample_rate: 16000 }));

  // 2. Stream audio from the microphone
  // (ScriptProcessorNode is deprecated; an AudioWorklet is the modern alternative)
  navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
    const ctx = new AudioContext({ sampleRate: 16000 });
    const source = ctx.createMediaStreamSource(stream);
    const processor = ctx.createScriptProcessor(4096, 1, 1);
    processor.onaudioprocess = (e) => {
      // Convert float32 samples in [-1, 1] to 16-bit signed PCM
      const float32 = e.inputBuffer.getChannelData(0);
      const int16 = new Int16Array(float32.length);
      for (let i = 0; i < float32.length; i++) {
        int16[i] = Math.max(-32768, Math.min(32767, float32[i] * 32768));
      }
      ws.send(int16.buffer);
    };
    source.connect(processor);
    processor.connect(ctx.destination);
  });
};

let transcript = "";
ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === "final") {
    transcript += (transcript ? " " : "") + msg.text;
    console.log("Confirmed:", transcript);
  } else if (msg.type === "partial") {
    console.log("Tentative:", transcript + " " + msg.text);
  } else if (msg.type === "result") {
    console.log("Final result:", msg.text);
  }
};

// 3. When done recording:
// ws.send(JSON.stringify({ type: "eof" }));
```
Python Example
```python
import asyncio
import json
import wave

import websockets

API_KEY = "YOUR_API_KEY"

async def transcribe_stream(audio_path: str):
    uri = f"wss://stt.freyavoice.ai/v1/audio/stream?token={API_KEY}"
    async with websockets.connect(uri) as ws:
        # 1. Configure
        await ws.send(json.dumps({"type": "config", "sample_rate": 16000}))

        # 2. Stream audio in chunks
        with wave.open(audio_path, "rb") as wf:
            chunk_bytes = 8000  # 4000 frames = 250 ms at 16 kHz mono 16-bit
            while True:
                # readframes() counts frames, and each mono 16-bit frame is 2 bytes
                data = wf.readframes(chunk_bytes // 2)
                if not data:
                    break
                await ws.send(data)
                await asyncio.sleep(0.1)

        # 3. Signal end of audio
        await ws.send(json.dumps({"type": "eof"}))

        # 4. Collect results
        async for message in ws:
            msg = json.loads(message)
            if msg["type"] == "final":
                print(f"[confirmed] {msg['text']}")
            elif msg["type"] == "partial":
                print(f"[partial] {msg['text']}")
            elif msg["type"] == "result":
                print(f"\nFull transcript: {msg['text']}")
                break

asyncio.run(transcribe_stream("audio.wav"))
```