
How real-time interview transcription works

A plain-English walkthrough of how Smart Interview turns interview audio into transcripts in under a second — and why latency matters more than raw accuracy.

Real-time interview transcription is the ability to convert spoken interview audio into text fast enough that the candidate can read and react before the conversation moves on. It is different from offline transcription, which can take minutes to process a recording and is fine for archives but useless during a live call.

The pipeline has three stages. First, the desktop app captures system audio from the call platform — Zoom, Microsoft Teams, or Google Meet — using OS-level audio loopback. Second, the audio is streamed in small chunks (typically 100–250 ms) to a speech-to-text engine over a persistent WebSocket connection. Third, partial transcripts are returned as the engine processes each chunk; final transcripts arrive when the engine detects an end-of-utterance silence.
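The chunking in stage two is simple arithmetic: chunk duration times sample rate times bytes per sample. A minimal sketch, assuming 16 kHz 16-bit mono PCM and a 200 ms chunk (the function name and constants are illustrative, not Smart Interview's actual API):

```python
# Split a raw PCM buffer into ~200 ms chunks ready to stream over a WebSocket.
# Assumes 16 kHz, 16-bit (2-byte) mono audio; these are common speech-engine
# input formats, not confirmed Smart Interview settings.

SAMPLE_RATE = 16_000      # samples per second
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHUNK_MS = 200            # within the 100-250 ms range mentioned above

def chunk_pcm(buffer: bytes, chunk_ms: int = CHUNK_MS) -> list[bytes]:
    """Split raw PCM audio into fixed-duration chunks for streaming."""
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    return [buffer[i:i + chunk_bytes] for i in range(0, len(buffer), chunk_bytes)]

# One second of audio -> five 200 ms chunks of 6,400 bytes each.
one_second = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE)
chunks = chunk_pcm(one_second)
```

Smaller chunks mean the engine sees audio sooner, at the cost of more network overhead per second of speech; the 100–250 ms range is the usual compromise.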

The latency budget is tight. A candidate has roughly 800 ms before their pause becomes awkward. That means the entire round-trip — capture, network, recognition, render — has to fit inside that window. Smart Interview targets sub-second end-to-end latency.
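To see how the window gets spent, here is an illustrative per-stage budget. The individual numbers are assumptions for the sake of the arithmetic, not measured Smart Interview figures:

```python
# Hypothetical end-to-end latency budget for one round trip.
# Each figure is an assumed allocation, not a benchmark.
budget_ms = {
    "capture": 50,        # OS-level loopback buffering
    "network": 100,       # round trip to the speech-to-text engine
    "recognition": 350,   # engine's partial-result latency
    "render": 50,         # updating the on-screen transcript
}

total_ms = sum(budget_ms.values())
headroom_ms = 800 - total_ms  # slack before the pause turns awkward
```

The point of budgeting this way is that any single stage can quietly eat the whole window; a 500 ms recognition step alone leaves almost nothing for the network.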

Accuracy matters less than people assume. A 95% accurate transcript that arrives in 600 ms is more useful than a 99% accurate transcript that arrives in 4 seconds. The candidate can mentally fill in the missing 5% from context; they cannot rewind a real conversation.

The next step is generating an answer. Smart Interview feeds the rolling transcript, the candidate's resume, and the job description into a large language model and streams the response token-by-token into an invisible window. Streaming matters here for the same reason: the candidate starts reading the answer before generation finishes.
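The rendering side of that streaming can be sketched with a stand-in generator in place of the real LLM API (both function names here are hypothetical):

```python
# Sketch of token-by-token streaming into a window, assuming the LLM API
# exposes an iterator of tokens. fake_llm_stream stands in for that API.

def fake_llm_stream(answer: str):
    """Stand-in for a streaming LLM response: yields one token at a time."""
    for token in answer.split():
        yield token + " "

def render_stream(tokens) -> str:
    """Append each token as it arrives instead of waiting for the full answer."""
    rendered = ""
    for token in tokens:
        rendered += token
        # In the real app this is where the overlay window would repaint,
        # so the candidate starts reading mid-generation.
    return rendered

text = render_stream(fake_llm_stream("Walk me through your last project"))
```

The design choice mirrors the transcription pipeline: perceived latency is the time to the first readable token, not the time to the last one.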