Transcribe Video SDK sessions with Realtime Media Streams
With Realtime Media Streams (RTMS) now available for Video SDK, you can access per-participant audio streams over WebSockets and process them on your server. While RTMS also allows you to directly access the participant's transcript data, in this blog we'll showcase how to access real-time audio and transcribe it locally on your own server. You can learn more about Realtime Media Streams in our RTMS docs.
Prerequisites
- Node.js LTS
- A Zoom Video SDK account with universal credit enabled
- A service to tunnel your local server to a public URL (like Ngrok)
Enable RTMS for your Video SDK app
Before building, you need to configure event subscriptions in the Zoom App Marketplace to receive RTMS lifecycle events.
- Sign into the Zoom App Marketplace with your Video SDK credentials
- Navigate to Develop → Build Video SDK
- Under Add feature, enable Event Subscriptions
- Configure your subscription:
- Add a descriptive name for your subscription
- Add the RTMS Started and RTMS Stopped events
- Set your Event notification endpoint URL - this is where Zoom will send webhook events when RTMS sessions start and stop
- Save your configuration
You can use a service like ngrok to tunnel your local server to a public URL. Once it is installed, run ngrok http 3000 to expose port 3000. Once the subscription is configured, your server will receive webhook payloads when participants trigger RTMS in a Video SDK session.
Building the application
You can find the completed project on GitHub. To follow along, create a new Node.js project and initialize a package.json file:
npm init -y
Create a .env file in the root directory with your Zoom credentials:
VITE_SDK_KEY=your_client_id
VITE_SDK_SECRET=your_client_secret
ZOOM_SECRET_TOKEN=your_webhook_secret_token
PORT=3000
Install the dependencies:
npm install express dotenv ws whisper-node
Download the Whisper model:
npx whisper-node download
At a high level, the flow looks like this:
- A webhook receives the RTMS start event from Zoom with connection details.
- Our server establishes a signaling WebSocket connection and authenticates.
- Upon successful handshake, we connect to the media WebSocket to receive audio data.
- Audio packets are buffered and transcribed locally using Whisper.
Sample app
We'll walk through the key components of the sample app to understand how RTMS works with Video SDK. In the repo, the server code lives under server/ and is organized into a few files:
server/
├── index.ts # Express server, webhook handler, RTMS signaling/media sockets
├── util.ts # Audio WAV conversion + transcript formatting
└── whisper.d.ts # Whisper typings for TypeScript
Building the application
Let's walk through building the application step by step, following the structure of index.ts.
Setting up the Express server
First, we create our Express server and import the dependencies in server/index.ts.
import fs from "fs";
import http from "http";
import crypto from "crypto";
import dotenv from "dotenv";
import WebSocket from "ws";
import whisper from "whisper-node";
import express from "express";
import {
bufferToWaveFile,
formatTranscript,
type SampleAudioPacket,
type SampleTranscript,
} from "./util";
dotenv.config({ quiet: true });
const PORT = process.env.PORT || 3000;
const ZoomSecretToken = process.env.ZOOM_SECRET_TOKEN as string;
const ZoomClientId = process.env.VITE_SDK_KEY as string;
const ZoomClientSecret = process.env.VITE_SDK_SECRET as string;
const app = express();
app.use(express.json());
We read Zoom credentials from environment variables and use them throughout the server for webhook validation and RTMS signatures.
To access the data stream from RTMS we can either use the RTMS SDK or handle the webhook & WebSocket connections manually.
Using the SDK
The @zoom/rtms SDK provides a simple interface for connecting to RTMS streams and receiving audio and transcript data.
import rtms from "@zoom/rtms";
rtms.onWebhookEvent(({ payload }) => {
const client = new rtms.Client();
client.setAudioParams({
contentType: 2,
codec: 1,
sampleRate: 16000,
channel: 1,
dataOpt: 1,
duration: 1000,
frameSize: 16000,
});
client.onTranscriptData((data, size, timestamp, metadata) =>
console.log(`${metadata.userName}: ${data}`),
); // transcript
client.onAudioData((data) => console.log(data)); // ArrayBuffer of audio data
client.join({ ...payload, client: ZoomClientId, secret: ZoomClientSecret });
});
You can directly access the transcript data from the onTranscriptData callback, but to transcribe the audio locally we'll use the onAudioData callback, which delivers the audio data as an ArrayBuffer. If you're using the SDK, you can skip to the Handling audio data for transcription section. Read on to learn how the webhooks and WebSockets work under the hood.
Handling webhook events
The server needs to listen for webhook events. We'll handle three scenarios: webhook validation, RTMS session start, and RTMS session stop.
Webhook validation
When you first configure your webhook URL in the Zoom Marketplace, Zoom validates it by sending a challenge:
app.post("/webhook", async (req, res) => {
const { event, payload } = req.body;
if (event === "endpoint.url_validation" && payload?.plainToken) {
const hash = crypto
.createHmac("sha256", ZoomSecretToken)
.update(payload.plainToken)
.digest("hex");
return res.json({
plainToken: payload.plainToken,
encryptedToken: hash,
});
}
res.sendStatus(200);
// ... handle other events below
});
We hash the plainToken using our secret token and return it as encryptedToken. This proves we own the webhook endpoint.
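To make the challenge response easier to test in isolation, the hashing step can be pulled into a small pure function. The sketch below assumes a hypothetical helper name, buildValidationResponse, and uses placeholder token values; it is not part of the sample app.

```typescript
import { createHmac } from "node:crypto";

// Builds the JSON body returned for an endpoint.url_validation event.
// `secretToken` stands in for ZOOM_SECRET_TOKEN (hypothetical helper, not in the sample app).
function buildValidationResponse(plainToken: string, secretToken: string) {
  const encryptedToken = createHmac("sha256", secretToken)
    .update(plainToken)
    .digest("hex");
  return { plainToken, encryptedToken };
}

// Placeholder values for illustration only.
const reply = buildValidationResponse("example-plain-token", "example-secret");
console.log(reply.encryptedToken.length); // 64 (hex characters in a SHA-256 digest)
```

Inside the route handler, the result could be passed straight to res.json.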
Starting an RTMS session
When someone starts RTMS in a Video SDK session, Zoom sends a session.rtms_started event with connection details:
if (event === "session.rtms_started") {
const { session_id, rtms_stream_id, server_urls } = payload;
console.log("Starting RTMS for session:", { payload });
connectToSignalingWebSocket(session_id, rtms_stream_id, server_urls);
}
// ...
The payload contains:
- session_id: Unique identifier for the Video SDK session
- rtms_stream_id: Unique identifier for this RTMS stream
- server_urls: Array of WebSocket URLs to connect to
We call connectToSignalingWebSocket with these values to establish the first WebSocket connection.
WebSocket connections
RTMS uses two WebSocket connections: one for signaling and one for media data. Here are the events we will use for our app:
| msg_type | Name | Description |
|---|---|---|
| 1 | SIGNALING_HAND_SHAKE_REQ | Signaling handshake request |
| 2 | SIGNALING_HAND_SHAKE_RESP | Signaling handshake response |
| 3 | DATA_HAND_SHAKE_REQ | Media handshake request |
| 4 | DATA_HAND_SHAKE_RESP | Media handshake response |
| 7 | CLIENT_READY_ACK | Client ready acknowledgement |
| 12 | KEEP_ALIVE_REQ | Keep-alive request |
| 13 | KEEP_ALIVE_RESP | Keep-alive response |
| 14 | AUDIO | Audio data packet |
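The handlers in this post compare msg_type against bare numbers. One way to keep those comparisons readable is to name the values from the table in a constant object. This is only a sketch: the RtmsMsgType and msgTypeName names are made up for illustration and do not come from the sample app or the RTMS SDK.

```typescript
// Message types from the table above, named so handlers can avoid magic numbers.
const RtmsMsgType = {
  SIGNALING_HAND_SHAKE_REQ: 1,
  SIGNALING_HAND_SHAKE_RESP: 2,
  DATA_HAND_SHAKE_REQ: 3,
  DATA_HAND_SHAKE_RESP: 4,
  CLIENT_READY_ACK: 7,
  KEEP_ALIVE_REQ: 12,
  KEEP_ALIVE_RESP: 13,
  AUDIO: 14,
} as const;

// Reverse lookup, handy for log lines.
function msgTypeName(value: number): string {
  const entry = Object.entries(RtmsMsgType).find(([, v]) => v === value);
  return entry ? entry[0] : `UNKNOWN(${value})`;
}

console.log(msgTypeName(12)); // KEEP_ALIVE_REQ
console.log(msgTypeName(99)); // UNKNOWN(99)
```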
Both connections require an HMAC-SHA256 signature for authentication using the format CLIENT_ID,session_id,rtms_stream_id signed with your client secret. This is implemented in utils/rtms.ts:
function generateSignature(sessionID: string, streamId: string): string {
const message = `${ZoomClientId},${sessionID},${streamId}`;
return crypto
.createHmac("sha256", ZoomClientSecret)
.update(message)
.digest("hex");
}
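For unit tests, it can help to have a variant that takes the credentials as arguments instead of reading module-level constants. The sketch below is such a hypothetical variant (signRtmsHandshake is not part of the sample app); it signs the same comma-separated message with placeholder values.

```typescript
import { createHmac } from "node:crypto";

// Signs "clientId,sessionId,streamId" with the client secret, like generateSignature,
// but with every input passed explicitly (hypothetical helper for testing).
function signRtmsHandshake(
  clientId: string,
  clientSecret: string,
  sessionId: string,
  streamId: string,
): string {
  return createHmac("sha256", clientSecret)
    .update(`${clientId},${sessionId},${streamId}`)
    .digest("hex");
}

// Placeholder values; real ones come from .env and the webhook payload.
const sigA = signRtmsHandshake("client-id", "client-secret", "session-1", "stream-1");
const sigB = signRtmsHandshake("client-id", "client-secret", "session-1", "stream-2");
console.log(sigA.length, sigA === sigB); // 64 false
```

Because the stream ID is part of the signed message, each RTMS stream gets its own signature.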
Signaling connection
The connectToSignalingWebSocket function (in server/index.ts) establishes the signaling connection and sends the initial handshake. Here's a simplified version:
function connectToSignalingWebSocket(session_id: string, rtmsStreamId: string, serverUrls: string) {
// rejectUnauthorized: false skips TLS certificate checks; avoid this outside local testing
const signalingWs = new WebSocket(serverUrls, [], { rejectUnauthorized: false });
signalingWs.on("open", () => {
signalingWs.send(
JSON.stringify({
msg_type: 1,
meeting_uuid: session_id,
session_id,
rtms_stream_id: rtmsStreamId,
signature: generateSignature(session_id, rtmsStreamId),
}))
});
signalingWs.on("message", (data) => {
const msg = JSON.parse(data.toString());
if (msg.msg_type === 12) { // KEEP_ALIVE_REQ
signalingWs.send(
JSON.stringify({
msg_type: 13, // KEEP_ALIVE_RESP
timestamp: msg.timestamp,
}),
);
} else if (msg.msg_type === 2) { // SIGNALING_HAND_SHAKE_RESP
if (msg.status_code === 0) {
const mediaUrl = msg.media_server?.server_urls?.audio;
connectToMediaWebSocket(
mediaUrl,
session_id,
rtmsStreamId,
signalingWs,
);
}
}
});
}
When the signaling handshake succeeds (receiving msg_type: 2 with status_code: 0), the signaling connection extracts the media server URL and calls connectToMediaWebSocket.
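The simplified handler above does nothing if the handshake fails or the audio URL is missing. A defensive version of the extraction step might look like the sketch below. The SignalingHandshakeResp type only covers the fields this post reads and is an assumption, as are the extractAudioMediaUrl name and the example URL.

```typescript
// Fields this post reads from SIGNALING_HAND_SHAKE_RESP (assumed shape, other fields may exist).
interface SignalingHandshakeResp {
  msg_type: number;
  status_code?: number;
  media_server?: { server_urls?: { audio?: string } };
}

// Returns the audio media URL, or null when the message is not a successful handshake response.
function extractAudioMediaUrl(raw: string): string | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const msg = parsed as SignalingHandshakeResp;
  if (msg.msg_type !== 2 || msg.status_code !== 0) return null;
  return msg.media_server?.server_urls?.audio ?? null;
}

const okResp = JSON.stringify({
  msg_type: 2,
  status_code: 0,
  media_server: { server_urls: { audio: "wss://media.example.test/audio" } },
});
console.log(extractAudioMediaUrl(okResp)); // wss://media.example.test/audio
console.log(extractAudioMediaUrl('{"msg_type":2,"status_code":1}')); // null
```

A null result is a natural place to log the failure and close the signaling socket instead of opening a media connection with an undefined URL.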
Media connection
After the signaling handshake, the app opens a second WebSocket to the media server. It sends a handshake message specifying which types of media data it wants to receive. The flow is very similar to the first:
function connectToMediaWebSocket(mediaUrl: string, session_id: string, rtmsStreamId: string, signalingSocket: WebSocket) {
const mediaWs = new WebSocket(mediaUrl, [], { rejectUnauthorized: false });
mediaWs.on("open", () => {
const handshakeMsg = {
msg_type: 3, // DATA_HAND_SHAKE_REQ
protocol_version: 1,
sequence: 0,
meeting_uuid: session_id,
rtms_stream_id: rtmsStreamId,
signature: generateSignature(session_id, rtmsStreamId),
media_type: 1, // AUDIO
payload_encryption: false,
media_params: {
audio: {
content_type: 1, //RTP
sample_rate: 1, //16k
channel: 1, //mono
codec: 1, //L16
data_opt: 1, //AUDIO_MIXED_STREAM
send_rate: 1000, //in Milliseconds
},
},
};
mediaWs.on("message", (data) => {
const msg = JSON.parse(data.toString());
if (msg.msg_type === 14) { // AUDIO DATA
if (msg.content?.data) {
const { data: audioData } =
msg.content as SampleAudioPacket["content"];
const buffer = Buffer.from(audioData, "base64");
void transcribeAudio(buffer);
}
} else if (msg.msg_type === 4 && msg.status_code === 0) { // DATA_HAND_SHAKE_RESP
signalingSocket.send(
JSON.stringify({
msg_type: 7, // CLIENT_READY_ACK
rtms_stream_id: rtmsStreamId,
}),
);
} else if (msg.msg_type === 12) { // KEEP_ALIVE_REQ
mediaWs.send(
JSON.stringify({
msg_type: 13, // KEEP_ALIVE_RESP
timestamp: msg.timestamp,
}),
);
}
});
mediaWs.send(JSON.stringify(handshakeMsg));
});
}
We receive the audio data in the message event handler of the media WebSocket. Audio data is sent as uncompressed raw PCM (L16) data with a 16kHz sample rate and a mono channel. The media_type: 1 value indicates we want only audio. Learn more: media_data_type.
Processing media messages
Incoming packets on the media WebSocket are handled inline in server/index.ts. The message handler responds to:
- 4 (DATA_HAND_SHAKE_RESP): Media handshake response. If successful (status_code === 0), the server sends a CLIENT_READY_ACK (msg_type: 7) to the signaling socket.
- 12 (KEEP_ALIVE_REQ): Media server keep-alive ping. The client must reply with KEEP_ALIVE_RESP (msg_type: 13) and the provided timestamp.
- 14 (AUDIO): Audio packets encoded as base64 PCM data. The code decodes the payload and passes the audio Buffer to transcribeAudio to build up the audio stream for transcription.
The handler sends the required acknowledgements and buffers audio for Whisper transcription. We'll walk through the audio processing utilities in the next section.
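The three branches can also be expressed as a pure function that maps a raw message to the action the caller should take, which keeps the socket code thin and makes the logic testable without a live connection. This is a sketch under that assumption; classifyMediaMessage and the MediaAction type are illustrative names, not part of the sample app.

```typescript
import { Buffer } from "node:buffer";

type MediaAction =
  | { kind: "ready" } // caller sends CLIENT_READY_ACK on the signaling socket
  | { kind: "keepAlive"; reply: string } // caller sends `reply` on the media socket
  | { kind: "audio"; pcm: Buffer } // decoded PCM bytes for transcription
  | { kind: "ignore" };

function classifyMediaMessage(raw: string): MediaAction {
  const msg = JSON.parse(raw);
  switch (msg.msg_type) {
    case 4: // DATA_HAND_SHAKE_RESP
      return msg.status_code === 0 ? { kind: "ready" } : { kind: "ignore" };
    case 12: // KEEP_ALIVE_REQ: echo the timestamp back as KEEP_ALIVE_RESP
      return {
        kind: "keepAlive",
        reply: JSON.stringify({ msg_type: 13, timestamp: msg.timestamp }),
      };
    case 14: // AUDIO: payload is base64-encoded PCM
      return typeof msg.content?.data === "string"
        ? { kind: "audio", pcm: Buffer.from(msg.content.data, "base64") }
        : { kind: "ignore" };
    default:
      return { kind: "ignore" };
  }
}

// A made-up audio packet carrying four bytes.
const packet = JSON.stringify({
  msg_type: 14,
  content: { data: Buffer.from([1, 2, 3, 4]).toString("base64") },
});
const action = classifyMediaMessage(packet);
console.log(action.kind); // audio
```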
Stopping an RTMS session
When the RTMS session ends, Zoom sends a session.rtms_stopped event, which is where we clean up our connections:
else if (event === 'session.rtms_stopped') {
const { session_id } = payload;
console.log(`Stopping RTMS for Video session ${session_id}`);
}
This sample logs the stop event; you can expand it to close sockets and clean up state as needed.
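One possible way to handle that cleanup is to keep a map from session ID to its open sockets and close them when the stop event arrives. The sketch below uses a minimal Closable interface so it runs without the ws package (a ws WebSocket satisfies it through its close method); activeStreams, trackSockets, and stopSession are hypothetical names.

```typescript
// Minimal socket shape for this sketch; a ws WebSocket has a compatible close().
interface Closable {
  close(): void;
}

// Open sockets per Video SDK session (hypothetical registry, not in the sample app).
const activeStreams = new Map<string, Closable[]>();

function trackSockets(sessionId: string, ...sockets: Closable[]): void {
  const existing = activeStreams.get(sessionId) ?? [];
  activeStreams.set(sessionId, [...existing, ...sockets]);
}

// Closes every socket tracked for the session and returns how many were closed.
function stopSession(sessionId: string): number {
  const sockets = activeStreams.get(sessionId) ?? [];
  for (const socket of sockets) socket.close();
  activeStreams.delete(sessionId);
  return sockets.length;
}

// Fake sockets stand in for the signaling and media connections.
let closedCount = 0;
const fakeSocket: Closable = { close: () => { closedCount += 1; } };
trackSockets("session-1", fakeSocket, fakeSocket);
console.log(stopSession("session-1"), closedCount); // 2 2
```

In the sample, trackSockets would be called once the signaling and media sockets are created, and stopSession from the session.rtms_stopped branch.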
Handling audio data for transcription
When we receive audio packets (msg_type: 14), we need to convert them from raw PCM format to WAV format for Whisper. The audio processing utilities live in server/util.ts.
Converting PCM to WAV
RTMS sends audio as raw PCM (L16) data at 16kHz mono in base64-encoded format, while the Whisper model requires a WAV file for transcription. We'll create a bufferToWaveFile function that wraps the raw buffer with a 44-byte WAV header and writes it to a file:
import fs from "fs";
import path from "path";
export const bufferToWaveFile = (buffer: Buffer<ArrayBuffer>) => {
const wavePath = path.join(process.cwd(), `audio_${Date.now()}.wav`);
const pcmData = buffer;
const header = Buffer.alloc(44);
const dataSize = pcmData.length;
const fileSize = dataSize + 36;
// RIFF chunk descriptor
header.write("RIFF", 0);
header.writeUInt32LE(fileSize, 4);
header.write("WAVE", 8);
// fmt sub-chunk: PCM format, mono, 16kHz, 16-bit
header.write("fmt ", 12);
header.writeUInt32LE(16, 16);
header.writeUInt16LE(1, 20);
header.writeUInt16LE(1, 22);
header.writeUInt32LE(16000, 24);
header.writeUInt32LE(32000, 28);
header.writeUInt16LE(2, 32);
header.writeUInt16LE(16, 34);
// data sub-chunk
header.write("data", 36);
header.writeUInt32LE(dataSize, 40);
fs.writeFileSync(wavePath, Buffer.concat([header, pcmData]));
return wavePath;
};
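The header fields above (16,000 samples per second, one channel, 16 bits per sample) imply a byte rate of 32,000 bytes per second, which is the value written at offset 28. That gives a quick way to convert a PCM buffer size into a duration. The helper names below are illustrative only.

```typescript
// Bytes per second for uncompressed PCM: sampleRate * channels * bytesPerSample.
function pcmByteRate(sampleRate = 16000, channels = 1, bitsPerSample = 16): number {
  return sampleRate * channels * (bitsPerSample / 8);
}

// Duration in seconds of a raw PCM buffer of the given size.
function pcmDurationSeconds(byteLength: number, byteRate = pcmByteRate()): number {
  return byteLength / byteRate;
}

console.log(pcmByteRate()); // 32000
console.log(pcmDurationSeconds(160000)); // 5
console.log(pcmDurationSeconds(320000)); // 10
```

The same arithmetic is useful when picking a buffer threshold in bytes for a target number of seconds.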
Buffering and transcribing
Since transcribing short clips can produce imperfect results, we buffer ~5-10 seconds of audio before transcribing:
let transcriptBuffer = Buffer.alloc(0);
const transcribeAudio = async (buffer: Buffer<ArrayBuffer>) => {
transcriptBuffer = Buffer.concat([transcriptBuffer, buffer]);
if (transcriptBuffer.length >= 16000 * 10) {
void getTranscriptFromBuffer(transcriptBuffer);
transcriptBuffer = Buffer.alloc(0);
}
};
This function accumulates audio buffers. Once transcriptBuffer reaches 160,000 bytes, which is about 5 seconds of 16-bit mono audio at 16kHz (32,000 bytes per second), it calls getTranscriptFromBuffer, which converts the buffer to a WAV file and runs Whisper transcription:
const getTranscriptFromBuffer = async (buffer: Buffer<ArrayBuffer>) => {
const bufferCopy = Buffer.from(buffer);
const wavePath = await bufferToWaveFile(bufferCopy);
const transcript = await whisper(wavePath, {
modelName: "base.en",
});
fs.unlinkSync(wavePath);
console.log(transcript);
};
We're using the [whisper-node](https://www.npmjs.com/package/whisper-node) bindings to generate the transcript.
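The sample keeps a single module-level buffer, so any audio left below the threshold when a session ends is never transcribed. One possible refinement is a small accumulator object with an explicit flush, created per session. This is a sketch, not the sample's implementation; AudioAccumulator is a hypothetical class and the byte counts in the usage are deliberately tiny.

```typescript
import { Buffer } from "node:buffer";

// Collects PCM chunks and hands them to `onChunk` once `thresholdBytes` is reached.
class AudioAccumulator {
  private chunks: Buffer[] = [];
  private size = 0;
  private readonly thresholdBytes: number;
  private readonly onChunk: (pcm: Buffer) => void;

  constructor(thresholdBytes: number, onChunk: (pcm: Buffer) => void) {
    this.thresholdBytes = thresholdBytes;
    this.onChunk = onChunk;
  }

  push(pcm: Buffer): void {
    this.chunks.push(pcm);
    this.size += pcm.length;
    if (this.size >= this.thresholdBytes) this.flush();
  }

  // Emits whatever is buffered, e.g. when the RTMS session stops.
  flush(): void {
    if (this.size === 0) return;
    const merged = Buffer.concat(this.chunks);
    this.chunks = [];
    this.size = 0;
    this.onChunk(merged);
  }
}

const flushed: number[] = [];
const acc = new AudioAccumulator(8, (pcm) => { flushed.push(pcm.length); });
acc.push(Buffer.alloc(5));
acc.push(Buffer.alloc(5)); // crosses the 8-byte threshold, emits 10 bytes
acc.push(Buffer.alloc(3));
acc.flush(); // emits the 3-byte tail
console.log(flushed); // [ 10, 3 ]
```

With real audio, the threshold would be a number of seconds multiplied by 32,000 bytes, and onChunk would call the transcription function.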
Running the server
Finally, we start the HTTP server in index.ts:
const server = http.createServer(app);
server.listen(PORT, () => {
console.log(`Server running at http://localhost:${PORT}`);
});
Start the RTMS session from your server
Now that the server is configured, you can use the REST API to start the RTMS streams:
fetch(`https://api.zoom.us/v2/videosdk/sessions/${sessionId}/rtms_app/status`, {
method: "PATCH",
headers: {
"Content-Type": "application/json",
Authorization: "Bearer YOUR_SECRET_TOKEN",
},
body: JSON.stringify({
action: "start",
}),
});
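If you call this endpoint from several places, it may be convenient to build the request in one helper. The sketch below reuses the URL, method, and body from the request above. The buildRtmsStatusRequest name is hypothetical, the "stop" action is an assumed counterpart to "start", and URL-encoding the session ID is a precaution added here rather than something taken from the sample.

```typescript
// "start" comes from the request above; "stop" is an assumption for illustration.
type RtmsAction = "start" | "stop";

// Builds the URL and fetch options for changing a session's RTMS status.
function buildRtmsStatusRequest(sessionId: string, action: RtmsAction, token: string) {
  const url = `https://api.zoom.us/v2/videosdk/sessions/${encodeURIComponent(sessionId)}/rtms_app/status`;
  return {
    url,
    init: {
      method: "PATCH",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${token}`,
      },
      body: JSON.stringify({ action }),
    },
  };
}

// Placeholder session ID and token.
const request = buildRtmsStatusRequest("abc/def==", "start", "YOUR_SECRET_TOKEN");
console.log(request.url);
// https://api.zoom.us/v2/videosdk/sessions/abc%2Fdef%3D%3D/rtms_app/status
```

The result can be passed to fetch(request.url, request.init); checking res.ok on the response is a simple way to surface failed calls.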
Alternatively, you can also use the Video SDK RealTimeMediaStreamsClient object to start/stop the RTMS streams.
Conclusion
With RTMS and Video SDK, you can build server-side applications that process real-time media without running automated clients. Beyond local transcription, you could build real-time translation, speech analytics, or meeting summarization with your own AI models.
Check out the RTMS documentation and explore more samples in our rtms-samples repository.