Transcribe Video SDK sessions with Realtime Media Streams

With Realtime Media Streams (RTMS) now available for Video SDK, you can access per-participant audio streams over WebSockets and process them on your server. While RTMS also allows you to directly access the participant's transcript data, in this blog we'll showcase how to access real-time audio and transcribe it locally on your own server. You can learn more about Realtime Media Streams in our RTMS docs.

Prerequisites

  • Node.js LTS
  • A Zoom Video SDK account with universal credit enabled
  • A service to tunnel your local server to a public URL (like ngrok)

Enable RTMS for your Video SDK app

Before building, you need to configure event subscriptions in the Zoom App Marketplace to receive RTMS lifecycle events.

  1. Sign into the Zoom App Marketplace with your Video SDK credentials
  2. Navigate to Develop > Build Video SDK
  3. Under Add feature, enable Event Subscriptions
  4. Configure your subscription:
    • Add a descriptive name for your subscription
    • Add the RTMS Started and RTMS Stopped events
  5. Set your Event notification endpoint URL - this is where Zoom will send webhook events when RTMS sessions start and stop
  6. Save your configuration

You can use a service like ngrok to tunnel your local server to a public URL. Once it is installed, run ngrok http 3000 and use the public URL it prints as the base of your event notification endpoint. Once configured, your server will receive webhook payloads when participants trigger RTMS in a Video SDK session.

Building the application

You can find the completed project on GitHub. To follow along, create a new Node.js project and initialize a package.json file:

npm init -y

Create a .env file in the root directory with your Zoom credentials:

VITE_SDK_KEY=your_client_id
VITE_SDK_SECRET=your_client_secret
ZOOM_SECRET_TOKEN=your_webhook_secret_token
PORT=3000

Install the dependencies:

npm install express dotenv ws whisper-node

Download the Whisper model:

npx whisper-node download

At a high level, the flow looks like this:

  1. A webhook receives the RTMS start event from Zoom with connection details.
  2. Our server establishes a signaling WebSocket connection and authenticates.
  3. Upon successful handshake, we connect to the media WebSocket to receive audio data.
  4. Audio packets are buffered and transcribed locally using Whisper.

Sample app

We'll walk through the key components of the sample app to understand how RTMS works with Video SDK. In the repo, the server code lives under server/ and is organized into a few files:

server/
├── index.ts       # Express server, webhook handler, RTMS signaling/media sockets
├── util.ts        # Audio WAV conversion + transcript formatting
└── whisper.d.ts   # Whisper typings for TypeScript

Building the application

Let's walk through building the application step by step, following the structure of index.ts.

Setting up the Express server

First, we create our Express server and import the dependencies in server/index.ts.

import fs from "fs";
import http from "http";
import crypto from "crypto";
import dotenv from "dotenv";
import WebSocket from "ws";
import whisper from "whisper-node";
import express from "express";
import {
    bufferToWaveFile,
    formatTranscript,
    type SampleAudioPacket,
    type SampleTranscript,
} from "./util";
dotenv.config({ quiet: true });
const PORT = process.env.PORT || 3000;
const ZoomSecretToken = process.env.ZOOM_SECRET_TOKEN as string;
const ZoomClientId = process.env.VITE_SDK_KEY as string;
const ZoomClientSecret = process.env.VITE_SDK_SECRET as string;
const app = express();
app.use(express.json());

We read Zoom credentials from environment variables and use them throughout the server for webhook validation and RTMS signatures.

To access the data stream from RTMS we can either use the RTMS SDK or handle the webhook & WebSocket connections manually.

Using the SDK

You can use the rtms SDK to connect to RTMS streams. The SDK provides a simple interface to connect to RTMS streams and receive audio and transcript data.

import rtms from "@zoom/rtms";
rtms.onWebhookEvent(({ payload }) => {
    const client = new rtms.Client();
    client.setAudioParams({
        contentType: 2,
        codec: 1,
        sampleRate: 16000,
        channel: 1,
        dataOpt: 1,
        duration: 1000,
        frameSize: 16000,
    });
    client.onTranscriptData((data, size, timestamp, metadata) =>
        console.log(`${metadata.userName}: ${data}`),
    ); // transcript
    client.onAudioData((data) => console.log(data)); // ArrayBuffer of audio data
    client.join({ ...payload, client: ZoomClientId, secret: ZoomClientSecret });
});

You can directly access the transcript data from the onTranscriptData callback, but to transcribe the audio locally we'll use the onAudioData callback, which delivers the audio data as an ArrayBuffer. If you're using the SDK, you can skip to the Handling audio data for transcription section. Read on to learn how the webhooks and WebSockets work under the hood.

Handling webhook events

The server needs to listen for webhook events. We'll handle three scenarios: webhook validation, RTMS session start, and RTMS session stop.

Webhook validation

When you first configure your webhook URL in the Zoom Marketplace, Zoom validates it by sending a challenge:

app.post("/webhook", async (req, res) => {
    const { event, payload } = req.body;
    if (event === "endpoint.url_validation" && payload?.plainToken) {
        const hash = crypto
            .createHmac("sha256", ZoomSecretToken)
            .update(payload.plainToken)
            .digest("hex");
        return res.json({
            plainToken: payload.plainToken,
            encryptedToken: hash,
        });
    }
    res.sendStatus(200);
    // ... handle other events below
});

We hash the plainToken using our secret token and return it as encryptedToken. This proves we own the webhook endpoint.
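
The hashing step can be pulled out into a small pure function, which makes it easy to unit test without starting Express. This is a minimal sketch: buildUrlValidationResponse is a hypothetical helper name that is not part of the sample repo, and it takes the secret token as a parameter rather than reading it from the environment.

```typescript
import { createHmac } from "node:crypto";

// Shape of the JSON body returned for an endpoint.url_validation event.
interface UrlValidationResponse {
    plainToken: string;
    encryptedToken: string;
}

// Hypothetical helper (not in the sample repo): HMAC-SHA256 of the
// plainToken, keyed with the webhook secret token and hex-encoded.
export function buildUrlValidationResponse(
    plainToken: string,
    secretToken: string,
): UrlValidationResponse {
    const encryptedToken = createHmac("sha256", secretToken)
        .update(plainToken)
        .digest("hex");
    return { plainToken, encryptedToken };
}
```

Inside the route handler, the validation branch could then return res.json(buildUrlValidationResponse(payload.plainToken, ZoomSecretToken)).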

Starting an RTMS session

When someone starts RTMS in a Video SDK session, Zoom sends a session.rtms_started event with connection details:

if (event === "session.rtms_started") {
    const { session_id, rtms_stream_id, server_urls } = payload;
    console.log("Starting RTMS for session:", { payload });
    connectToSignalingWebSocket(session_id, rtms_stream_id, server_urls);
}
// ...

The payload contains:

  • session_id - Unique identifier for the Video SDK session
  • rtms_stream_id - Unique identifier for this RTMS stream
  • server_urls - WebSocket URL of the signaling server to connect to

We call connectToSignalingWebSocket with these values to establish the first WebSocket connection.
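
Since the webhook body is untrusted JSON, it can help to check the three fields before opening a socket. The sketch below is an assumption-laden illustration: isRtmsStartedPayload is a hypothetical helper, and it treats server_urls as a single URL string because that is how the sample passes it to the WebSocket constructor.

```typescript
// Fields this article reads from the session.rtms_started payload.
// Typing server_urls as a string is an assumption based on the sample code.
interface RtmsStartedPayload {
    session_id: string;
    rtms_stream_id: string;
    server_urls: string;
}

// Hypothetical type guard (not in the sample repo): returns true only when
// all three fields are present as strings and server_urls is non-empty.
export function isRtmsStartedPayload(
    value: unknown,
): value is RtmsStartedPayload {
    if (typeof value !== "object" || value === null) return false;
    const candidate = value as Record<string, unknown>;
    const sessionId = candidate["session_id"];
    const streamId = candidate["rtms_stream_id"];
    const serverUrls = candidate["server_urls"];
    return (
        typeof sessionId === "string" &&
        typeof streamId === "string" &&
        typeof serverUrls === "string" &&
        serverUrls.length > 0
    );
}
```

With this guard in place, the handler can skip malformed events instead of passing undefined values to connectToSignalingWebSocket.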

WebSocket connections

RTMS uses two WebSocket connections: one for signaling and one for media data. Here are the events we will use for our app:

msg_type   Name                         Description
1          SIGNALING_HAND_SHAKE_REQ     Signaling handshake request
2          SIGNALING_HAND_SHAKE_RESP    Signaling handshake response
3          DATA_HAND_SHAKE_REQ          Media handshake request
4          DATA_HAND_SHAKE_RESP         Media handshake response
7          CLIENT_READY_ACK             Client ready acknowledgement
12         KEEP_ALIVE_REQ               Keep-alive request
13         KEEP_ALIVE_RESP              Keep-alive response
14         AUDIO                        Audio data packet
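
The sample compares msg_type against bare numbers. One optional way to make those comparisons self-documenting is a constants object built from the table above. The names RtmsMsgType and msgTypeName are illustrative and not part of the sample repo or the RTMS SDK.

```typescript
// Message type values copied from the table above.
export const RtmsMsgType = {
    SIGNALING_HAND_SHAKE_REQ: 1,
    SIGNALING_HAND_SHAKE_RESP: 2,
    DATA_HAND_SHAKE_REQ: 3,
    DATA_HAND_SHAKE_RESP: 4,
    CLIENT_READY_ACK: 7,
    KEEP_ALIVE_REQ: 12,
    KEEP_ALIVE_RESP: 13,
    AUDIO: 14,
} as const;

export type RtmsMsgTypeValue = (typeof RtmsMsgType)[keyof typeof RtmsMsgType];

// Reverse lookup for log output; unknown values are reported rather than thrown.
export function msgTypeName(value: number): string {
    const entry = Object.entries(RtmsMsgType).find(([, v]) => v === value);
    return entry ? entry[0] : `UNKNOWN(${value})`;
}
```

A check such as msg.msg_type === RtmsMsgType.KEEP_ALIVE_REQ then reads the same as the table.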

Both connections require an HMAC-SHA256 signature for authentication. The message to sign has the format CLIENT_ID,session_id,rtms_stream_id and is signed with your client secret. This is implemented in server/index.ts:

function generateSignature(sessionID: string, streamId: string): string {
    const message = `${ZoomClientId},${sessionID},${streamId}`;
    return crypto
        .createHmac("sha256", ZoomClientSecret)
        .update(message)
        .digest("hex");
}

Signaling connection

The connectToSignalingWebSocket function (in server/index.ts) establishes the signaling connection and sends the initial handshake. Here's a simplified version:

function connectToSignalingWebSocket(session_id: string, rtmsStreamId: string, serverUrls: string) {
  // rejectUnauthorized: false skips TLS certificate verification; avoid it in production
  const signalingWs = new WebSocket(serverUrls, [], { rejectUnauthorized: false });
  signalingWs.on("open", () => {
    // Authenticate with the signaling server
    signalingWs.send(
      JSON.stringify({
        msg_type: 1, // SIGNALING_HAND_SHAKE_REQ
        meeting_uuid: session_id,
        session_id,
        rtms_stream_id: rtmsStreamId,
        signature: generateSignature(session_id, rtmsStreamId),
      }),
    );
  });
  signalingWs.on("message", (data) => {
    const msg = JSON.parse(data.toString());
    if (msg.msg_type === 12) { // KEEP_ALIVE_REQ
      signalingWs.send(
        JSON.stringify({
          msg_type: 13, // KEEP_ALIVE_RESP
          timestamp: msg.timestamp,
        }),
      );
    } else if (msg.msg_type === 2) { // SIGNALING_HAND_SHAKE_RESP
      if (msg.status_code === 0) {
        const mediaUrl = msg.media_server?.server_urls?.audio;
        if (!mediaUrl) return; // no audio media URL in the response
        connectToMediaWebSocket(
          mediaUrl,
          session_id,
          rtmsStreamId,
          signalingWs,
        );
      }
    }
  });
}

When the signaling handshake succeeds (receiving msg_type: 2 with status_code: 0), the signaling connection extracts the media server URL and calls connectToMediaWebSocket.

Media connection

After the signaling handshake, the app opens a second WebSocket to the media server. It sends a handshake message specifying which types of media data it wants to receive. The flow is very similar to the first:

function connectToMediaWebSocket(mediaUrl: string, session_id: string, rtmsStreamId: string, signalingSocket: WebSocket) {
  const mediaWs = new WebSocket(mediaUrl, [], { rejectUnauthorized: false });
  mediaWs.on("open", () => {
    const handshakeMsg = {
      msg_type: 3, // DATA_HAND_SHAKE_REQ
      protocol_version: 1,
      sequence: 0,
      meeting_uuid: session_id,
      rtms_stream_id: rtmsStreamId,
      signature: generateSignature(session_id, rtmsStreamId),
      media_type: 1, // AUDIO
      payload_encryption: false,
      media_params: {
        audio: {
          content_type: 1, //RTP
          sample_rate: 1, //16k
          channel: 1, //mono
          codec: 1, //L16
          data_opt: 1, //AUDIO_MIXED_STREAM
          send_rate: 1000, //in Milliseconds
        },
      },
    };
    mediaWs.on("message", (data) => {
      const msg = JSON.parse(data.toString());
      if (msg.msg_type === 14) { // AUDIO DATA
        if (msg.content?.data) {
          const { data: audioData } =
            msg.content as SampleAudioPacket["content"];
          const buffer = Buffer.from(audioData, "base64");
          void transcribeAudio(buffer);
        }
      } else if (msg.msg_type === 4 && msg.status_code === 0) { // DATA_HAND_SHAKE_RESP
        signalingSocket.send(
          JSON.stringify({
            msg_type: 7, // CLIENT_READY_ACK
            rtms_stream_id: rtmsStreamId,
          }),
        );
      } else if (msg.msg_type === 12) { // KEEP_ALIVE_REQ
        mediaWs.send(
          JSON.stringify({
            msg_type: 13, // KEEP_ALIVE_RESP
            timestamp: msg.timestamp,
          }),
        );
      }
    });
    mediaWs.send(JSON.stringify(handshakeMsg));
  });
}

We get the audio data in the message event of the media WebSocket. Audio is sent as uncompressed raw PCM (L16) data with a 16kHz sample rate and a single (mono) channel. The media_type: 1 value indicates we want only audio. Learn more: media_data_type.

Processing media messages

Incoming packets on the media WebSocket are handled inline in server/index.ts. The message handler responds to:

  • 4 (DATA_HAND_SHAKE_RESP): Media handshake response. If successful (status_code === 0), the server sends a CLIENT_READY_ACK (msg_type: 7) to the signaling socket.
  • 12 (KEEP_ALIVE_REQ): Media server keep-alive ping. The client must reply with KEEP_ALIVE_RESP (msg_type: 13) and the provided timestamp.
  • 14 (AUDIO): Audio packets encoded as base64 PCM data. The code decodes the payload, and passes the audio Buffer to transcribeAudio to build up the audio stream for transcription.

The handler sends the required acknowledgements and buffers audio for Whisper transcription. We'll walk through the audio processing utilities in the next section.
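
The audio branch of the handler can also be written as a standalone function, which keeps the decoding logic testable without a live socket. This is a sketch under the assumptions already stated in this article (JSON text frames, msg_type 14, base64 audio in content.data); decodeAudioMessage is a hypothetical name, not a function from the sample repo.

```typescript
const AUDIO_MSG_TYPE = 14; // AUDIO, per the message table in this article

// Hypothetical helper: returns the decoded PCM bytes for an audio message,
// or null for any other message type or for malformed input.
export function decodeAudioMessage(raw: string): Buffer | null {
    let msg: unknown;
    try {
        msg = JSON.parse(raw);
    } catch {
        return null; // not valid JSON
    }
    if (typeof msg !== "object" || msg === null) return null;
    const { msg_type, content } = msg as {
        msg_type?: unknown;
        content?: { data?: unknown } | null;
    };
    if (msg_type !== AUDIO_MSG_TYPE) return null;
    const data = content?.data;
    if (typeof data !== "string" || data.length === 0) return null;
    return Buffer.from(data, "base64");
}
```

In the message handler this would be used as const pcm = decodeAudioMessage(data.toString()), followed by a call to transcribeAudio when pcm is not null.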

Stopping an RTMS session

When the RTMS session ends, Zoom sends a session.rtms_stopped event, which is the place to clean up connections:

else if (event === 'session.rtms_stopped') {
  const { session_id } = payload;
  console.log(`Stopping RTMS for Video session ${session_id}`);
}

This sample logs the stop event; you can expand it to close sockets and clean up state as needed.
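
One possible way to do that cleanup is to keep a registry of open sockets keyed by session_id. The sketch below is not taken from the sample repo: activeSessions, registerSession, and stopSession are hypothetical names, and the Closable interface stands in for a ws WebSocket so the example has no external dependencies.

```typescript
// Minimal socket shape so the sketch does not depend on the ws package;
// a ws WebSocket instance satisfies it.
interface Closable {
    close(): void;
}

interface RtmsConnections {
    signaling?: Closable;
    media?: Closable;
}

// Hypothetical in-memory registry of open connections, keyed by session_id.
const activeSessions = new Map<string, RtmsConnections>();

export function registerSession(sessionId: string, conns: RtmsConnections): void {
    activeSessions.set(sessionId, conns);
}

// Closes both sockets for a session and removes it from the registry.
// Returns false when the session is unknown (for example, a duplicate stop event).
export function stopSession(sessionId: string): boolean {
    const conns = activeSessions.get(sessionId);
    if (!conns) return false;
    conns.media?.close();
    conns.signaling?.close();
    activeSessions.delete(sessionId);
    return true;
}
```

The session.rtms_stopped branch could then call stopSession(session_id) after logging. Any per-session audio buffers would need the same treatment.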

Handling audio data for transcription

When we receive audio packets (msg_type: 14), we need to convert them from raw PCM format to WAV format for Whisper. The audio processing utilities live in server/util.ts.

Converting PCM to WAV

RTMS sends audio as raw PCM (L16) data at 16kHz mono, base64-encoded, while the Whisper model requires a WAV file for transcription. We'll create a bufferToWaveFile function that wraps the raw buffer with a 44-byte WAV header and writes it to a file:

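import fs from "fs";
import path from "path";
// Wraps raw 16-bit, 16kHz mono PCM in a 44-byte WAV header and writes it to disk
export const bufferToWaveFile = (buffer: Buffer<ArrayBuffer>) => {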
    const wavePath = path.join(process.cwd(), `audio_${Date.now()}.wav`);
    const pcmData = buffer;
    const header = Buffer.alloc(44);
    const dataSize = pcmData.length;
    const fileSize = dataSize + 36;
    // RIFF chunk descriptor
    header.write("RIFF", 0);
    header.writeUInt32LE(fileSize, 4);
    header.write("WAVE", 8);
    // fmt sub-chunk: PCM format, mono, 16kHz, 16-bit
    header.write("fmt ", 12);
    header.writeUInt32LE(16, 16);
    header.writeUInt16LE(1, 20);
    header.writeUInt16LE(1, 22);
    header.writeUInt32LE(16000, 24);
    header.writeUInt32LE(32000, 28);
    header.writeUInt16LE(2, 32);
    header.writeUInt16LE(16, 34);
    // data sub-chunk
    header.write("data", 36);
    header.writeUInt32LE(dataSize, 40);
    fs.writeFileSync(wavePath, Buffer.concat([header, pcmData]));
    return wavePath;
};
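
To sanity-check the generated header, the same fields can be read back from their byte offsets. This is a minimal sketch that assumes the canonical 44-byte layout written by bufferToWaveFile; readWavHeader is a hypothetical helper, and it will not handle WAV files with extra chunks before the data chunk.

```typescript
interface WavHeaderInfo {
    audioFormat: number; // 1 = PCM
    channels: number;
    sampleRate: number;
    byteRate: number; // sampleRate * channels * bytes per sample
    blockAlign: number; // channels * bytes per sample
    bitsPerSample: number;
    dataSize: number; // PCM payload size in bytes
}

// Hypothetical helper: reads the fixed-offset fields of a 44-byte WAV header.
export function readWavHeader(wav: Buffer): WavHeaderInfo {
    if (wav.length < 44) {
        throw new Error("WAV buffer is shorter than the 44-byte header");
    }
    const isRiff = wav.toString("ascii", 0, 4) === "RIFF";
    const isWave = wav.toString("ascii", 8, 12) === "WAVE";
    if (!isRiff || !isWave) {
        throw new Error("Not a RIFF/WAVE buffer");
    }
    return {
        audioFormat: wav.readUInt16LE(20),
        channels: wav.readUInt16LE(22),
        sampleRate: wav.readUInt32LE(24),
        byteRate: wav.readUInt32LE(28),
        blockAlign: wav.readUInt16LE(32),
        bitsPerSample: wav.readUInt16LE(34),
        dataSize: wav.readUInt32LE(40),
    };
}
```

For the header built above, the expected values are a sample rate of 16000, a byte rate of 32000 (16000 samples x 2 bytes x 1 channel), and a block align of 2.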

Buffering and transcribing

Since transcribing short clips can produce imperfect results, we buffer ~5-10 seconds of audio before transcribing:

let transcriptBuffer = Buffer.alloc(0);
const transcribeAudio = async (buffer: Buffer<ArrayBuffer>) => {
    transcriptBuffer = Buffer.concat([transcriptBuffer, buffer]);
    if (transcriptBuffer.length >= 16000 * 10) { // 160,000 bytes, about 5 s of 16-bit 16kHz mono audio
        void getTranscriptFromBuffer(transcriptBuffer);
        transcriptBuffer = Buffer.alloc(0);
    }
};

This function accumulates audio buffers and, once transcriptBuffer holds 160,000 bytes (about 5 seconds of 16-bit, 16kHz mono audio at 32,000 bytes per second), passes it to getTranscriptFromBuffer, which writes the audio to a WAV file and runs Whisper transcription:

const getTranscriptFromBuffer = async (buffer: Buffer<ArrayBuffer>) => {
    const bufferCopy = Buffer.from(buffer);
    const wavePath = await bufferToWaveFile(bufferCopy);
    const transcript = await whisper(wavePath, {
        modelName: "base.en",
    });
    fs.unlinkSync(wavePath);
    console.log(transcript);
};

We're using the whisper-node (https://www.npmjs.com/package/whisper-node) bindings to generate the transcript.
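
The buffering threshold is expressed in bytes, not samples, so it is worth spelling out the conversion. With 16-bit samples (2 bytes each) at 16kHz mono, one second of audio is 32,000 bytes, which means the 16000 * 10 threshold used above corresponds to 5 seconds. The helper names below are illustrative and not part of the sample repo.

```typescript
const SAMPLE_RATE_HZ = 16000; // 16kHz, as requested in the media handshake
const BYTES_PER_SAMPLE = 2; // L16 = 16-bit linear PCM
const CHANNELS = 1; // mono

// Bytes of PCM audio produced per second: 16000 * 2 * 1 = 32000.
const BYTES_PER_SECOND = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS;

// Converts a PCM byte count to its duration in seconds.
export function pcmBytesToSeconds(byteLength: number): number {
    return byteLength / BYTES_PER_SECOND;
}

// Converts a target duration to the PCM byte count needed to hold it.
export function secondsToPcmBytes(seconds: number): number {
    return Math.round(seconds * BYTES_PER_SECOND);
}
```

To buffer a full 10 seconds before transcribing, the threshold would be secondsToPcmBytes(10), which is 320,000 bytes.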

Running the server

Finally, we start the HTTP server in index.ts:

const server = http.createServer(app);
server.listen(PORT, () => {
    console.log(`Server running at http://localhost:${PORT}`);
});

Start the RTMS session from your server

Now that the server is configured, you can use the REST API to start the RTMS streams:

fetch(`https://api.zoom.us/v2/videosdk/sessions/${sessionId}/rtms_app/status`, {
    method: "PATCH",
    headers: {
        "Content-Type": "application/json",
        Authorization: "Bearer YOUR_SECRET_TOKEN",
    },
    body: JSON.stringify({
        action: "start",
    }),
});

Alternatively, you can use the Video SDK RealTimeMediaStreamsClient object to start and stop the RTMS streams.

Conclusion

With RTMS and Video SDK, you can build server-side applications that process real-time media without running automated clients. Beyond local transcription, you could build real-time translation, speech analytics, or meeting summarization with your own AI models.

Check out the RTMS documentation and explore more samples in our rtms-samples repository.