# Transcribe Video SDK sessions with Realtime Media Streams

With Realtime Media Streams (RTMS) [now available for Video SDK](/blog/realtime-media-streams-video-sdk), you can access per-participant audio streams over WebSockets and process them on your server. While RTMS also allows you to directly access the participant's transcript data, in this blog we'll showcase how to access real-time audio and transcribe it locally on your own server. You can learn more about Realtime Media Streams in our [RTMS docs](/docs/rtms/video-sdk/).

## Prerequisites

-   Node.js LTS
-   A Zoom Video SDK account with universal credit enabled
-   A service to tunnel your local server to a public URL (like [Ngrok](https://ngrok.com/))

## Enable RTMS for your Video SDK app

Before building, you need to configure event subscriptions in the Zoom App Marketplace to receive RTMS lifecycle events.

1. Sign in to the [Zoom App Marketplace](https://marketplace.zoom.us/) with your Video SDK credentials
2. Navigate to **Develop** → **Build Video SDK**
3. Under **Add feature**, enable **Event Subscriptions**
4. Configure your subscription:
    - Add a descriptive name for your subscription
    - Add the **RTMS Started** and **RTMS Stopped** events
5. Set your **Event notification endpoint URL** - this is where Zoom will send webhook events when RTMS sessions start and stop
6. Save your configuration

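You can use a service like ngrok to tunnel your local server to a public URL. Once it's installed, run `ngrok http 3000` and use the public URL it prints as your event notification endpoint (the sample handles webhooks at the `/webhook` path). With that configured, your server will receive webhook payloads when participants trigger RTMS in a Video SDK session.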

## Building the application

You can find the completed project on [GitHub](https://github.com/EkaanshArora/videosdk-rtms-transcribe-audio). To follow along, create a new Node.js project and initialize a `package.json` file:

```shell
npm init -y
```

Create a `.env` file in the root directory with your Zoom credentials:

```ini
VITE_SDK_KEY=your_client_id
VITE_SDK_SECRET=your_client_secret
ZOOM_SECRET_TOKEN=your_webhook_secret_token
PORT=3000
```

Install the dependencies:

```shell
npm install express dotenv ws whisper-node
```

Download the Whisper model:

```shell
npx whisper-node download
```

At a high level, the flow looks like this:

1. A webhook receives the RTMS start event from Zoom with connection details.
2. Our server establishes a signaling WebSocket connection and authenticates.
3. Upon successful handshake, we connect to the media WebSocket to receive audio data.
4. Audio packets are buffered and transcribed locally using Whisper.

## Sample app

We'll walk through the key components of the sample app to understand how RTMS works with Video SDK. In the [repo](https://github.com/EkaanshArora/videosdk-rtms-transcribe-audio), the server code lives under `server/` and is organized into a few files:

```plaintext
server/
├── index.ts       # Express server, webhook handler, RTMS signaling/media sockets
├── util.ts        # Audio WAV conversion + transcript formatting
└── whisper.d.ts   # Whisper typings for TypeScript
```

Let's walk through building the application step by step, following the structure of [index.ts](https://github.com/zoom/rtms-samples/blob/main/video-sdk/vsdk_transcription_js/index.ts).

### Setting up the Express server

First, we create our Express server and import the dependencies in `server/index.ts`.

```ts
import http from "http";
import crypto from "crypto";
import dotenv from "dotenv";
import WebSocket from "ws";
import whisper from "whisper-node";
import express from "express";
import {
    bufferToWaveFile,
    formatTranscript,
    type SampleAudioPacket,
    type SampleTranscript,
} from "./util";

dotenv.config({ quiet: true });
const PORT = process.env.PORT || 3000;
const ZoomSecretToken = process.env.ZOOM_SECRET_TOKEN as string;
const ZoomClientId = process.env.VITE_SDK_KEY as string;
const ZoomClientSecret = process.env.VITE_SDK_SECRET as string;

const app = express();
app.use(express.json());
```

We read Zoom credentials from environment variables and use them throughout the server for webhook validation and RTMS signatures.

To access the data stream from RTMS we can either use the [RTMS SDK](#using-the-sdk) or handle the [webhook & WebSocket connections manually](#handling-webhook-events).

### Using the SDK

You can use the [`rtms`](https://github.com/zoom/rtms) SDK to connect to RTMS streams. The SDK provides a simple interface to connect to RTMS streams and receive audio and transcript data.

```ts
import rtms from "@zoom/rtms";

rtms.onWebhookEvent(({ payload }) => {
    const client = new rtms.Client();
    client.setAudioParams({
        contentType: 2,
        codec: 1,
        sampleRate: 16000,
        channel: 1,
        dataOpt: 1,
        duration: 1000,
        frameSize: 16000,
    });
    client.onTranscriptData((data, size, timestamp, metadata) =>
        console.log(`${metadata.userName}: ${data}`),
    ); // transcript
    client.onAudioData((data) => console.log(data)); // ArrayBuffer of audio data
    client.join({ ...payload, client: ZoomClientId, secret: ZoomClientSecret });
});
```

You can directly access the transcript data from the `onTranscriptData` callback, but to transcribe the audio locally we'll use the `onAudioData` callback, which delivers the audio data as an ArrayBuffer. If you're using the SDK, you can skip to the [Handling audio data for transcription](#handling-audio-data-for-transcription) section. Read on to learn how the webhooks and WebSockets work under the hood.

### Handling webhook events

The server needs to listen for webhook events. We'll handle three scenarios: webhook validation, RTMS session start, and RTMS session stop.

#### Webhook validation

When you first configure your webhook URL in the Zoom Marketplace, Zoom validates it by sending a challenge:

```ts
app.post("/webhook", async (req, res) => {
    const { event, payload } = req.body;
    if (event === "endpoint.url_validation" && payload?.plainToken) {
        const hash = crypto
            .createHmac("sha256", ZoomSecretToken)
            .update(payload.plainToken)
            .digest("hex");
        return res.json({
            plainToken: payload.plainToken,
            encryptedToken: hash,
        });
    }
    res.sendStatus(200);
    // ... handle other events below
});
```

We hash the `plainToken` using our secret token and return it as `encryptedToken`. This proves we own the webhook endpoint.
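
The same secret token can also be used to check that later webhook requests really come from Zoom. The sketch below is not part of the sample app. It assumes Zoom's documented scheme of sending `v0=` plus an HMAC-SHA256 of `v0:{timestamp}:{body}` in the `x-zm-signature` header, with the timestamp in `x-zm-request-timestamp`; verify both header names and the message format against the current webhook docs before relying on it. `isValidZoomSignature` is a hypothetical helper name.

```typescript
import { createHmac, timingSafeEqual } from "crypto";

// Assumption: signature = "v0=" + HMAC-SHA256(secretToken, `v0:${timestamp}:${rawBody}`)
function isValidZoomSignature(
    rawBody: string, // request body exactly as received
    timestamp: string, // value of the x-zm-request-timestamp header
    signatureHeader: string, // value of the x-zm-signature header
    secretToken: string,
): boolean {
    const expected =
        "v0=" +
        createHmac("sha256", secretToken)
            .update(`v0:${timestamp}:${rawBody}`)
            .digest("hex");
    const expectedBytes = Buffer.from(expected);
    const receivedBytes = Buffer.from(signatureHeader);
    // timingSafeEqual throws on length mismatch, so compare lengths first
    return (
        expectedBytes.length === receivedBytes.length &&
        timingSafeEqual(expectedBytes, receivedBytes)
    );
}
```

In the Express handler you would call this before acting on `event`, and respond with a 401 when it returns `false`.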

#### Starting an RTMS session

When someone starts RTMS in a Video SDK session, Zoom sends a `session.rtms_started` event with connection details:

```ts
if (event === "session.rtms_started") {
    const { session_id, rtms_stream_id, server_urls } = payload;
    console.log("Starting RTMS for session:", { payload });
    connectToSignalingWebSocket(session_id, rtms_stream_id, server_urls);
}
// ...
```

The payload contains:

-   `session_id` - Unique identifier for the Video SDK session
-   `rtms_stream_id` - Unique identifier for this RTMS stream
-   `server_urls` - WebSocket URL of the signaling server to connect to (this sample passes it to `new WebSocket` as a single string)

We call `connectToSignalingWebSocket` with these values to establish the first WebSocket connection.

## WebSocket connections

RTMS uses two WebSocket connections: one for signaling and one for media data. Here are the events we will use for our app:

| msg_type | Name                        | Description                  |
| -------- | --------------------------- | ---------------------------- |
| 1        | `SIGNALING_HAND_SHAKE_REQ`  | Signaling handshake request  |
| 2        | `SIGNALING_HAND_SHAKE_RESP` | Signaling handshake response |
| 3        | `DATA_HAND_SHAKE_REQ`       | Media handshake request      |
| 4        | `DATA_HAND_SHAKE_RESP`      | Media handshake response     |
| 7        | `CLIENT_READY_ACK`          | Client ready acknowledgement |
| 12       | `KEEP_ALIVE_REQ`            | Keep-alive request           |
| 13       | `KEEP_ALIVE_RESP`           | Keep-alive response          |
| 14       | `AUDIO`                     | Audio data packet            |
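
To avoid scattering these magic numbers through the handlers, you could give them names. This is a small sketch rather than code from the sample: the values are copied from the table above, while `RtmsMsgType` and `msgTypeName` are hypothetical names.

```typescript
// Names and numeric values copied from the msg_type table above
const RtmsMsgType = {
    SIGNALING_HAND_SHAKE_REQ: 1,
    SIGNALING_HAND_SHAKE_RESP: 2,
    DATA_HAND_SHAKE_REQ: 3,
    DATA_HAND_SHAKE_RESP: 4,
    CLIENT_READY_ACK: 7,
    KEEP_ALIVE_REQ: 12,
    KEEP_ALIVE_RESP: 13,
    AUDIO: 14,
} as const;

// Reverse lookup, handy when logging unexpected messages
function msgTypeName(value: number): string {
    const names = Object.keys(RtmsMsgType) as Array<keyof typeof RtmsMsgType>;
    for (const name of names) {
        if (RtmsMsgType[name] === value) return name;
    }
    return `UNKNOWN(${value})`;
}
```

A handler can then compare against `RtmsMsgType.KEEP_ALIVE_REQ` instead of a bare `12`.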

Both connections require an HMAC-SHA256 signature for authentication using the format `CLIENT_ID,session_id,rtms_stream_id` signed with your client secret. The sample implements this in a `generateSignature` helper:

```ts
function generateSignature(sessionID: string, streamId: string): string {
    const message = `${ZoomClientId},${sessionID},${streamId}`;
    return crypto
        .createHmac("sha256", ZoomClientSecret)
        .update(message)
        .digest("hex");
}
```

### Signaling connection

The `connectToSignalingWebSocket` function (in `server/index.ts`) establishes the signaling connection and sends the initial handshake. Here's a simplified version:

```ts
function connectToSignalingWebSocket(
  session_id: string,
  rtmsStreamId: string,
  serverUrls: string,
) {
  // rejectUnauthorized: false skips TLS certificate checks; keep it to local testing
  const signalingWs = new WebSocket(serverUrls, [], { rejectUnauthorized: false });

  signalingWs.on("open", () => {
    signalingWs.send(
      JSON.stringify({
        msg_type: 1, // SIGNALING_HAND_SHAKE_REQ
        meeting_uuid: session_id,
        session_id,
        rtms_stream_id: rtmsStreamId,
        signature: generateSignature(session_id, rtmsStreamId),
      }),
    );
  });

  signalingWs.on("message", (data) => {
    const msg = JSON.parse(data.toString());
    if (msg.msg_type === 12) { // KEEP_ALIVE_REQ
      signalingWs.send(
        JSON.stringify({
          msg_type: 13, // KEEP_ALIVE_RESP
          timestamp: msg.timestamp,
        }),
      );
    } else if (msg.msg_type === 2) { // SIGNALING_HAND_SHAKE_RESP
      if (msg.status_code === 0) {
        const mediaUrl = msg.media_server?.server_urls?.audio;
        connectToMediaWebSocket(
          mediaUrl,
          session_id,
          rtmsStreamId,
          signalingWs,
        );
      }
    }
  });
}
```

When the signaling handshake succeeds (receiving `msg_type: 2` with `status_code: 0`), the signaling connection extracts the media server URL and calls `connectToMediaWebSocket`.
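
One way to make this logic easier to test is to separate the decision from the socket. The sketch below is an illustration, not the sample's code: `handleSignalingMessage` and the `SignalingAction` type are hypothetical, and the message fields are only the ones the handler above reads. It also surfaces a failed handshake, which the simplified version silently ignores.

```typescript
// Only the fields this sketch reads; everything else is ignored
interface SignalingMessage {
    msg_type: number;
    status_code?: number;
    timestamp?: number;
    media_server?: { server_urls?: { audio?: string } };
}

type SignalingAction =
    | { kind: "reply"; body: string }
    | { kind: "connectMedia"; mediaUrl: string }
    | { kind: "error"; reason: string }
    | { kind: "ignore" };

function handleSignalingMessage(raw: string): SignalingAction {
    const msg = JSON.parse(raw) as SignalingMessage;
    if (msg.msg_type === 12) {
        // KEEP_ALIVE_REQ: echo the timestamp back in a KEEP_ALIVE_RESP
        return {
            kind: "reply",
            body: JSON.stringify({ msg_type: 13, timestamp: msg.timestamp }),
        };
    }
    if (msg.msg_type === 2) {
        // SIGNALING_HAND_SHAKE_RESP
        if (msg.status_code !== 0) {
            return {
                kind: "error",
                reason: `handshake failed with status_code ${msg.status_code}`,
            };
        }
        const mediaUrl = msg.media_server?.server_urls?.audio;
        return mediaUrl
            ? { kind: "connectMedia", mediaUrl }
            : { kind: "error", reason: "no audio media URL in handshake response" };
    }
    return { kind: "ignore" };
}
```

The `message` listener then only has to act on the returned action: send the reply, open the media socket, or log the error.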

### Media connection

After the signaling handshake, the app opens a second WebSocket to the media server. It sends a handshake message specifying which types of media data it wants to receive. The flow is very similar to the first:

```ts
function connectToMediaWebSocket(mediaUrl: string, session_id: string, rtmsStreamId: string, signalingSocket: WebSocket) {
  const mediaWs = new WebSocket(mediaUrl, [], { rejectUnauthorized: false });

  mediaWs.on("open", () => {
    const handshakeMsg = {
      msg_type: 3, // DATA_HAND_SHAKE_REQ
      protocol_version: 1,
      sequence: 0,
      meeting_uuid: session_id,
      rtms_stream_id: rtmsStreamId,
      signature: generateSignature(session_id, rtmsStreamId),
      media_type: 1, // AUDIO
      payload_encryption: false,
      media_params: {
        audio: {
          content_type: 1, //RTP
          sample_rate: 1, //16k
          channel: 1, //mono
          codec: 1, //L16
          data_opt: 1, //AUDIO_MIXED_STREAM
          send_rate: 1000, //in Milliseconds
        },
      },
    };

    mediaWs.on("message", (data) => {
      const msg = JSON.parse(data.toString());
      if (msg.msg_type === 14) { // AUDIO DATA
        if (msg.content?.data) {
          const { data: audioData } =
            msg.content as SampleAudioPacket["content"];
          const buffer = Buffer.from(audioData, "base64");
          void transcribeAudio(buffer);
        }
      } else if (msg.msg_type === 4 && msg.status_code === 0) { // DATA_HAND_SHAKE_RESP
        signalingSocket.send(
          JSON.stringify({
            msg_type: 7, // CLIENT_READY_ACK
            rtms_stream_id: rtmsStreamId,
          }),
        );
      } else if (msg.msg_type === 12) { // KEEP_ALIVE_REQ
        mediaWs.send(
          JSON.stringify({
            msg_type: 13, // KEEP_ALIVE_RESP
            timestamp: msg.timestamp,
          }),
        );
      }
    });
    mediaWs.send(JSON.stringify(handshakeMsg));
  });
}
```

We get the audio data on the `message` event of the media WebSocket. Audio data is sent as uncompressed raw PCM (L16) data with a 16 kHz sample rate and a mono channel. The `media_type: 1` value indicates we want only audio. Learn more: [media_data_type](/docs/rtms/data-types/).

### Processing media messages

Incoming packets on the media WebSocket are handled inline in `server/index.ts`. The message handler responds to:

-   **4 (DATA_HAND_SHAKE_RESP)**: Media handshake response. If successful (`status_code === 0`), the server sends a `CLIENT_READY_ACK` (`msg_type: 7`) to the signaling socket.
-   **12 (KEEP_ALIVE_REQ)**: Media server keep-alive ping. The client must reply with `KEEP_ALIVE_RESP` (`msg_type: 13`) and the provided timestamp.
-   **14 (AUDIO)**: Audio packets encoded as base64 PCM data. The code decodes the payload, and passes the audio `Buffer` to `transcribeAudio` to build up the audio stream for transcription.

The handler sends the required acknowledgements and buffers audio for Whisper transcription. We'll walk through the audio processing utilities in the next section.
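
The decoding step for audio packets can be isolated into a small function. This is a minimal sketch, assuming the packet shape implied by the handler above (`msg_type: 14` with a base64 string at `content.data`); `decodeAudioPacket` is a hypothetical name, and the guard against malformed input is an addition that the sample does not have.

```typescript
// Assumed packet shape, based on the fields the media handler reads
interface AudioPacket {
    msg_type?: number;
    content?: { data?: string };
}

// Returns the raw PCM bytes of an audio packet, or null for anything else
function decodeAudioPacket(raw: string): Buffer | null {
    let msg: AudioPacket | null;
    try {
        msg = JSON.parse(raw) as AudioPacket | null;
    } catch {
        return null; // not valid JSON, so not a packet we understand
    }
    const data = msg?.content?.data;
    if (msg?.msg_type !== 14 || typeof data !== "string") return null;
    return Buffer.from(data, "base64");
}
```

Because each sample is 16 bits, a decoded buffer of `n` bytes holds `n / 2` samples.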

#### Stopping an RTMS session

When the RTMS session ends, Zoom sends a `session.rtms_stopped` event:

```ts
else if (event === 'session.rtms_stopped') {
  const { session_id } = payload;
  console.log(`Stopping RTMS for Video session ${session_id}`);
}
```

This sample logs the stop event; you can expand it to close sockets and clean up state as needed.
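
As a starting point for that cleanup, you could keep a registry of open sockets per session and close them when the stop event arrives. Everything in this sketch is hypothetical (`activeConnections`, `registerConnection`, `closeConnections`); it uses a minimal `Closable` interface so it does not depend on the `ws` package, whose `WebSocket` has a compatible `close()` method.

```typescript
// Anything with a close() method, such as a `ws` WebSocket
interface Closable {
    close(): void;
}

interface RtmsConnections {
    signaling?: Closable;
    media?: Closable;
}

// Hypothetical registry of open sockets, keyed by session_id
const activeConnections = new Map<string, RtmsConnections>();

function registerConnection(
    sessionId: string,
    kind: keyof RtmsConnections,
    socket: Closable,
): void {
    const entry: RtmsConnections = activeConnections.get(sessionId) ?? {};
    entry[kind] = socket;
    activeConnections.set(sessionId, entry);
}

// Closes both sockets for a session; returns false if nothing was registered
function closeConnections(sessionId: string): boolean {
    const entry = activeConnections.get(sessionId);
    if (!entry) return false;
    entry.media?.close();
    entry.signaling?.close();
    activeConnections.delete(sessionId);
    return true;
}
```

You would call `registerConnection` right after creating each WebSocket and `closeConnections(session_id)` inside the `session.rtms_stopped` branch.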

## Handling audio data for transcription

When we receive audio packets (`msg_type: 14`), we need to convert them from raw PCM format to WAV format for Whisper. The audio processing utilities live in `server/util.ts`.

### Converting PCM to WAV

RTMS sends audio as base64-encoded raw PCM (L16) data at 16 kHz mono, while the Whisper model requires a WAV file for transcription. We'll create a `bufferToWaveFile` function that wraps the raw buffer with a 44-byte WAV header and writes it to a file:

```ts
import fs from "fs";
import path from "path";

export const bufferToWaveFile = (buffer: Buffer<ArrayBuffer>) => {
    const wavePath = path.join(process.cwd(), `audio_${Date.now()}.wav`);
    const pcmData = buffer;
    const header = Buffer.alloc(44);
    const dataSize = pcmData.length;
    const fileSize = dataSize + 36;

    // RIFF chunk descriptor
    header.write("RIFF", 0);
    header.writeUInt32LE(fileSize, 4);
    header.write("WAVE", 8);

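    // fmt sub-chunk: PCM format, mono, 16kHz, 16-bit
    header.write("fmt ", 12);
    header.writeUInt32LE(16, 16); // fmt chunk size (16 for PCM)
    header.writeUInt16LE(1, 20); // audio format (1 = PCM)
    header.writeUInt16LE(1, 22); // channels (mono)
    header.writeUInt32LE(16000, 24); // sample rate in Hz
    header.writeUInt32LE(32000, 28); // byte rate = sample rate * channels * 2 bytes
    header.writeUInt16LE(2, 32); // block align = channels * 2 bytes
    header.writeUInt16LE(16, 34); // bits per sample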

    // data sub-chunk
    header.write("data", 36);
    header.writeUInt32LE(dataSize, 40);

    fs.writeFileSync(wavePath, Buffer.concat([header, pcmData]));
    return wavePath;
};
```
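
To sanity-check a file produced this way, you can read the header fields back from the same offsets. The sketch below is a hypothetical helper (`readWavHeader`) that assumes the canonical 44-byte layout written above; WAV files from other tools may carry extra chunks that it does not handle.

```typescript
interface WavInfo {
    channels: number;
    sampleRate: number;
    bitsPerSample: number;
    dataSize: number;
    durationSeconds: number;
}

// Reads the fields of a canonical 44-byte WAV header
function readWavHeader(wav: Buffer): WavInfo {
    const byteRate = wav.length >= 44 ? wav.readUInt32LE(28) : 0;
    if (
        byteRate === 0 ||
        wav.toString("ascii", 0, 4) !== "RIFF" ||
        wav.toString("ascii", 8, 12) !== "WAVE"
    ) {
        throw new Error("not a WAV buffer with a canonical 44-byte header");
    }
    const dataSize = wav.readUInt32LE(40);
    return {
        channels: wav.readUInt16LE(22),
        sampleRate: wav.readUInt32LE(24),
        bitsPerSample: wav.readUInt16LE(34),
        dataSize,
        durationSeconds: dataSize / byteRate, // bytes divided by bytes per second
    };
}
```

For a file written by `bufferToWaveFile`, `readWavHeader(fs.readFileSync(wavePath))` should report one channel, a 16000 Hz sample rate, and 16 bits per sample.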

### Buffering and transcribing

Since transcribing short clips can produce imperfect results, we buffer about 5 seconds of audio (160,000 bytes of 16-bit, 16 kHz mono PCM) before transcribing:

```ts
let transcriptBuffer = Buffer.alloc(0);

const transcribeAudio = async (buffer: Buffer<ArrayBuffer>) => {
    transcriptBuffer = Buffer.concat([transcriptBuffer, buffer]);
    if (transcriptBuffer.length >= 16000 * 10) {
        void getTranscriptFromBuffer(transcriptBuffer);
        transcriptBuffer = Buffer.alloc(0);
    }
};
```

This function accumulates audio buffers and, once `transcriptBuffer` reaches the 160,000-byte threshold (about 5 seconds at 32,000 bytes per second; use `32000 * 10` if you want roughly 10 seconds), passes it to `getTranscriptFromBuffer`, which writes the audio to a WAV file and runs Whisper transcription:

```ts
const getTranscriptFromBuffer = async (buffer: Buffer<ArrayBuffer>) => {
    const bufferCopy = Buffer.from(buffer);
    const wavePath = await bufferToWaveFile(bufferCopy);
    const transcript = await whisper(wavePath, {
        modelName: "base.en",
    });
    fs.unlinkSync(wavePath);
    console.log(transcript);
};
```

We're using the [whisper-node](https://www.npmjs.com/package/whisper-node) bindings to generate the transcript.
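
The buffering threshold is easier to reason about when it is written in seconds rather than bytes. The following sketch is an alternative take on the buffering step, not the sample's code: `bytesForSeconds` and `AudioChunker` are hypothetical names, and the constants assume the 16 kHz, 16-bit mono format requested in the media handshake.

```typescript
const SAMPLE_RATE = 16000; // samples per second (16 kHz)
const BYTES_PER_SAMPLE = 2; // L16 means 16-bit samples
const CHANNELS = 1; // mono

function bytesForSeconds(seconds: number): number {
    return seconds * SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS;
}

// Collects packets and hands back one chunk each time the threshold is reached
class AudioChunker {
    private pending: Buffer[] = [];
    private pendingBytes = 0;
    private readonly thresholdBytes: number;

    constructor(thresholdBytes: number) {
        this.thresholdBytes = thresholdBytes;
    }

    // Returns the accumulated audio once enough has arrived, otherwise null
    push(packet: Buffer): Buffer | null {
        this.pending.push(packet);
        this.pendingBytes += packet.length;
        if (this.pendingBytes < this.thresholdBytes) return null;
        const chunk = Buffer.concat(this.pending, this.pendingBytes);
        this.pending = [];
        this.pendingBytes = 0;
        return chunk;
    }
}
```

With `new AudioChunker(bytesForSeconds(10))`, each non-null result of `push` holds at least 10 seconds of audio. Collecting packets in an array and concatenating once per chunk also avoids copying the whole buffer on every packet.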

### Running the server

Finally, we start the HTTP server in `server/index.ts`:

```ts
const server = http.createServer(app);

server.listen(PORT, () => {
    console.log(`Server running at http://localhost:${PORT}`);
});
```

### Start the RTMS session from your server

Now that the server is configured, you can use the [REST API](/docs/api/video-sdk/#tag/sessions/patch/videosdk/sessions/%7BsessionId%7D/rtms_app/status) to start the RTMS streams:

```ts
fetch(`https://api.zoom.us/v2/videosdk/sessions/${sessionId}/rtms_app/status`, {
    method: "PATCH",
    headers: {
        "Content-Type": "application/json",
        Authorization: "Bearer YOUR_SECRET_TOKEN",
    },
    body: JSON.stringify({
        action: "start",
    }),
});
```

Alternatively, you can also use the Video SDK [`RealTimeMediaStreamsClient`](https://marketplacefront.zoom.us/sdk/custom/web/modules/ZoomVideo.RealTimeMediaStreamsClient.html) object to start/stop the RTMS streams.
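
If you call the REST endpoint from several places, it can help to build the request in one function. This is a sketch under assumptions: `buildStartRtmsRequest` is a hypothetical helper, the URL and body mirror the `fetch` call above, and URL-encoding the session ID is a defensive choice rather than something the sample does.

```typescript
interface StartRtmsRequest {
    url: string;
    init: {
        method: string;
        headers: Record<string, string>;
        body: string;
    };
}

// Builds the PATCH request shown above without sending it
function buildStartRtmsRequest(
    sessionId: string,
    bearerToken: string,
): StartRtmsRequest {
    return {
        // Assumption: encode the ID in case it contains characters such as "/"
        url: `https://api.zoom.us/v2/videosdk/sessions/${encodeURIComponent(sessionId)}/rtms_app/status`,
        init: {
            method: "PATCH",
            headers: {
                "Content-Type": "application/json",
                Authorization: `Bearer ${bearerToken}`,
            },
            body: JSON.stringify({ action: "start" }),
        },
    };
}
```

You can then pass `url` and `init` to `fetch` and check `response.ok` before assuming the stream has started.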

## Conclusion

With RTMS and Video SDK, you can build server-side applications that process real-time media without running automated clients. Beyond local transcription, you could build real-time translation, speech analytics, or meeting summarization with your own AI models.

Check out the [RTMS documentation](/docs/rtms/) and explore more samples in our [rtms-samples repository](https://github.com/zoom/rtms-samples).