Build real-time audio and video deepfake detection with Zoom RTMS

With the April 2026 RTMS release, developers can now subscribe to individual video streams in addition to individual audio streams. That means your app can choose a participant, stream their meeting media to your backend, prepare the audio and video in the format your models expect, and send the results back to the user while the meeting is still happening.

In this demo, Zoom Realtime Media Streams (RTMS) sends live meeting audio and video to Hugging Face deepfake detection models. The inference results are then displayed directly inside a Zoom App.

You can use this same pattern with your own models, or build and publish a Zoom App that provides deepfake detection as a service for your users.

Why individual streams matter

If you are trying to analyze one person, active speaker video and individual video streams behave differently.

Active speaker video follows whoever is speaking at that moment, so the person in frame changes as the conversation moves. An individual video stream stays on one participant, which is what you need when your app has to keep analyzing the same person.

The sample uses VIDEO_SINGLE_INDIVIDUAL_STREAM for video and AUDIO_MULTI_STREAMS for audio. The backend explicitly subscribes to one participant's video, then filters the audio down to the same selected RTMS user_id.

That way, every clip the models score comes from the participant you selected.
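As a rough illustration, the audio filter can be as small as a predicate on the user ID. This is a minimal sketch that assumes each audio packet has already been parsed into an object with a userId and a PCM buffer. The packet shape, the createAudioFilter name, and the example IDs are hypothetical, not the RTMS SDK's actual types.

```javascript
// Hypothetical packet shape: { userId: number, data: Buffer }.
// The real RTMS audio payload may use different field names.
function createAudioFilter(selectedUserId) {
  let current = selectedUserId;
  return {
    // Switch to another participant without rebuilding the pipeline.
    select(userId) {
      current = userId;
    },
    // True only for packets that belong to the selected participant.
    accepts(packet) {
      return current != null && packet.userId === current;
    },
  };
}

// Example: two participants are talking, but only one is selected.
const filter = createAudioFilter(16778240);
const packets = [
  { userId: 16778240, data: Buffer.from([1, 2]) },
  { userId: 16779264, data: Buffer.from([3, 4]) },
];
const kept = packets.filter((packet) => filter.accepts(packet));
```

Keeping the selected ID in one place means the same value can drive both the video subscription and the audio filter.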

Here is a demo of the sample app streaming individual video and audio for deepfake detection:

[Video demo]

What the sample builds

The sample app has two main parts:

A Zoom App frontend
A Node.js Express backend

The frontend uses the Zoom Apps SDK to start and stop RTMS, list meeting participants, select a participant, and display detection results. It does not connect directly to RTMS media sockets.

The backend manages the RTMS connection. It receives meeting.rtms_started and meeting.rtms_stopped webhooks, opens the signaling and media connections, subscribes to the selected participant's individual video stream, filters audio packets, prepares media clips for the models, and sends those clips to external inference services.
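The webhook side of that lifecycle could be sketched as a small dispatcher. The two event names come from the sample. The payload field names (meeting_uuid, rtms_stream_id, server_urls) and the connect and disconnect callbacks are assumptions for illustration, so check them against the webhook payload your app actually receives.

```javascript
// Active RTMS sessions, keyed by meeting UUID.
const sessions = new Map();

// body: the parsed webhook JSON.
// handlers.connect(payload) should open the signaling and media
// connections and return a session object; handlers.disconnect(session)
// should close them. Both are hypothetical callbacks.
function handleRtmsWebhook(body, handlers) {
  const { event, payload = {} } = body;
  const meetingId = payload.meeting_uuid; // assumed field name

  switch (event) {
    case "meeting.rtms_started":
      // Webhooks can be retried, so avoid connecting twice.
      if (!sessions.has(meetingId)) {
        sessions.set(meetingId, handlers.connect(payload));
      }
      return "started";
    case "meeting.rtms_stopped": {
      const session = sessions.get(meetingId);
      if (session) {
        handlers.disconnect(session);
        sessions.delete(meetingId);
      }
      return "stopped";
    }
    default:
      return "ignored";
  }
}
```

In an Express app, a function like this would typically be called from a POST route after the JSON body has been parsed, with the route responding 200 right away so that slow connection setup does not hold up the webhook response.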

The high-level flow looks like this:

Zoom Meeting
  -> Zoom App starts RTMS
  -> Backend receives RTMS webhooks
  -> Backend connects to RTMS signaling and media sockets
  -> App selects a participant
  -> Backend subscribes to that participant's video
  -> Backend filters audio for the same participant
  -> Video and audio are converted into short clips
  -> Clips are sent to deepfake detection services
  -> Results are displayed inside the Zoom App

The detection models

This demo uses two Hugging Face models:

Video detection: Naman712/Deep-fake-detection
Audio detection: MelodyMachine/Deepfake-audio-detection-V2

The video model expects MP4 clips, and the audio detection service expects audio files such as WAV. Your own model might have different format, duration, or sampling requirements.

The models are used as-is. They have not been fine-tuned or validated for production accuracy in this sample.

You are not limited to these models. You can replace the Hugging Face examples with your own model, a hosted inference API, or an internal detection service that fits your accuracy and latency requirements.
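One way to keep models swappable is to hide each service behind a function that takes clip bytes and returns a single score. The sketch below assumes the service returns a classification list such as [{ label, score }] and that one of the labels means "fake". The label names, the URL you pass in, and the function names are all placeholders, since each model or API defines its own response format.

```javascript
// Assumed response shape: [{ label: "fake", score: 0.93 }, { label: "real", score: 0.07 }].
// Returns the score of the first "fake"-like label, or null if none is present.
function toFakeScore(predictions, fakeLabels = ["fake", "spoof", "deepfake"]) {
  const hit = predictions.find((p) =>
    fakeLabels.includes(String(p.label).toLowerCase())
  );
  return hit ? hit.score : null;
}

// Posts one clip to a detection service and normalizes the answer.
// url is whatever inference endpoint you configure; fetchImpl defaults to
// the global fetch available in Node.js 18 and later.
async function detectClip(url, clipBytes, contentType, fetchImpl = fetch) {
  const res = await fetchImpl(url, {
    method: "POST",
    headers: { "Content-Type": contentType },
    body: clipBytes,
  });
  if (!res.ok) {
    throw new Error(`Detector returned HTTP ${res.status}`);
  }
  return toFakeScore(await res.json());
}
```

With this shape, replacing a model mostly means changing the URL and, if needed, the label list, while the rest of the pipeline keeps working with plain numbers.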

The media conversion layer

RTMS gives your app live media, but your model determines the format that media has to be converted into.

In the sample, the backend converts H264 video into short MP4 clips using ffmpeg. The default setup builds 2-second clips at 5 FPS before sending them to the video detection service.
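A conversion step along those lines might look like the sketch below. It is not the sample's actual command. It assumes the buffered frames form a raw H.264 elementary stream that starts on a keyframe, that frames were collected at the clip frame rate, and that ffmpeg is on the PATH. The function names are hypothetical, and the stream is remuxed rather than re-encoded.

```javascript
const { spawn } = require("node:child_process");

// Builds ffmpeg arguments for one clip (defaults: 2 seconds at 5 FPS).
function buildClipArgs({ fps = 5, seconds = 2, output }) {
  return [
    "-hide_banner", "-loglevel", "error",
    "-f", "h264",               // input is a raw H.264 elementary stream
    "-framerate", String(fps),  // raw streams carry no container timestamps
    "-i", "pipe:0",             // read the frames from stdin
    "-frames:v", String(fps * seconds), // stop after one clip's worth of frames
    "-c:v", "copy",             // remux into MP4 without re-encoding
    "-movflags", "+faststart",  // move the index to the front of the file
    "-y", output,
  ];
}

// Pipes buffered H.264 data through ffmpeg and resolves with the MP4 path.
function writeClip(h264Buffer, options) {
  return new Promise((resolve, reject) => {
    const ff = spawn("ffmpeg", buildClipArgs(options));
    ff.on("error", reject); // for example, ffmpeg is not installed
    ff.stdin.on("error", reject); // ffmpeg closed its input early
    ff.on("close", (code) => {
      if (code === 0) resolve(options.output);
      else reject(new Error(`ffmpeg exited with code ${code}`));
    });
    ff.stdin.end(h264Buffer);
  });
}
```

If your model needs a fixed resolution or pixel format, you would swap the copy step for a re-encode, at the cost of extra CPU time per clip.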

Audio is handled separately. The sample builds rolling 4-second audio windows from RTMS PCM audio and sends them to the audio detection service.
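A rolling window like that can be sketched with a buffer that only keeps the newest bytes, plus a standard 44-byte WAV header. The PCM format here (16 kHz, mono, 16-bit little-endian) is an assumption, as is the class name. Match the constants to the audio parameters your RTMS session actually uses.

```javascript
// Assumed PCM format: 16 kHz, mono, 16-bit little-endian samples.
const SAMPLE_RATE = 16000;
const BYTES_PER_SAMPLE = 2;

class RollingPcmWindow {
  constructor(seconds = 4) {
    this.maxBytes = seconds * SAMPLE_RATE * BYTES_PER_SAMPLE;
    this.buffer = Buffer.alloc(0);
  }

  // Append a chunk and drop the oldest audio beyond the window length.
  // Chunks are assumed to contain whole samples.
  push(chunk) {
    const joined = Buffer.concat([this.buffer, chunk]);
    const start = Math.max(0, joined.length - this.maxBytes);
    this.buffer = joined.subarray(start);
  }

  isFull() {
    return this.buffer.length === this.maxBytes;
  }

  // Wrap the current window in a canonical PCM WAV header.
  toWav() {
    const dataLength = this.buffer.length;
    const header = Buffer.alloc(44);
    header.write("RIFF", 0, "ascii");
    header.writeUInt32LE(36 + dataLength, 4); // file size minus 8 bytes
    header.write("WAVE", 8, "ascii");
    header.write("fmt ", 12, "ascii");
    header.writeUInt32LE(16, 16);             // fmt chunk size
    header.writeUInt16LE(1, 20);              // format 1 = PCM
    header.writeUInt16LE(1, 22);              // channels
    header.writeUInt32LE(SAMPLE_RATE, 24);
    header.writeUInt32LE(SAMPLE_RATE * BYTES_PER_SAMPLE, 28); // byte rate
    header.writeUInt16LE(BYTES_PER_SAMPLE, 32);               // block align
    header.writeUInt16LE(8 * BYTES_PER_SAMPLE, 34);           // bits per sample
    header.write("data", 36, "ascii");
    header.writeUInt32LE(dataLength, 40);
    return Buffer.concat([header, this.buffer]);
  }
}
```

A typical loop would push every filtered audio packet into the window and, on a timer, send toWav() to the audio detector once isFull() returns true.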

If your model expects a different format, this is the part you would customize.

After the detection services return a result, the backend sends the video and audio scores back to the Zoom App frontend. The app then displays the latest detection status for the selected participant inside the meeting.
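The message the backend hands to the frontend could be as simple as the object built below. This sketch assumes each score is a probability between 0 and 1 that the media is fake, and the message fields, status strings, and 0.5 threshold are illustrative choices. How the message travels to the Zoom App (for example, a WebSocket or polling) is left open.

```javascript
// Maps a score to a display status. A missing score means no result yet.
function toStatus(score, threshold = 0.5) {
  if (typeof score !== "number") return "pending";
  return score >= threshold ? "likely fake" : "likely real";
}

// Builds one update for the selected participant.
// scores: { video?: number, audio?: number }, latest value per detector.
function buildResultMessage(userId, scores, threshold = 0.5) {
  return {
    type: "detection_result", // hypothetical message type
    userId,
    video: {
      score: scores.video ?? null,
      status: toStatus(scores.video, threshold),
    },
    audio: {
      score: scores.audio ?? null,
      status: toStatus(scores.audio, threshold),
    },
    updatedAt: new Date().toISOString(),
  };
}
```

Reporting video and audio separately lets the frontend show a result as soon as either detector responds, instead of waiting for both.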

Quick setup summary

The sample is available in the zoom/rtms-samples repository. To try it, clone the repo, install the Node.js dependencies, configure your Zoom app credentials and RTMS webhook secret, then point the video and audio detection settings to your inference services.

Once the app is running, the workflow is:

Start a Zoom Meeting.
Open the Zoom App.
Start RTMS.
Select a participant.
Load that participant's individual video stream.
Start video verification.
Start audio verification.
Watch the results update inside the Zoom App.

More individual stream use cases

Deepfake detection is one example. The same RTMS pattern can support other real-time media analysis workflows:

KYC and identity verification
Facial analytics
Emotion detection

The parts you will probably tune first are the clip duration, encoding format, audio sampling rate, detection threshold, model workflow, latency target, and how results are displayed to users.
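To make those knobs easy to adjust, you could collect them in one settings object. In this sketch, the defaults for clip length, frame rate, and audio window mirror the values described above, while the environment variable names, the 16 kHz sample rate, and the 0.5 threshold are assumptions rather than the sample's actual configuration.

```javascript
// Reads tuning values from the environment, falling back to defaults.
// All variable names here are hypothetical.
function loadSettings(env = process.env) {
  const positive = (name, fallback) => {
    const value = Number(env[name]);
    return Number.isFinite(value) && value > 0 ? value : fallback;
  };
  return {
    videoClipSeconds: positive("VIDEO_CLIP_SECONDS", 2),
    videoFps: positive("VIDEO_FPS", 5),
    audioWindowSeconds: positive("AUDIO_WINDOW_SECONDS", 4),
    audioSampleRate: positive("AUDIO_SAMPLE_RATE", 16000), // assumed rate
    fakeThreshold: positive("FAKE_THRESHOLD", 0.5),        // assumed cutoff
  };
}

const settings = loadSettings();
```

Shorter clips lower latency but give the models less context, so it is worth measuring accuracy and response time together whenever you change these values.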

Start building

The full sample is available on GitHub:

github.com/zoom/rtms-samples/tree/main/zoom_apps/stream_audio_and_video_deepfake_detection_js

For the RTMS release details, see the RTMS JavaScript SDK v1.1.0 changelog. If you need help, join the Zoom Developer Community.