Dynamic audio processing: how to add real-time audio effects to your Zoom Video SDK app

With the release of Zoom Video SDK 2.1.5, we've added support for media processors. These let you modify a user's audio, video, or screen share feed before it is sent to remote users. In this blog post, we'll show you how to use an audio processor to dynamically change the pitch of the user's voice in real time.

Prerequisites

  • Node.js (LTS) and npm
  • A Zoom Video SDK Account

We'll build on top of the Zoom Video SDK quickstart guide. If you're new to the SDK, we recommend checking out the quickstart guide first. You can clone that repo and follow the steps to get started:

git clone https://github.com/zoom/videosdk-web-helloworld

The completed code for this guide is available on GitHub.

Media processors

The media processor design is inspired by the AudioWorklet API. The processor runs within the AudioWorkletGlobalScope, off the main thread, which keeps audio processing from competing with the rest of your app. To show how custom audio processing logic is defined, we'll create an audio processor that increases the pitch of the user's voice.

How does pitch shifting work?

The pitch of an audio track is determined by the frequency of its sound waves — essentially, how many times the wave oscillates per second. Increasing the frequency raises the pitch, while lowering it makes the sound deeper.

One simple way to raise the pitch is by speeding up playback. When you play audio faster, the sound waves oscillate more times per second, making the voice sound higher — like the classic "chipmunk effect."

Here's a simplified explanation of some of the audio jargon:

  • Frequency: The number of times the sound wave oscillates per second.
  • Pitch: How "high" or "low" a sound is, determined by its frequency.
  • Audio Sample: A single value that represents the amplitude of the sound wave at a specific point in time.
  • Buffer: A fixed-size array that stores audio samples.
  • Sample Rate: The number of audio samples captured or played per second.
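
To make these terms concrete, here's a minimal sketch that runs outside the SDK. It generates a 440 Hz tone, reads it back 1.5 times faster by skipping samples, and estimates the frequency of both signals by counting zero crossings. The helper names (sineWave, speedUp, estimateFrequency) and the 48,000 Hz sample rate are our own assumptions for illustration; they are not part of the Video SDK.

```javascript
// Fill a buffer with a sine wave of the given frequency.
function sineWave(frequency, sampleRate, length) {
  const samples = new Float32Array(length);
  for (let i = 0; i < length; i++) {
    samples[i] = Math.sin((2 * Math.PI * frequency * i) / sampleRate);
  }
  return samples;
}

// Read the source `ratio` times faster by stepping through it in
// increments of `ratio` (no interpolation, just the nearest lower sample).
function speedUp(source, ratio) {
  const out = new Float32Array(Math.floor(source.length / ratio));
  for (let i = 0; i < out.length; i++) {
    out[i] = source[Math.floor(i * ratio)];
  }
  return out;
}

// Rough frequency estimate: count upward zero crossings per second.
function estimateFrequency(samples, sampleRate) {
  let crossings = 0;
  for (let i = 1; i < samples.length; i++) {
    if (samples[i - 1] < 0 && samples[i] >= 0) crossings++;
  }
  return (crossings * sampleRate) / samples.length;
}

const sampleRate = 48000; // assumed for this example
const original = sineWave(440, sampleRate, sampleRate); // one second of audio
const faster = speedUp(original, 1.5);

console.log(estimateFrequency(original, sampleRate)); // close to 440
console.log(estimateFrequency(faster, sampleRate)); // close to 660
```

The sped-up signal is shorter (fewer samples) and its frequency is about 1.5 times higher, which is the chipmunk effect in numbers.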

Step 1: Create a pitch shift audio processor

To define an audio processor, we'll create a new file, public/pitch-processor.js. In it, we'll define a PitchShiftProcessor class that extends the SDK's AudioProcessor class.

  1. The processor takes in audio samples and stores them in a circular buffer.
  2. We read samples back out of the buffer faster than they were written, which raises the pitch.
  3. We pass these values through a high-pass filter to remove unwanted low-frequency noise.
  4. We mix the filtered and unfiltered pitch-shifted samples according to the dryWet ratio.
  5. We output the modified audio samples.

constructor

The constructor initializes the processor and sets up the circular buffer for pitch shifting. We initialize various buffer positions and timing parameters:

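class PitchShiftProcessor extends AudioProcessor {
    constructor(port, options) {
        super(port, options);
        this.bufferSize = 11025; // circular buffer length in samples
        this.buffer = new Float32Array(this.bufferSize);
        this.writePos = 0; // next index to write an input sample to
        this.readPos = 0.0; // fractional read position
        this.pitchRatio = 1.5; // read speed; values above 1 raise the pitch
        this.dryWet = 0.7; // share of the filtered signal in the output mix
        this.hpf = {
            // state for the high-pass filter
            prevIn: 0,
            prevOut: 0,
            alpha: 0.86,
        };
    }
    // ...
}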

process

The process function is called for every block of audio samples. It is the main entry point for the audio processing pipeline. We take the first input and the first output from the inputs and outputs arrays and return early if the input has no channel data. Otherwise, we copy the first input channel into the circular buffer.

class PitchShiftProcessor extends AudioProcessor {
    // ...
    process(inputs, outputs) {
      const input = inputs[0];
      const output = outputs[0];
      // No input audio in this block, so skip it
      if (input.length === 0 || !input[0]) return true;
      const inputChannel = input[0];
      const outputChannel = output[0];
      // Copy the incoming samples into the circular buffer
      for (let i = 0; i < inputChannel.length; i++) {
        this.buffer[this.writePos] = inputChannel[i];
        this.writePos = (this.writePos + 1) % this.bufferSize;
      }
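
If circular buffers are new to you, here's a small standalone sketch of the same write logic. The buffer size of 8 and the write helper are our own choices to keep the output readable; the processor itself uses 11025 samples.

```javascript
// A tiny circular buffer using the same write logic as the processor.
const bufferSize = 8; // deliberately small for illustration
const buffer = new Float32Array(bufferSize);
let writePos = 0;

// Hypothetical helper: write a block of samples, wrapping at the end.
function write(block) {
  for (let i = 0; i < block.length; i++) {
    buffer[writePos] = block[i];
    writePos = (writePos + 1) % bufferSize;
  }
}

write([1, 2, 3, 4, 5]);
write([6, 7, 8, 9, 10]);

console.log(Array.from(buffer)); // [9, 10, 3, 4, 5, 6, 7, 8]
console.log(writePos); // 2
```

The second block runs past the end of the buffer, so its last two samples overwrite the oldest values at the start.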

Next, we read from the circular buffer at a different rate to achieve pitch shifting. Because the read position is fractional, the variable raw is calculated by linear interpolation between the current sample and the next sample in the buffer. This gives a smoother transition between values.

    process(inputs, outputs) {
      // ...
      for (let i = 0; i < outputChannel.length; i++) {
        // Wrap the fractional read position into the buffer range
        let readPos = this.readPos % this.bufferSize;
        if (readPos < 0) readPos += this.bufferSize;
        const intPos = Math.floor(readPos);
        const frac = readPos - intPos;
        const nextPos = (intPos + 1) % this.bufferSize;
        // Linear interpolation between the two neighboring samples
        const raw = this.buffer[intPos] * (1 - frac) + this.buffer[nextPos] * frac;
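
Here's the interpolation step on its own, as a rough sketch you can run outside the processor. The readInterpolated helper and the four-sample buffer are hypothetical names and values chosen for illustration.

```javascript
// Hypothetical helper: read a value at a fractional position in a
// circular buffer, blending the two nearest samples.
function readInterpolated(buffer, position) {
  const size = buffer.length;
  let readPos = position % size;
  if (readPos < 0) readPos += size;
  const intPos = Math.floor(readPos);
  const frac = readPos - intPos; // how far we are past intPos, from 0 to 1
  const nextPos = (intPos + 1) % size; // wraps to 0 at the end of the buffer
  return buffer[intPos] * (1 - frac) + buffer[nextPos] * frac;
}

const samples = Float32Array.from([0, 10, 20, 30]);

console.log(readInterpolated(samples, 2)); // 20, an exact sample
console.log(readInterpolated(samples, 1.5)); // 15, halfway between 10 and 20
console.log(readInterpolated(samples, 3.5)); // 15, halfway between 30 and the wrapped 0
```

The last call shows why nextPos uses the modulo: a read near the end of the buffer blends with the first sample.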

We use a high-pass filter to remove unwanted low-frequency sound. We then blend the filtered and unfiltered samples according to the dryWet ratio and write the result to outputChannel.

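        // High-pass filter: uses the previous input and output samples
        const filtered = raw - this.hpf.prevIn + this.hpf.alpha * this.hpf.prevOut;
        this.hpf.prevIn = raw;
        this.hpf.prevOut = filtered;
        // Blend filtered and unfiltered samples by the dryWet ratio
        outputChannel[i] = filtered * this.dryWet + raw * (1 - this.dryWet);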
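
To see what the filter and the mix do, the sketch below applies the same two formulas to a constant signal (the lowest possible frequency) and to a signal that flips sign on every sample (the highest). The createHighPass and mix helpers are our own wrappers for illustration, not SDK functions.

```javascript
// Hypothetical wrapper around the filter formula used in the processor.
function createHighPass(alpha) {
  let prevIn = 0;
  let prevOut = 0;
  return function filter(sample) {
    const filtered = sample - prevIn + alpha * prevOut;
    prevIn = sample;
    prevOut = filtered;
    return filtered;
  };
}

// Same dry/wet blend as the processor.
function mix(filtered, raw, dryWet) {
  return filtered * dryWet + raw * (1 - dryWet);
}

// A constant input decays towards zero after the first sample.
const slowFilter = createHighPass(0.86);
let slowOut = 0;
for (let i = 0; i < 200; i++) slowOut = slowFilter(1);
console.log(Math.abs(slowOut) < 1e-6); // true

// A rapidly alternating input passes through at full strength.
const fastFilter = createHighPass(0.86);
let fastOut = 0;
for (let i = 0; i < 200; i++) fastOut = fastFilter(i % 2 === 0 ? 1 : -1);
console.log(Math.abs(fastOut) > 1); // true

// With dryWet = 0.7, 30% of the unfiltered sample is still mixed in.
console.log(mix(slowOut, 1, 0.7)); // close to 0.3
```

A higher dryWet value leans more on the filtered signal, while a lower one keeps more of the unfiltered sound.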

We advance the read position by the pitch ratio. When it passes the end of the buffer, it wraps around to the start, and the write position is reset to the beginning of the buffer.

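        // Step forward by the pitch ratio; wrap at the end of the buffer
        this.readPos += this.pitchRatio;
        if (this.readPos >= this.bufferSize) {
          this.readPos -= this.bufferSize;
          this.writePos = 0; // restart writing from the beginning
        }
      }
      return true;
    }
}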
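
As a quick worked example, the sketch below counts how many output samples the read position produces before it wraps, using the bufferSize and pitchRatio from the constructor. The 48,000 Hz sample rate is an assumption for the timing estimate; your audio may run at a different rate.

```javascript
const bufferSize = 11025;
const pitchRatio = 1.5;

// Advance the read position the same way the processor does and
// count the output samples produced before the first wrap.
let readPos = 0;
let outputSamples = 0;
do {
  readPos += pitchRatio;
  outputSamples++;
} while (readPos < bufferSize);

console.log(outputSamples); // 7350

const assumedSampleRate = 48000; // assumption, not read from the SDK
console.log((outputSamples * 1000) / assumedSampleRate); // 153.125 (milliseconds)
```

A larger pitchRatio moves through the buffer faster, so the wrap happens after fewer output samples.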

Processors can also define onInit and onUninit functions, which are called when the processor initializes or shuts down. You can use these to allocate and release resources.

Now that we've defined the processor class, we need to register it with the SDK. This is done by calling the registerProcessor function with the processor name and the processor class:

class PitchShiftProcessor extends AudioProcessor {
    // ...
}
registerProcessor("pitch-shift-audio-processor", PitchShiftProcessor);

Step 2: Add the media processor to the Video SDK

Next, we load the audio processor script in the Video SDK. In main.ts, we first check whether the browser supports audio processors using the isSupportAudioProcessor method on the mediaStream:

const startCall = async () => {
  // ...
  const client = ZoomVideo.createClient();
  const mediaStream = client.getMediaStream();
  if (!mediaStream.isSupportAudioProcessor()) {
    alert("Your browser does not support audio processors");
    return; // stop here, the calls below need audio processor support
  }

We can then create a processor instance by calling the createProcessor method on the mediaStream:

const processor = await mediaStream.createProcessor({
    name: "pitch-shift-audio-processor",
    type: "audio",
    url: window.location.origin + "/pitch-processor.js",
});

We pass in the processor's name, which is the same name we used in the registerProcessor call, along with its type. The url specifies the script location; the script must be served from the same origin or with the appropriate CORS headers.

We can add the processor to the audio stream pipeline using the addProcessor method. You can perform this operation before or after starting the audio.

await mediaStream.addProcessor(processor);
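
If you'd like to keep these calls together, one option is a small helper like the sketch below. attachPitchProcessor and createFakeMediaStream are hypothetical names of our own. The helper only uses the three mediaStream methods shown above, and the fake stream is a stand-in that lets you try the flow outside a real session.

```javascript
// Hypothetical helper: check support, create the processor, add it.
// Returns null when audio processors are not supported.
async function attachPitchProcessor(mediaStream, origin) {
  if (!mediaStream.isSupportAudioProcessor()) {
    return null; // the caller decides how to tell the user
  }
  const processor = await mediaStream.createProcessor({
    name: "pitch-shift-audio-processor",
    type: "audio",
    url: new URL("/pitch-processor.js", origin).href,
  });
  await mediaStream.addProcessor(processor);
  return processor;
}

// Stand-in for the SDK's media stream, for local experiments only.
// It records which methods were called and with what.
function createFakeMediaStream(supported) {
  const calls = [];
  return {
    calls,
    isSupportAudioProcessor: () => supported,
    createProcessor: async (config) => {
      calls.push(["createProcessor", config.url]);
      return { name: config.name };
    },
    addProcessor: async (processor) => {
      calls.push(["addProcessor", processor.name]);
    },
  };
}

const fakeStream = createFakeMediaStream(true);
attachPitchProcessor(fakeStream, "https://example.com").then((processor) => {
  console.log(processor.name); // "pitch-shift-audio-processor"
  console.log(fakeStream.calls.map((call) => call[0])); // ["createProcessor", "addProcessor"]
});
```

In the app, you would call attachPitchProcessor(mediaStream, window.location.origin) inside startCall and show a message when it returns null.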

This raises the pitch of the user's voice in real time. Because the audio is modified before it is sent, all remote participants hear the change as well. That's all the code you need to get basic pitch shifting working.

Next steps

Audio processors are a powerful tool for modifying audio. You could build processors for:

  • Voice effects: add reverb, echo, or distortion
  • Voice masking: implement voice anonymization
  • Audio enhancement: noise reduction or audio quality improvement

Conclusion

With just a few lines of code, you can create powerful custom audio processors with Zoom Video SDK. Beyond pitch shifting, you can experiment with voice effects, real-time audio analysis, or even voice synthesis.

To dive deeper, check out our raw-data documentation and explore the sample processor repo for more inspiration.