Hi,
I have a working project (well, almost) running on an ESP32-S3 with 8 MB PSRAM and 320 KB internal RAM.
On Core 0 I’m doing Wi-Fi, HTTP client, OTA, LCD display management, microphone, WebSockets, and ESP-SR (wake-word detection): basically all the management.
On Core 1 I have two tasks: one fills a buffer for audio output and one actually plays the audio.
I figured out that, to play the buffered audio without lagging/interruptions, I need to process the audio and the received stream buffer in internal RAM (not PSRAM).
The ESP32-S3’s resources are not enough. I can’t move most of this to PSRAM because it needs the internal RAM, and there isn’t enough heap.
So everything works, even the audio playback, but with lagging: the PSRAM is too slow for these operations.
In this situation, would you upgrade to the ESP32-P4-WIFI board by Waveshare, or do you see another option?
EDIT: I know I could write the full stream to PSRAM and start playing after it finishes, but that wouldn’t be the real deal; I want responsiveness.


At most I need a few MB for reading a WebSocket response into internal RAM and feeding the audio loop that runs on Core 1. The issue is that the network delivers around 25 KB/s while audio playback consumes 48 KB/s, so I get buffer underruns. I can’t lower the sample rate (I tried). I’d switch to another codec like Opus, but the Deepgram API only supports PCM (linear) at 24 kHz. I tried setting other output formats, but it isn’t working. Technically I could decode Opus.
The flow is this: TTS -> WebSocket -> PSRAM (slow) -> I2S (DMA, 8x1024) -> DAC -> speaker.
DRAM free: about 50 KB. PSRAM: plenty (6-7 MB).
You probably only want to use the WebSocket as a control channel, and open a second socket to receive the audio as a passthrough? Pretty sure that’s how Sonos et al. do it. More lightweight, and perhaps you wouldn’t have to worry about the underruns you’re dealing with now.
The connection is bi-directional with different states.